SQLAlchemy Error After an Abnormal Restart

Hi all,
After an abnormal computer restart, I found that I cannot properly run any workflows, although the existing node data remains accessible.
Submitted workflows have a high probability of getting stuck. I’ve uploaded two error logs as attachments.
daemon_error2.log.txt (18.3 KB)
daemon_error.log.txt (41.6 KB)

Running verdi daemon stop and verdi process repair has a small chance of restoring workflow operation. In most cases, however, the workflow does not respond to verdi process kill commands (it gives Error: Process<14726> is unreachable), and I have to delete it using verdi node delete.
I suspect there might be database issues, but cannot pinpoint the exact source. Wondering if others have encountered similar problems or can provide reliable solutions (such as how to rebuild databases).

By the way, I use a PostgreSQL database, and here is my verdi status:

:check_mark: version: AiiDA v2.6.3
:check_mark: config: /Users/yuhao/.aiida
:check_mark: profile: presto
:check_mark: storage: Storage for ‘presto’ [open] @ postgresql://aiida-presto:***@localhost:5432/aiida-presto / DiskObjectStoreRepository: 726fe431f0fa4dee9997b0d835eec29e | /Users/yuhao/.aiida/repository/presto/container
Warning: RabbitMQ v4.0.7 is not supported and will cause unexpected problems!
Warning: It can cause long-running workflows to crash and jobs to be submitted multiple times.
Warning: See RabbitMQ version to use · aiidateam/aiida-core Wiki · GitHub for details.
:check_mark: broker: RabbitMQ v4.0.7 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
/opt/homebrew/Caskroom/miniforge/base/envs/aiida-develop/lib/python3.11/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
“cipher”: algorithms.TripleDES,
/opt/homebrew/Caskroom/miniforge/base/envs/aiida-develop/lib/python3.11/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
“class”: algorithms.TripleDES,
:check_mark: daemon: Daemon is running with PID 98342

Hi @YuhaoJiang :waving_hand:

I think this refers to a complicated bug in verdi process kill.

Fortunately, we fixed the bug in our main branch via this PR; it will soon be released with 2.7.
You can already try out the pre-release by installing aiida-core 2.7.0rc1 with pip.

If verdi process kill is still unsuccessful after trying this fix, try verdi process kill -F to kill the process forcefully. That stops AiiDA from “worrying” about the job status on your remote HPC: it simply considers the job failed, without trying to kill it on the remote machine anymore.

Thanks Ali,

I’m not sure these are the same issue, though. In my case, verdi process kill immediately gives

Error: Process<14726> is unreachable.

and I cannot use verdi process repair to fix it; the error messages are attached.
The key issue is that my workflow gets stuck in the Created state, and it seems that the workflow cannot be saved to the database.

verdi process list (aiida-develop)
PK     Created  Process label     :recycling_symbol:  Process State          Process status
14726  8h ago   PwBandsWorkChain                      ⏵ Running
14733  8h ago   PwBaseWorkChain                       :stop_button: Created

For the process that is stuck in the Created state, try to run this:

from aiida.manage import get_manager
from aiida import load_profile

load_profile()

pk = 14733 # process in `Created` state
process_controller = get_manager().get_process_controller()
process_controller.continue_process(pk)

It still doesn’t work :slightly_frowning_face:

In [3]:
…: from aiida.manage import get_manager
…: from aiida import load_profile
…:
…: load_profile()
…:
…: pk = 14733 # process in Created state
…: process_controller = get_manager().get_process_controller()
…: process_controller.continue_process(pk)
Out[3]: <Future at 0x16ab4b490 state=pending>

But nothing happened in the daemon log, and the process is still stuck at Created.
After restarting the daemon and running verdi process repair, the daemon log still reports a similar error message.
daemon_error3.log.txt (18.9 KB)

Almost every workflow I create gets stuck in the same way. I wonder if something went wrong with my database?

From your logs, there are indeed a lot of database errors. I think many are only indirectly related to the main problem, but there are a few InternalErrors, so I do think your DB got corrupted.

Hopefully, it’s still recoverable (have you been making backups until now? AiiDA now has a built-in backup command).

In particular from e.g. this error:

(psycopg2.errors.IndexCorrupted) table tid from new index tuple (1135,10) overlaps with invalid duplicate tuple at offset 18 of block 9 in index "ix_db_dbnode_db_dbnode_process_type"

It seems the index got corrupted (if it’s just the index, that’s not too bad - let’s hope the data is still there).

I recommend stopping any daemon and any connected session (a Jupyter notebook that has loaded a profile, verdi shell, …).

Then you can try to connect to your database (on Ubuntu: sudo su - postgres, and then run the psql command, e.g. psql template1). Be careful with the commands, as you risk deleting everything!

You can try e.g. the command below with the correct table name.
https://stackoverflow.com/questions/76251129/how-to-fix-postgresql-errors-telling-tid-from-new-index-tuple-overlaps-with-inva
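As a rough illustration of the reindexing step, here is a minimal Python sketch that builds a REINDEX statement for the index named in the error and, optionally, runs it through psycopg2. The index name comes from the error message above, and the host, port, database and user are taken from the verdi status output. Treating db_dbnode as the table that index belongs to, and reading the password from an AIIDA_DB_PASSWORD environment variable, are assumptions made for this example. Run it only with the daemon stopped and a backup in place.

```python
import os
import re

# Index named in the error message; db_dbnode is assumed to be its table.
CORRUPTED_INDEX = "ix_db_dbnode_db_dbnode_process_type"


def reindex_statement(name: str, kind: str = "INDEX") -> str:
    """Build a REINDEX statement for a single index or a whole table."""
    if kind not in {"INDEX", "TABLE"}:
        raise ValueError(f"unsupported REINDEX target: {kind}")
    # Accept only plain identifiers so nothing unexpected ends up in the SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return f'REINDEX {kind} "{name}";'


def run_reindex(statement: str) -> None:
    """Execute the statement. Connection details here are assumptions."""
    import psycopg2  # imported lazily so the helper above works without it

    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="aiida-presto",
        user="aiida-presto",
        password=os.environ["AIIDA_DB_PASSWORD"],  # hypothetical variable name
    )
    try:
        conn.autocommit = True  # run REINDEX outside an explicit transaction
        with conn.cursor() as cur:
            cur.execute(statement)
    finally:
        conn.close()


if __name__ == "__main__":
    print(reindex_statement(CORRUPTED_INDEX))
    # Uncomment only after stopping the daemon and taking a backup:
    # run_reindex(reindex_statement("db_dbnode", kind="TABLE"))
```

Rebuilding the whole table's indexes (kind="TABLE") is the broader option if more than one index turns out to be affected.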

Another option (I would even start with this) is to dump the DB so that 1. you have a backup and 2. if rebuilding all indexes does not solve the error, you can restore it from the dump. Follow e.g.
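One possible way to take such a dump, sketched in Python: the helper below only assembles a pg_dump command line, using the database name, user, host and port shown in the verdi status output. The output file name is made up for the example. Note that this covers only the PostgreSQL database, not the file repository under /Users/yuhao/.aiida/repository.

```python
import subprocess
from datetime import date


def pg_dump_command(dbname: str, user: str, outfile: str,
                    host: str = "localhost", port: int = 5432) -> list[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",  # compressed archive, restorable with pg_restore
        "--file", outfile,
        dbname,
    ]


if __name__ == "__main__":
    # Hypothetical output file name with today's date.
    outfile = f"aiida-presto-{date.today():%Y%m%d}.dump"
    cmd = pg_dump_command("aiida-presto", "aiida-presto", outfile)
    print(" ".join(cmd))
    # Uncomment to actually run the dump (pg_dump will ask for the password
    # unless it is supplied via PGPASSWORD or a .pgpass file):
    # subprocess.run(cmd, check=True)
```

A dump in the custom format can later be loaded into a fresh database with pg_restore if reindexing does not help.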

While you’re trying the method suggested by @giovannipizzi, it might also be worth a quick attempt to delete the problematic node (PK 14733) and restart the daemon. The logs indicate the daemon encountered an error with that node, and since it’s likely not recoverable anyway, removing it may help.

Thanks Giovanni, Xing and Ali!

After reindexing the database, I ran a few tests and they all finished normally. :smiley:
