Daemon is not updating Job States

Hi!
I’m using AiiDA v2.4.0 and have a bunch of jobs that are supposed to run on a remote computer. For two days now, the job statuses have remained unchanged. Most jobs are listed with the scheduler job state UNKNOWN, although almost all of them have in fact already been submitted to the remote computer and finished successfully. Other jobs are reported to be in the “upload” phase, but no remote directory is being created, and nothing is changing.

Until now I have tried the following:

  1. Restart the daemon (with verdi daemon restart, stop && start, incr & decr, etc.)
    In the log, this initially leads to info messages such as
aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to upload CalcJob<164477>
aiida.engine.processes.calcjobs.manager: [INFO] waiting for transport

but after about a minute of these log messages, the log only contains:

06/24/2024 08:23:23 AM <887392> aiida.querybuilder: [DEBUG] Adding projection of authinfo_1: *
06/24/2024 08:23:23 AM <887392> aiida.querybuilder: [DEBUG] projections have become: [{'*': {}}]
06/24/2024 08:23:23 AM <887392> aiida.orm.querybuilder: [DEBUG] projections data: {'authinfo_1': [{'*': {}}]}
06/24/2024 08:23:23 AM <887392> aiida.orm.querybuilder: [DEBUG] projection for authinfo_1: [{'*': {}}]

    This is then the only message I’m getting from the daemon, and it repeats approximately every 10 seconds.
  2. verdi devel rabbitmq analyze --fix
    While this did restart an ancient calculation, it had no effect on the current jobs.

Is there anything I can still do/try?
Thanks for all the help!

Hi @NicolasBergmann, this is indeed weird. My first guess is that there might be a problem with opening a connection to the remote computer. Could you run

verdi computer <computer> test

replacing <computer> with the PK or label of the Computer to which the calculations were submitted.

Also, could you look at the output of verdi process report for any of the stuck calculations? If there are connection problems, there should be an error message in there.

Hey @sphuber
Thanks for the quick response and suggestions!
So the verdi computer test showed no problems, just a suggestion to not use the login shell as it is slower than the normal shell.
For the “stuck” jobs, the report is:

*** 164306 [SCF: H4Cu96O4, charge=0.00, hkl=100, struc=160546]: CalcJobState.WITHSCHEDULER, scheduler state: (unknown)
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 0 LOG MESSAGES
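
When many calculations are stuck, it can help to pull the state out of each report header instead of reading them one by one. The following is a minimal sketch that parses the header line shown above. The pattern is inferred from that single example, so the layout is an assumption and may differ between AiiDA versions; parse_report_header is a hypothetical helper, not part of AiiDA.

```python
import re

# Assumed layout, inferred from the one header line quoted in this thread:
#   *** <pk> [<label>]: <calcjob state>, scheduler state: <value>
HEADER = re.compile(
    r"^\*\*\* (?P<pk>\d+)(?: \[(?P<label>.*)\])?: "
    r"(?P<state>[\w.]+)(?:, scheduler state: (?P<scheduler>.+))?$"
)


def parse_report_header(line):
    """Return the fields of a report header line as a dict, or None if it does not match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pk"] = int(fields["pk"])
    return fields


header = (
    "*** 164306 [SCF: H4Cu96O4, charge=0.00, hkl=100, struc=160546]: "
    "CalcJobState.WITHSCHEDULER, scheduler state: (unknown)"
)
parsed = parse_report_header(header)
print(parsed["pk"], parsed["state"], parsed["scheduler"])
```

Lines such as "*** Scheduler output: N/A" do not start with a PK and are returned as None, so the helper can be run over the full report text line by line.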

Ok, that is really weird. Have you tried submitting a new calculation to the same computer just to see if that still works?

What also would be useful is if you stop the daemon and then run

verdi --verbosity info daemon worker

This will run a daemon runner in the foreground, so it will block, and print log information on what it is doing. Could you share the output here?

daemon_worker_verbosity_info.txt (28.8 KB)
So I’m running the verbose worker, this is the output. It has been stuck at the “Pool recreating” step for about 10 minutes.

In terms of submitting a new calculation, the new calculations are picked up by the scheduler, but it does not look like remote work directories are being created.

Ok thanks, that gives some more information. Those last log messages

Info: SELECT db_dbsetting.val 
FROM db_dbsetting 
WHERE db_dbsetting.key = %(key_1)s
Info: [generated in 0.00014s] {'key_1': 'repository|uuid'}
Info: ROLLBACK
Info: Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
Info: Pool recreating

are coming from sqlalchemy. It is fetching the repository uuid setting, which is done almost exclusively when validating the connection to the database. For some reason there is then a ROLLBACK, which seems to indicate there is an error in that query, but there is no further info. It then says the pool is disposed and is being recreated, which supports the idea that the connection to the database failed somehow and cannot be reestablished.
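
To check how often this pattern occurs in a longer worker log, a small scan like the one below could be used. It is only a sketch: it assumes the "Info: " prefix seen in the excerpt above, and find_pool_resets is a hypothetical helper. It flags the pattern but says nothing about its cause.

```python
def strip_prefix(line):
    """Drop surrounding whitespace and the assumed 'Info: ' prefix, if present."""
    line = line.strip()
    if line.startswith("Info: "):
        return line[len("Info: "):]
    return line


def find_pool_resets(lines):
    """Return 1-based line numbers of each ROLLBACK directly followed by 'Pool disposed'."""
    cleaned = [strip_prefix(line) for line in lines]
    hits = []
    for index in range(len(cleaned) - 1):
        if cleaned[index] == "ROLLBACK" and cleaned[index + 1].startswith("Pool disposed"):
            hits.append(index + 1)
    return hits


sample = [
    "Info: SELECT db_dbsetting.val ",
    "FROM db_dbsetting ",
    "WHERE db_dbsetting.key = %(key_1)s",
    "Info: [generated in 0.00014s] {'key_1': 'repository|uuid'}",
    "Info: ROLLBACK",
    "Info: Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0",
    "Info: Pool recreating",
]
resets = find_pool_resets(sample)
print(resets)
```

For a saved log file, the same function can be called on the result of open(path).read().splitlines().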

This surprises me though, since you can get verdi process list etc. to run, which would also require connecting to the database. Just to make sure, verdi status shows no problem with the database?

What version of sqlalchemy is installed? You can check by running pip freeze | grep -i alchemy. Maybe an incompatible version somehow got installed into your environment. Reinstalling aiida-core or, better yet, updating to v2.5 might fix the problem.
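
If pip is not on the path of the environment the daemon uses, roughly the same check can be done from Python with the standard library. This is a sketch: installed_versions is a hypothetical helper name, and it should be run with the same interpreter that runs verdi so that it inspects the right environment.

```python
from importlib import metadata


def installed_versions(fragment):
    """Map distribution name to version for installed packages whose name contains fragment."""
    fragment = fragment.lower()
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken metadata that report no name.
        if name and fragment in name.lower():
            found[name] = dist.version
    return found


# Rough equivalent of `pip freeze | grep -i alchemy`, plus the AiiDA packages.
print(installed_versions("alchemy"))
print(installed_versions("aiida"))
```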

verdi status gives this response:

 ✔ version:     AiiDA v2.4.0
 ✔ config:      /work/home/nbergmann/AIIDA/envs/wfl0/.aiida
 ✔ profile:     wfl0
 ✔ storage:     Storage for 'wfl0' [open] @ postgresql://aiida_qs_nbergmann_e6a6e5adddc4e78e7791f45b6b4eac88:***@localhost:5432/wfl0_nbergmann_e6a6e5adddc4e78e7791f45b6b4eac88 / DiskObjectStoreRepository: 7bd30b31dd8f4943a3063790d2db318c | /work/home/nbergmann/AIIDA/envs/wfl0/.aiida/repository/wfl0/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.8.2 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ⏺ daemon:      The daemon is not running.

As far as I can see there is no error.

The sqlalchemy version is 1.4.48.

There has been an update though in the job states! So it seems like something is happening…