Daemon is not updating Job States

NicolasBergmann · June 24, 2024, 8:33am

Hi!
I’m using AiiDA v2.4.0 and have a bunch of jobs that are supposed to run on a remote computer. For the past 2 days, the job statuses have remained unchanged. Most jobs are shown with the scheduler state UNKNOWN, although almost all of them have in fact already been submitted to the remote computer and have finished successfully. Other jobs are considered to be in the “upload” phase, but no remote directory is being created, and nothing is changing.

So far I have tried the following:

1. Restart the daemon (with verdi daemon restart, stop && start, incr & decr, etc.)
In the log, this initially leads to INFO messages such as

aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to upload CalcJob<164477>
aiida.engine.processes.calcjobs.manager: [INFO] waiting for transport

but after about a minute of these log messages, the output reduces to only:

06/24/2024 08:23:23 AM <887392> aiida.querybuilder: [DEBUG] Adding projection of authinfo_1: *
06/24/2024 08:23:23 AM <887392> aiida.querybuilder: [DEBUG] projections have become: [{'*': {}}]
06/24/2024 08:23:23 AM <887392> aiida.orm.querybuilder: [DEBUG] projections data: {'authinfo_1': [{'*': {}}]}
06/24/2024 08:23:23 AM <887392> aiida.orm.querybuilder: [DEBUG] projection for authinfo_1: [{'*': {}}]

This is then the only message I’m getting from the daemon, and it repeats approximately every 10 seconds.
2. verdi devel rabbitmq analyze --fix
While this did restart an ancient calculation, it had no effect on the current jobs.

Is there anything I can still do/try?
Thanks for all the help!

sphuber · June 24, 2024, 8:49am

Hi @NicolasBergmann , this is indeed weird. My first guess is that there might be a problem with opening a connection to the remote computer. Could you run

verdi computer <computer> test

replacing <computer> with the PK or label of the Computer to which the calculations were submitted.

Also, could you look at the output of verdi process report for any of the stuck calculations? If there are problems with connections, there should be an error message in there.

NicolasBergmann · June 24, 2024, 9:07am

Hey @sphuber
Thanks for the quick response and suggestions!
The verdi computer test showed no problems, just a suggestion not to use the login shell, as it is slower than the normal shell.
For the “stuck” jobs, the report is:

*** 164306 [SCF: H4Cu96O4, charge=0.00, hkl=100, struc=160546]: CalcJobState.WITHSCHEDULER, scheduler state: (unknown)
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 0 LOG MESSAGES

sphuber · June 24, 2024, 9:31am

Ok, that is really weird. Have you tried submitting a new calculation to the same computer just to see if that still works?

What also would be useful is if you stop the daemon and then run

verdi --verbosity info daemon worker

This will run a daemon runner in the foreground, so it will block, and print log information on what it is doing. Could you share the output here?

NicolasBergmann · June 24, 2024, 9:57am

daemon_worker_verbosity_info.txt (28.8 KB)
So I’m running the verbose worker, this is the output. It has been stuck at the “Pool recreating” step for about 10 minutes.

In terms of submitting a new calculation, the new calculations are picked up by the scheduler, but it does not look like remote work directories are being created.

sphuber · June 24, 2024, 10:11am

Ok thanks, that gives some more information. Those last log messages

Info: SELECT db_dbsetting.val 
FROM db_dbsetting 
WHERE db_dbsetting.key = %(key_1)s
Info: [generated in 0.00014s] {'key_1': 'repository|uuid'}
Info: ROLLBACK
Info: Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
Info: Pool recreating

are coming from sqlalchemy. It is fetching the repository uuid setting, which is done almost exclusively when validating the connection to the database. For some reason there is then a ROLLBACK, which seems to indicate an error in that query, but no further information is logged. It then says the pool is disposed and is being recreated, which supports the idea that the connection to the database failed somehow and cannot be reestablished.

This surprises me though, since you can get verdi process list etc. to run, which would also require connecting to the database. Just to make sure, verdi status shows no problem with the database?

What version of sqlalchemy is installed? Maybe run pip freeze | grep -i alchemy. Perhaps an incompatible version somehow got installed into your environment. Reinstalling aiida-core or, better still, updating to v2.5 might fix the problem.
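If pip is awkward to use in that environment, the same check can be sketched in Python with the standard library. This is only a minimal illustration: the helper name installed_version is made up for this example, and the distribution names "sqlalchemy" and "aiida-core" are assumed to match what is installed. Run it with the same interpreter the daemon uses.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration only; it is not part of AiiDA.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them if your environment differs.
for name in ("sqlalchemy", "aiida-core"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing the printed sqlalchemy version against the range required by the installed aiida-core release should show whether the two are compatible.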

NicolasBergmann · June 24, 2024, 10:21am

verdi status gives this response:

 ✔ version:     AiiDA v2.4.0
 ✔ config:      /work/home/nbergmann/AIIDA/envs/wfl0/.aiida
 ✔ profile:     wfl0
 ✔ storage:     Storage for 'wfl0' [open] @ postgresql://aiida_qs_nbergmann_e6a6e5adddc4e78e7791f45b6b4eac88:***@localhost:5432/wfl0_nbergmann_e6a6e5adddc4e78e7791f45b6b4eac88 / DiskObjectStoreRepository: 7bd30b31dd8f4943a3063790d2db318c | /work/home/nbergmann/AIIDA/envs/wfl0/.aiida/repository/wfl0/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.8.2 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ⏺ daemon:      The daemon is not running.

As far as I can see there is no error.

The sqlalchemy version is 1.4.48.

There has been an update though in the job states! So it seems like something is happening…
