CalcJob in `QUEUED` status even though the actual job on HPC is finished

Xing · October 15, 2024, 3:32pm

Problem: the CalcJob processes are in the QUEUED status, but the actual jobs on HPC are finished
System: Ubuntu 20.04
AiiDA version: 2.6.2
Broker: RabbitMQ v3.8.2

Here are my processes:

$ verdi process list 
   PK  Created    Process label      ♻    Process State    Process status
-----  ---------  -----------------  ---  ---------------  --------------------------------------
15840  3D ago     Cp2kBaseWorkChain       ⏵ Waiting        Waiting for child processes: 15846
15845  3D ago     Cp2kBaseWorkChain       ⏵ Waiting        Waiting for child processes: 15852
15846  3D ago     Cp2kCalculation         ⏵ Waiting        Monitoring scheduler: job state QUEUED
15851  3D ago     Cp2kBaseWorkChain       ⏵ Waiting        Waiting for child processes: 15858
15852  3D ago     Cp2kCalculation         ⏵ Waiting        Monitoring scheduler: job state QUEUED
...

I checked the remote job using:

verdi calcjob gotocomputer 15846

The job has finished. In fact, all the jobs on the HPC have finished, but the CalcJob processes still show the QUEUED status.

I tried to pause all processes and found that some of the CalcJob processes are unreachable:

$ verdi process pause --all
Error: Process<15914> is unreachable.
Error: Process<15902> is unreachable.
Error: Process<15982> is unreachable.
Error: Process<15858> is unreachable.
Error: Process<15846> is unreachable.
...
Report: request to pause Process<15864> sent
Report: request to pause Process<15907> sent
Report: request to pause Process<16044> sent
Report: request to pause Process<15852> sent
Report: request to pause Process<15919> sent
Report: request to pause Process<15957> sent
...

I tried to repair them using:

verdi daemon stop
verdi process repair
verdi daemon start
# then play all
verdi process play --all

But the processes are still in the QUEUED status.

Any suggestions on how to get the processes to move on to RUNNING and then FINISHED?

sphuber · October 15, 2024, 6:17pm

Instead of trying to force trigger the update, I would first try to figure out what is going wrong.

My first guess is that there is a (temporary) problem with the transport, i.e. connection problems. If so, the scheduler update would fail and trigger the exponential backoff mechanism. This won’t be immediately visible in the process list output, though. Please check `verdi process report` to see whether it contains exception messages from the update task failing.
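To illustrate what exponential backoff means here, below is a minimal sketch in plain Python. It is not AiiDA's implementation: `retry_with_backoff`, `TransportTaskError`, and `flaky_update` are hypothetical names made up for this example. The defaults of 20 seconds and 5 attempts are assumptions that, as far as I know, correspond to the `transport.task_retry_initial_interval` and `transport.task_maximum_attempts` config options; check `verdi config list` for the actual values on your installation.

```python
import asyncio


class TransportTaskError(Exception):
    """Raised when every attempt failed (hypothetical name for this sketch)."""


async def retry_with_backoff(task, initial_interval=20.0, max_attempts=5, sleep=asyncio.sleep):
    """Await task() until it succeeds, doubling the wait after each failure.

    The default values are assumptions for illustration, not guaranteed AiiDA defaults.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await task()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: give up and let the caller handle it
                # (in AiiDA, this is the point where the process gets paused).
                raise TransportTaskError(f"task failed {max_attempts} times") from exc
            await sleep(interval)
            interval *= 2  # wait twice as long before the next attempt


async def demo():
    waits = []
    calls = {"n": 0}

    async def fake_sleep(seconds):
        # Record the wait instead of actually sleeping, so the demo runs instantly.
        waits.append(seconds)

    async def flaky_update():
        # Simulates a scheduler update whose connection fails twice, then works.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("SSH connection refused")
        return "job state DONE"

    result = await retry_with_backoff(flaky_update, sleep=fake_sleep)
    return result, waits


result, waits = asyncio.run(demo())
print(result, waits)  # job state DONE [20.0, 40.0]
```

While such a retry loop is waiting between attempts, the process list keeps showing the last known scheduler state (here QUEUED), which is why the failure is not immediately visible there.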

The best way to debug this is to increase the log level, stop the daemon, and then run a worker in a shell to get the output directly:

verdi config set logging.aiida_loglevel INFO
verdi daemon stop
verdi daemon worker

This will then hopefully give useful information on whether the worker actually picks up the jobs and then goes to the update task to refresh the job state.

Xing · October 15, 2024, 7:22pm

Hi @sphuber, thanks for the suggestion! The problem is solved. I increased the log level and then played the process manually using its PK:

verdi process play pk

The log then showed:

Info: scheduled request to update CalcJob<15882>
Info: updating CalcJob<15882> successful
Info: Process<15882>: Broadcasting state change: state_changed.waiting.waiting
Info: scheduled request to retrieve CalcJob<15882>
Info: retrieving CalcJob<15882> successful

And the process finished successfully.

The process report also confirmed that there was a transport issue, because I have to renew my multi-factor authentication daily. Usually the process status shows that the process is paused because the connection failed five times, but this time it didn’t. It’s probably because I restarted the daemon and the status message somehow got lost; I will report an issue if I encounter it again.

system · October 21, 2024, 3:23pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.
