While running AiiDA on a SLURM-based cluster, the HPC system had quite a few issues and many jobs were lost, i.e. something like

```
scontrol show jobid 123455
```

returns `slurm_load_jobs error: Invalid job id specified`, but the ID is perfectly valid and is written in the stdout of a job that actually completed successfully.
On the AiiDA side, all calculations are reported as RUNNING.
Generally I would just kill everything and resubmit, but since it’s quite a lot of simulations that were all successful, I was wondering if there’s a way to force the workers to collect the results and move forward.
Hi! What does `squeue` report? Is job 123455 still in the queue? If `squeue` does not report it and does not return an error, AiiDA should assume the job is done and retrieve it, I think. Or is the job maybe still there in some weird state? If so, you can try to kill those jobs with `scancel`, and AiiDA will then proceed to retrieve them; if they actually finished, I think there should be no problem. You can try with one job first to see if it works, if that’s what’s happening.
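The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic (it is not AiiDA's actual scheduler-plugin code), and `job_is_finished` is a hypothetical helper name:

```python
# Sketch of the rule above: if squeue exits cleanly and the job no longer
# appears in its output, the scheduler has forgotten it, so treat it as done.

def job_is_finished(squeue_output: str, exit_code: int, job_id: str) -> bool:
    """Return True if the scheduler no longer lists job_id.

    squeue_output: stdout of something like `squeue --jobs=<job_id>`
                   (header line followed by one row per queued job).
    exit_code:     the command's exit status; nonzero means squeue itself
                   failed, in which case we cannot decide anything.
    """
    if exit_code != 0:
        raise RuntimeError("squeue failed; cannot decide the job state")
    # Skip the header line and collect the job IDs still in the queue.
    listed = {line.split()[0] for line in squeue_output.splitlines()[1:] if line.strip()}
    return job_id not in listed

# Example: job 123455 is absent from the output, so it is considered done.
sample = "JOBID PARTITION NAME USER ST TIME NODES\n999999 debug test alice R 1:00 1\n"
print(job_is_finished(sample, 0, "123455"))  # True
```

The important edge case is the one the post warns about: a stuck job in a weird state still shows up in `squeue`, so this check would keep returning `False` until the job is removed with `scancel`.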
Hi, that’s a bit strange.
Interesting, the message `slurm_load_jobs error: Connection reset by peer` probably happened during `squeue`. However, you are saying that `squeue` now works fine (you don’t get any error)?
If AiiDA is not picking it up, maybe the daemon lost the corresponding task?
You might try this first. Not sure if it will help, but let’s try.
Also good to confirm that squeue does not give any error.
One more thing: try to run the `squeue` command also with the option `--jobs=6059487,6059487`, which is what AiiDA does (yes, the job ID twice), see the comment here:
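As I understand it, the ID is passed twice so that `squeue` sees a comma-separated *list* of jobs: given a single unknown ID it errors out with `Invalid job id specified`, while given a list it silently skips IDs it does not know. A minimal sketch of how such a command line could be built (an illustration with a hypothetical helper, not AiiDA's actual code):

```python
# Build a squeue command for a set of job IDs, duplicating a lone ID so that
# squeue treats the argument as a list and does not error on unknown jobs.

def build_squeue_command(job_ids):
    """Return the squeue argv for the given job IDs (hypothetical helper)."""
    ids = list(job_ids)
    if len(ids) == 1:
        ids = ids * 2  # duplicate a single ID to force list semantics
    return ["squeue", "--jobs={}".format(",".join(ids))]

print(build_squeue_command(["6059487"]))
# ['squeue', '--jobs=6059487,6059487']
```

Running the duplicated-ID variant by hand is a good way to reproduce exactly what the daemon sees when it polls the scheduler.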