Slurm lost many jobs, can AiiDA deal with that?

Dear All,

while I was running AiiDA on a Slurm-based cluster, the HPC system had quite a few issues and many jobs were lost, i.e. something like

scontrol show jobid 123455

returns slurm_load_jobs error: Invalid job id specified, but the ID is perfectly valid and is written in the stdout of a job that actually completed successfully.
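If I understand Slurm correctly, scontrol only knows about jobs still held by the controller, while sacct queries the accounting database, so a check like this (with the example job ID) can confirm that such a "lost" job actually completed:

$ sacct -j 123455 --format=JobID,State,ExitCode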

On the AiiDA side, all calculations are reported as RUNNING.

Generally I would just kill everything and resubmit, but since it’s quite a lot of simulations that were all successful, I was wondering if there’s a way to force the workers to collect the results and move forward.

Hacky solutions are welcome.

Thanks!

Hi! What does squeue report? Is the job 123455 still in the queue? If squeue does not report it and does not return an error, AiiDA should assume the job is done and retrieve it, I think. Or is the job maybe still there in some weird state? If so, you can try to kill those jobs using scancel, and AiiDA will then proceed to retrieve them; if they actually finished, I think there should be no problem. You can try with one job first, to see if it works, if that’s what’s happening.
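Roughly like this, as a sketch using the job ID from your example, and on a single job first:

$ squeue --jobs=123455          # is the job still listed by the scheduler?
$ scancel 123455                # only if it is still there in some weird state
$ verdi process list -a -p 1    # then check whether AiiDA retrieves the calculation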

Here’s an example:

893835  5D ago     ElkCalculation                 ⏵ Waiting        Monitoring scheduler: job state RUNNING

I use verdi calcjob gotocomputer and land here:

$ cat _scheduler-stdout.txt
================================================================================
JobID = 6059487
Partition = standard96:shared, Nodelist = gcn2022
================================================================================
============ Job Information ===================================================
Submitted: 2025-02-18T08:02:20
Started: 2025-02-18T08:02:24
Ended: 2025-02-19T09:06:13
Elapsed: 1504 min, Limit: 2880 min, Difference: 1376 min
CPUs: 16, Nodes: 1
Estimated Consumption: 200.53 core-hours
================================================================================

My squeue -u $USER output is empty; this is the relevant part of the job script:

$ cat _aiidasubmit.sh
#!/bin/bash
#SBATCH --no-requeue
#SBATCH --job-name="aiida-893835"

and finally

$ sacct -j 6059487 --format="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
JobID                               JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ------------------------------ ---------- ---------- ---------- ---------- --------
6059487                        aiida-893835 standard9+   bep00079         16  COMPLETED      0:0
6059487.bat+                          batch              bep00079         16  COMPLETED      0:0
6059487.ext+                         extern              bep00079         16  COMPLETED      0:0

Edit: I forgot to post the verdi process report output, which is full of errors like this:

|     raise SchedulerError(
 | aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
 | stdout=''
 | stderr='Loading software stack: nhr-lmod
 | Found project directory, setting $PROJECT_DIR to '/projects/extern/nhr/nhr_be/bep00079/dir.project'
 | Found scratch directory, setting $WORK to '/mnt/lustre-grete/usr/u14590'
 | Found scratch directory, setting $TMPDIR to '/mnt/lustre-grete/tmp/u14590'
 |  __          ________ _      _____ ____  __  __ ______   _______ ____
 |  \ \        / /  ____| |    / ____/ __ \|  \/  |  ____| |__   __/ __ \
 |   \ \  /\  / /| |__  | |   | |   | |  | | \  / | |__       | | | |  | |
 |    \ \/  \/ / |  __| | |   | |   | |  | | |\/| |  __|      | | | |  | |
 |     \  /\  /  | |____| |___| |___| |__| | |  | | |____     | | | |__| |
 |   _  \/ _\/  _|______|______\_____\____/|_|  |_|______|____|_|__\____/
 |  | \ | | |  | |  __ \     ____    / ____\ \        / /  __ \ / ____|
 |  |  \| | |__| | |__) |   / __ \  | |  __ \ \  /\  / /| |  | | |  __
 |  | . ` |  __  |  _  /   / / _` | | | |_ | \ \/  \/ / | |  | | | |_ |
 |  | |\  | |  | | | \ \  | | (_| | | |__| |  \  /\  /  | |__| | |__| |
 |  |_| \_|_|  |_|_|  \_\  \ \__,_|  \_____|   \/  \/   |_____/ \_____|
 |                          \____/
 |
 |  Documentation  https://docs.hpc.gwdg.de   Support nhr-support@gwdg.de
 | slurm_load_jobs error: Connection reset by peer'

Hi, that’s a bit strange.
Interesting, the message slurm_load_jobs error: Connection reset by peer probably happened during squeue. However, you are saying that now squeue works fine? (You don’t get any error?)

If AiiDA is not picking it up, maybe the daemon lost the corresponding task?

You might try this first. Not sure if it will help, but it’s worth a try.
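For example, just as a sketch with standard verdi commands (the PK is the one from your example):

$ verdi daemon status           # are the workers actually running?
$ verdi daemon restart          # restart the daemon workers
$ verdi process report 893835   # then see whether the calculation moves forward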

Also good to confirm that squeue does not give any error.

One more thing: try to run the squeue command also with the option --jobs=6059487,6059487, which is what AiiDA does (yes, the job ID twice); see the comment here:
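In any case, to reproduce a bit more closely what AiiDA sees (assuming your computer is configured with the SSH transport, AiiDA runs the scheduler commands over a non-interactive SSH connection), you could try something along these lines, with the hostname as a placeholder:

$ ssh your-cluster 'squeue --jobs=6059487,6059487'
$ echo $?    # in your report, squeue returning exit code 1 is what raised the SchedulerError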