Slurm lost many jobs, can AiiDA deal with that?

Dear All,

while I was running AiiDA on a Slurm-based cluster, the HPC system had quite a few issues and many jobs were lost, i.e. something like

scontrol show jobid 123455

returns slurm_load_jobs error: Invalid job id specified, but the ID is perfectly valid and is written in the stdout of a job that actually completed successfully.
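If I understand Slurm correctly, scontrol only knows about jobs still held by the controller, while sacct queries the accounting database, so a check like this (with the example job ID) can confirm that such a "lost" job actually completed:

$ sacct -j 123455 --format=JobID,State,ExitCode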

On the AiiDA side, all calculations are reported as RUNNING.

Generally I would just kill everything and resubmit, but since it’s quite a lot of simulations that were all successful, I was wondering if there’s a way to force the workers to collect the results and move forward.

Hacky solutions are welcome.

Thanks!

Hi! What does squeue report? Is the job 123455 still in the queue? If squeue does not report it and does not return an error, AiiDA should assume the job is done and retrieve it, I think. Or is the job maybe still there in some weird state? If so, you can try to kill those jobs using scancel, and AiiDA will then proceed to retrieve them; if they actually finished, I think there should be no problem. You can try with one job first, to see if it works, if that’s what’s happening.
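Roughly like this, as a sketch using the job ID from your example, and on a single job first:

$ squeue --jobs=123455          # is the job still listed by the scheduler?
$ scancel 123455                # only if it is still there in some weird state
$ verdi process list -a -p 1    # then check whether AiiDA retrieves the calculation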

Here’s an example:

893835  5D ago     ElkCalculation                 ⏵ Waiting        Monitoring scheduler: job state RUNNING

I use verdi calcjob gotocomputer and land here:

$ cat _scheduler-stdout.txt
================================================================================
JobID = 6059487
Partition = standard96:shared, Nodelist = gcn2022
================================================================================
============ Job Information ===================================================
Submitted: 2025-02-18T08:02:20
Started: 2025-02-18T08:02:24
Ended: 2025-02-19T09:06:13
Elapsed: 1504 min, Limit: 2880 min, Difference: 1376 min
CPUs: 16, Nodes: 1
Estimated Consumption: 200.53 core-hours
================================================================================

My squeue -u $USER output is empty; this is the relevant part of the job script:

$ cat _aiidasubmit.sh
#!/bin/bash
#SBATCH --no-requeue
#SBATCH --job-name="aiida-893835"

and finally

$ sacct -j 6059487 --format="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
JobID                               JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ------------------------------ ---------- ---------- ---------- ---------- --------
6059487                        aiida-893835 standard9+   bep00079         16  COMPLETED      0:0
6059487.bat+                          batch              bep00079         16  COMPLETED      0:0
6059487.ext+                         extern              bep00079         16  COMPLETED      0:0

Edit: I forgot to post the verdi process report output, which is full of errors like this:

|     raise SchedulerError(
 | aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
 | stdout=''
 | stderr='Loading software stack: nhr-lmod
 | Found project directory, setting $PROJECT_DIR to '/projects/extern/nhr/nhr_be/bep00079/dir.project'
 | Found scratch directory, setting $WORK to '/mnt/lustre-grete/usr/u14590'
 | Found scratch directory, setting $TMPDIR to '/mnt/lustre-grete/tmp/u14590'
 |  __          ________ _      _____ ____  __  __ ______   _______ ____
 |  \ \        / /  ____| |    / ____/ __ \|  \/  |  ____| |__   __/ __ \
 |   \ \  /\  / /| |__  | |   | |   | |  | | \  / | |__       | | | |  | |
 |    \ \/  \/ / |  __| | |   | |   | |  | | |\/| |  __|      | | | |  | |
 |     \  /\  /  | |____| |___| |___| |__| | |  | | |____     | | | |__| |
 |   _  \/ _\/  _|______|______\_____\____/|_|  |_|______|____|_|__\____/
 |  | \ | | |  | |  __ \     ____    / ____\ \        / /  __ \ / ____|
 |  |  \| | |__| | |__) |   / __ \  | |  __ \ \  /\  / /| |  | | |  __
 |  | . ` |  __  |  _  /   / / _` | | | |_ | \ \/  \/ / | |  | | | |_ |
 |  | |\  | |  | | | \ \  | | (_| | | |__| |  \  /\  /  | |__| | |__| |
 |  |_| \_|_|  |_|_|  \_\  \ \__,_|  \_____|   \/  \/   |_____/ \_____|
 |                          \____/
 |
 |  Documentation  https://docs.hpc.gwdg.de   Support nhr-support@gwdg.de
 | slurm_load_jobs error: Connection reset by peer'

Hi, that’s a bit strange.
Interesting, the message slurm_load_jobs error: Connection reset by peer probably happened during squeue. However, you are saying that now squeue works fine? (You don’t get any error?)

If AiiDA is not picking it up, maybe the daemon lost the corresponding task?

You might try this first. Not sure if it will help, but it’s worth a try.
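For example, just as a sketch with standard verdi commands (the PK is the one from your example):

$ verdi daemon status           # are the workers actually running?
$ verdi daemon restart          # restart the daemon workers
$ verdi process report 893835   # then see whether the calculation moves forward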

Also good to confirm that squeue does not give any error.

One more thing: try to run the squeue command also with the option --jobs=6059487,6059487, which is what AiiDA does (yes, the job ID twice); see the comment here:
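In any case, to reproduce a bit more closely what AiiDA sees (assuming your computer is configured with the SSH transport, AiiDA runs the scheduler commands over a non-interactive SSH connection), you could try something along these lines, with the hostname as a placeholder:

$ ssh your-cluster 'squeue --jobs=6059487,6059487'
$ echo $?    # in your report, squeue returning exit code 1 is what raised the SchedulerError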