Handling SLURM Job Failures in AiiDA: Detection and Automatic Restart of Stalled Calculations

Dear Developers,

I have a question, or rather a doubt, regarding a problem I am facing.

Problem: When submitting a calculation in AiiDA, the node is marked as RUNNING. However, due to issues such as failures in the computing resources or the SLURM workload manager, there are instances where the job appears to be running, but the underlying software is not actually executing. This can happen, for example, when SLURM encounters the error:

srun: Job 55039674 step creation still disabled, retrying (Requested nodes are busy)

The job may then never actually start, and the enclosing work chain can eventually finish with an error.

In such cases, the calculation remains in the RUNNING state in AiiDA, but no progress is made because the job is stalled. Additionally, the corresponding aiida.out file may be empty, indicating that the actual computation has not started.

Proposed Solutions:

  1. Daemon-Level Check:
  • The AiiDA daemon could periodically monitor running calculations to detect this scenario. If the daemon detects that a calculation is stalled (e.g., by checking the aiida.out file for content or other indicators of progress), it could automatically stop the calculation and re-launch it. (Perhaps this is not ideal, since the daemon only asks SLURM whether the job is running or not, right?)
  2. Plugin-Level Handling:
  • Alternatively, the logic to handle such errors could be implemented at the plugin level. When retrieving the _scheduler-stderr.txt file, the plugin could check for specific SLURM error messages (such as the one above) and verify whether the aiida.out file is empty. If these conditions are met, the plugin could initiate a restart of the calculation (a rough sketch of such a check follows below).
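
For concreteness, here is a rough sketch of what such a plugin-level check could look like in a parser. This is only an illustration, assuming AiiDA 2.x; the parser class, the exit code name `ERROR_SCHEDULER_JOB_STALLED`, and the `aiida.out` output filename are placeholders that would have to exist in the actual plugin:

```python
from aiida.parsers import Parser

SLURM_STALL_MESSAGE = 'step creation still disabled'


class MyParser(Parser):
    """Hypothetical parser that flags jobs that never actually started."""

    def parse(self, **kwargs):
        # The scheduler stderr (_scheduler-stderr.txt) is retrieved by default
        scheduler_stderr = self.node.get_scheduler_stderr() or ''

        try:
            stdout = self.retrieved.base.repository.get_object_content('aiida.out')
        except FileNotFoundError:
            stdout = ''

        if SLURM_STALL_MESSAGE in scheduler_stderr and not stdout.strip():
            # This exit code would have to be defined on the CalcJob spec of the plugin
            return self.exit_codes.ERROR_SCHEDULER_JOB_STALLED

        # ... regular parsing of the outputs would continue here ...
```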

What is your opinion?

Hi @AndresOrtegaGuerrero

Just a few small thoughts that came to mind. I’ve also encountered such issues recently. In my case, it mostly happened when using HyperQueue and multiple SLURM steps were created within a longer-running job.

Concerning your proposed solutions: personally, I’m not sure whether this should be done by the daemon by default. The performance probably wouldn’t be affected significantly, but it would still be some additional load. Moreover, I’ve never encountered these issues except in the HyperQueue scenario, and certainly not on all HPC systems. So, assuming this might be a problem that is not relevant for every user, one simple solution would be to add a monitor to your CalcJobs (see the sketch below). This would implement exactly the solution you described in your first point, while keeping some flexibility in the sense that it is not added to the default tasks of the daemon.
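
As a rough, untested illustration (assuming a recent AiiDA 2.x where CalcJob monitors are available), attaching a monitor when building the calculation could look roughly like this; the plugin name and the monitor entry point are made up:

```python
from aiida import orm
from aiida.plugins import CalculationFactory

# Hypothetical calculation plugin
MyCalculation = CalculationFactory('my_plugin.calc')

builder = MyCalculation.get_builder()
# ... set code, structure, options, etc. ...

# Attach a monitor that the engine calls periodically while the job is running.
# 'my_plugin.monitors.stalled_job' would have to be registered by the package
# in the 'aiida.calculations.monitors' entry point group.
builder.monitors = {
    'stalled_job': orm.Dict({
        'entry_point': 'my_plugin.monitors.stalled_job',
        'minimum_poll_interval': 600,  # seconds between two checks
    }),
}
```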

Regarding the second point, sure, this might be another solution.

@t-reents makes some great comments. I also wouldn’t integrate this directly into the daemon (the workers, really), because this is a SLURM-specific problem.

It also really sounds like there is a problem with your SLURM installation or with the submission script that is being generated. When SLURM complains that it cannot create another step, it often means the allocated resources are oversubscribed and you are trying to run more tasks than there are available CPUs. So I would really debug that first before trying to patch the symptoms.

Regarding your second solution, you could have the parser check these files for this specific problem; however, there are a few caveats:

  • This doesn’t solve the problem of the job being stuck for a long time doing nothing: parsing only happens after the job has been terminated by SLURM.
  • The Parser does not have the power to resubmit the job. It can only parse the files and return a particular exit code to communicate the error. You would have to wrap the CalcJob in a WorkChain that checks for that exit code and resubmits the job (see the sketch after this list).
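
A minimal sketch of such a wrapping work chain, built on the generic BaseRestartWorkChain shipped with aiida-core. This assumes the hypothetical plugin and the `ERROR_SCHEDULER_JOB_STALLED` exit code from the earlier snippet; it is not a definitive implementation:

```python
from aiida.engine import BaseRestartWorkChain, ProcessHandlerReport, process_handler, while_
from aiida.plugins import CalculationFactory

# Hypothetical calculation plugin that defines the ERROR_SCHEDULER_JOB_STALLED exit code
MyCalculation = CalculationFactory('my_plugin.calc')


class MyRestartWorkChain(BaseRestartWorkChain):
    """Wrap ``MyCalculation`` and resubmit it if it stalled in SLURM."""

    _process_class = MyCalculation

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.expose_inputs(MyCalculation, namespace='calc')
        spec.expose_outputs(MyCalculation)
        spec.outline(
            cls.setup,
            while_(cls.should_run_process)(
                cls.run_process,
                cls.inspect_process,
            ),
            cls.results,
        )

    def setup(self):
        super().setup()
        # Inputs that will be used for every (re)submission of the calculation
        self.ctx.inputs = self.exposed_inputs(MyCalculation, 'calc')

    @process_handler(exit_codes=MyCalculation.exit_codes.ERROR_SCHEDULER_JOB_STALLED)
    def handle_stalled_job(self, node):
        """Resubmit with unchanged inputs when the job never actually started."""
        self.report(f'calculation {node.pk} appears to have stalled in SLURM; restarting')
        return ProcessHandlerReport(do_break=True)
```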

If you still want/need to fix this in AiiDA, I agree with @t-reents that your best bet would be to attach a monitor to the CalcJob. A monitor is a function that you define and that is called while the job is running with SLURM. You can use it to retrieve the output files and parse them, and if you notice the problem, you can instruct the engine to kill the job. For the restart, you would still need the wrapping base restart work chain, though.
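
For reference, a minimal monitor sketch under the same assumptions as above: it pulls the scheduler stderr (`_scheduler-stderr.txt` is the default filename used by the SLURM scheduler plugin) from the remote working directory and, if it finds the stall message, returns a string, which tells the engine to kill the job with that message. It would be exposed through the `aiida.calculations.monitors` entry point group of the plugin package, as in the earlier snippet.

```python
from __future__ import annotations

import os
import posixpath
import tempfile

from aiida.orm import CalcJobNode
from aiida.transports import Transport


def monitor_stalled_job(node: CalcJobNode, transport: Transport) -> str | None:
    """Kill the job if SLURM keeps refusing to create the job step.

    Returning a string tells the engine to kill the job with that message;
    returning ``None`` lets the job keep running.
    """
    workdir = node.get_remote_workdir()

    with tempfile.TemporaryDirectory() as dirpath:
        local_path = os.path.join(dirpath, 'stderr.txt')
        try:
            transport.getfile(posixpath.join(workdir, '_scheduler-stderr.txt'), local_path)
        except OSError:
            return None  # file not there yet; keep waiting

        with open(local_path) as handle:
            stderr = handle.read()

    if 'step creation still disabled' in stderr:
        return 'SLURM could not create the job step: the job appears stalled'

    return None
```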