We are running DFT+U calculations with PwBaseWorkChain and have found that some PwCalculation jobs are killed by what appear to be hardware-related problems (bus errors on the compute node). The affected calculations finish with exit code 312 or 305, and the parent PwBaseWorkChain then returns exit code 300 without relaunching the calculation.
As a workaround, we are modifying our workflow to detect these cases and relaunch the calculations manually. Is there an existing mechanism or recommended practice in AiiDA or aiida-quantumespresso for handling such hardware failures?
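To make the question concrete, here is a minimal sketch of the kind of detection step we have in mind. It is plain Python with no AiiDA imports, and it is only an illustration: the function names, the regex patterns, and the `max_attempts` budget are our own assumptions, chosen to match the exit codes and the "Bus error" / "signal 7" text in the scheduler errors of our two failed jobs.

```python
import re

# Exit statuses taken from the reports in this post.
PW_CALC_SUSPECT_EXIT_STATUSES = {305, 312}
PW_BASE_UNHANDLED_EXIT_STATUS = 300

# Patterns seen in the scheduler stderr of our two failed jobs.
# Assumption: other hardware faults may need additional patterns.
HARDWARE_PATTERNS = [
    re.compile(r"\bBus error\b", re.IGNORECASE),
    re.compile(r"Caught signal 7\b"),
    re.compile(r"exited on signal 7\b"),
]


def looks_like_hardware_failure(calc_exit_status, scheduler_stderr):
    """Return True if the calculation failed with a suspect exit status
    and its scheduler stderr contains a known hardware-fault signature."""
    if calc_exit_status not in PW_CALC_SUSPECT_EXIT_STATUSES:
        return False
    if not scheduler_stderr:
        return False
    return any(p.search(scheduler_stderr) for p in HARDWARE_PATTERNS)


def should_relaunch(workchain_exit_status, calc_exit_status,
                    scheduler_stderr, attempts_so_far, max_attempts=3):
    """Decide whether to relaunch, with a hypothetical retry budget so a
    persistently broken node cannot trigger endless resubmission."""
    if workchain_exit_status != PW_BASE_UNHANDLED_EXIT_STATUS:
        return False
    if attempts_so_far >= max_attempts:
        return False
    return looks_like_hardware_failure(calc_exit_status, scheduler_stderr)
```

In our workflow the inputs would come from the failed nodes: as far as we understand, `exit_status` is available on the process nodes, the scheduler errors can be read with `get_scheduler_stderr()` on the calculation node, and a relaunch could start from `get_builder_restart()`. Please correct us if there is a better entry point.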
For reference, here are the reports of two failed calculations:
**Finished with Exit code 305**
*** 7561: None
*** (empty scheduler output file)
*** Scheduler errors:
/var/spool/slurmd/job20575547/slurm_script: line 27: 707130 Bus error 'mpirun' '-np' '4' '--map-by' 'socket:PE=8' '--rank-by' 'core' 'pw.x' '-in' 'aiida.in' > 'aiida.out'
*** 4 LOG MESSAGES:
+-> WARNING at 2025-09-22 10:03:20.429657+02:00
| key 'symmetries' is not present in raw output dictionary
+-> ERROR at 2025-09-22 10:03:20.475065+02:00
| ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> ERROR at 2025-09-22 10:03:20.480487+02:00
| Both the stdout and XML output files could not be read or parsed.
+-> WARNING at 2025-09-22 10:03:20.483329+02:00
| output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed.
**Finished with Exit code 312**
*** 8018: None
*** (empty scheduler output file)
*** Scheduler errors:
[lrdn2595:472876:0:472876] Caught signal 7 (Bus error: nonexistent physical address)
[lrdn2595:472873:0:472873] Caught signal 7 (Bus error: nonexistent physical address)
[lrdn2595:472874:0:472874] Caught signal 7 (Bus error: nonexistent physical address)
[lrdn2595:472875:0:472875] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 472875) ====
0 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14f6a1e5b3cc]
1 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libucs.so.0(+0x2c5b4) [0x14f6a1e5b5b4]
2 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libucs.so.0(+0x2c71a) [0x14f6a1e5b71a]
3 /leonardo/prod/opt/compilers/cuda/11.8/none/lib64/libcudart.so.11.0(cudaFree+0) [0x14f70309b3e0]
4 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/compilers/lib/libcudafor_118.so(__dev_dealloc03_i8+0xe5) [0x14f6fd2ad286]
5 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/compilers/lib/libcudafor_118.so(pgf90_dev_dealloc03_i8+0x5d) [0x14f6fd2ad4ec]
6 pw.x() [0xe4b8f8]
7 pw.x() [0xe4ba0d]
8 pw.x() [0x62190a]
9 pw.x() [0x4181d3]
10 pw.x() [0x415a4e]
11 pw.x() [0x529bf9]
12 pw.x() [0x41222b]
13 pw.x() [0x412073]
14 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(__libc_start_main+0xe5) [0x14f6fa1a9d85]
15 pw.x() [0x40d6ee]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 472873 on node lrdn2595 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
*** 3 LOG MESSAGES:
+-> ERROR at 2025-09-22 11:18:56.261530+02:00
| ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> ERROR at 2025-09-22 11:18:56.266862+02:00
| The stdout output file was incomplete probably because the calculation got interrupted.
+-> WARNING at 2025-09-22 11:18:56.270159+02:00
| output parser returned exit code<312>: The stdout output file was incomplete probably because the calculation got interrupted.
Any insights or suggestions would be greatly appreciated!
Thanks in advance for your help!
Best regards,
Anumita