In typical failures of DFTK the parser will exit with self.exit_codes.ERROR_MISSING_SCFRES_FILE because the output file does not get written. But it can happen that DFTK fails after writing the output file. What I am looking for is a way to detect that the process did not exit successfully and, if that is the case, fail the calcjob. Or is it expected that the parser should detect these failure cases because some output file is missing?
To the best of my knowledge, this should indeed be implemented in the parser.
My justification for this: AiiDA periodically checks the queue to see whether the job is still running. If it disappears from the queue, the job is considered terminated, and the parsing will be triggered. Therefore, there is no direct way for AiiDA to see that a job failed without the parsing (if anyone knows a different way, please correct me).
Coming back to your example, even if all the output files are produced, there should typically be a way to see that DFT-K failed, no? For example, something would be written into the stdout or stderr file of the scheduler.
To turn the question around: how would you detect that your job failed without checking any files?
I was thinking of checking that the exit code of the scheduler is indeed 0, but couldn’t find a way to access it from the parser.
there should typically be a way to see that DFT-K failed, no?
DFTK is designed for interactive usage, so I think that an exception would be the usual failure “message”. An uncaught exception should crash the Julia process and cause it to return with a nonzero exit code.
One thing I could do is add a println("AiidaDFTK computation ran successfully!") at the end of the script and fail if this is not in stdout, but it feels uglier than checking if exit_code == 0.
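The sentinel idea above could be checked on the parser side roughly as follows. This is only a sketch: the helper name `finished_cleanly` is hypothetical, and it assumes the script prints the sentinel on a line of its own as its very last action.

```python
# Sentinel text taken from the println(...) suggestion above.
SENTINEL = "AiidaDFTK computation ran successfully!"


def finished_cleanly(stdout_text: str, sentinel: str = SENTINEL) -> bool:
    """Return True if some line of stdout is exactly the sentinel.

    Hypothetical helper: the parser would call this on the content of the
    retrieved stdout file and return a failure exit code when it is False.
    """
    return any(line.strip() == sentinel for line in stdout_text.splitlines())
```

An uncaught exception would stop the script before the final println, so the sentinel would be absent and the parser could mark the calcjob as failed.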
In the case of Slurm, the detailed_job_info is populated from the output of the sacct command (aiida-core/src/aiida/schedulers/plugins/slurm.py at main · aiidateam/aiida-core · GitHub), which should contain some ExitCode information. Note that in that case you would be relying on the assumption that every scheduler plugin implements _get_detailed_job_info_command in such a way that the exit code is accessible.
@t-reents is right. AiiDA just checks the status of the job with the scheduler. When that is terminated, it will proceed with calling the parser, if one was specified in the inputs. It is up to the Parser implementation to check if the program ran successfully and how to communicate a failure. The reason AiiDA doesn’t check is that not all programs properly use exit codes to communicate success or failure, and that multiple programs (and even arbitrary bash scripts) can be run in a submission script.
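For illustration, extracting the exit code from sacct-style text might look like the sketch below. It assumes the text is pipe-delimited with a header row that names an ExitCode column, and that the value has the `code:signal` form that sacct reports; the function name and the exact layout of what the Slurm plugin stores are assumptions, so check them against the actual detailed_job_info before relying on this.

```python
from typing import Optional, Tuple


def parse_sacct_exit_code(stdout_text: str) -> Optional[Tuple[int, int]]:
    """Return (exit_code, signal) from the first data row, or None.

    Assumes pipe-delimited text whose first non-empty line is a header
    containing an 'ExitCode' column (an assumption about the format).
    """
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if len(lines) < 2:
        return None  # no header plus data row available

    header = lines[0].split("|")
    if "ExitCode" not in header:
        return None
    index = header.index("ExitCode")

    fields = lines[1].split("|")
    if index >= len(fields):
        return None

    # sacct reports the value as "<exit code>:<signal>", e.g. "1:0".
    code, _, signal = fields[index].partition(":")
    try:
        return int(code), int(signal or 0)
    except ValueError:
        return None
```

Because this depends on scheduler-specific output, it would only work for schedulers whose plugin exposes the exit code in this way.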
You are right that checking the exit code of the program for a nonzero value would indeed make the most sense, and that is exactly what I have done for aiida-shell. The calcjob always adds echo $? > status to the append_text. This writes the exit status of the previously executed command to the status file.
This ensures that that line is added to the submission script directly after the main execution line. So it looks something like:
...
./main_program --flag input
echo $? > status
The status file is then added to the retrieve_list and the Parser can then check the value in there and if it is not 0 it returns an AiiDA exit code to indicate the CalcJob as failed. You could use the same approach in your calcjob.
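On the parser side, interpreting the retrieved status file could be sketched like this. The helper names are hypothetical, and the sketch only covers turning the file content into a decision; reading the file from the retrieved folder and returning the corresponding AiiDA exit code is left to the Parser.

```python
from typing import Optional


def read_exit_status(status_text: str) -> Optional[int]:
    """Parse the content written by `echo $? > status`.

    Returns the integer exit status, or None if the file is empty or
    does not contain a number (for example if the job was killed
    before the echo line ran).
    """
    try:
        return int(status_text.strip())
    except ValueError:
        return None


def program_succeeded(status_text: str) -> bool:
    """True only when a status was recorded and it equals 0."""
    return read_exit_status(status_text) == 0
```

Treating a missing or unreadable status as a failure is a design choice here: if the echo line never ran, the job most likely did not finish normally.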
Thanks @sphuber that is a clever trick. However, I wonder what will happen when running the program on multiple nodes? Is there only a single mpirun call that I will be able to capture the output of, or will each node make its own mpirun call? In the latter case, there might be a race condition when trying to write the exit code to the file.
This means that echo $? is not run within the MPI context. It will run after MPI exits and the value written to status will be the exit status of the mpirun invocation. So as long as that returns a correct exit status, it should work as expected. There shouldn’t be a race condition as it is not the MPI processes that are each executing the echo.
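The behaviour described here can be reproduced locally with a small sketch. It does not launch MPI: an arbitrary shell command stands in for the mpirun line, and the function name is hypothetical. It assumes a POSIX `sh` is available.

```python
import subprocess
import tempfile
from pathlib import Path


def run_and_record_status(command: str) -> int:
    """Run `command` followed by `echo $? > status` in one shell script.

    Mimics the two relevant lines of a submission script and returns
    the number that ends up in the status file.
    """
    script = f"{command}\necho $? > status\n"
    with tempfile.TemporaryDirectory() as workdir:
        # The script is executed once by a single shell, so the echo
        # runs exactly once, after the command above it has returned.
        subprocess.run(["sh", "-c", script], cwd=workdir, check=False)
        return int((Path(workdir) / "status").read_text().strip())
```

For example, `run_and_record_status("sh -c 'exit 3'")` records the exit status of the stand-in command, just as the status file would record the exit status of the launcher line in a real submission script.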
Yes, I agree that it will be fine when running on a single machine.
Will that also hold when running on different nodes in a cluster? Admittedly I am not too familiar with the specifics there. Is there a single mpirun call that runs on one node, for the entire reservation? Or does each node perform an mpirun call? Say we have 10 nodes, each with 32 cores. Do you know if there will be only one mpirun call for all 320 processes, or if there will be 10 of them for 32 processes each?