How to fail calcjob when the underlying process failed?

Technici4n · November 11, 2024, 1:23pm

Hi!

I am working on a plugin to integrate AiiDA with DFTK. For now I would like to fail the calcjob if running DFTK failed.

Here is the parser implementation: aiida-dftk/src/aiida_dftk/parsers.py at 377bbf170de0ed545bebefa612d301b51d0a3b68 · aiidaplugins/aiida-dftk · GitHub.

In typical failures of DFTK the parser will exit with self.exit_codes.ERROR_MISSING_SCFRES_FILE because the output file does not get written. But it can happen that DFTK fails after writing the output file. What I am looking for is a way to detect that the process did not exit successfully and, if that is the case, fail the calcjob. Or is the parser expected to detect these failure cases itself, for example because some output file is missing?

Thanks a lot!

t-reents · November 12, 2024, 3:43pm

Hi,

to the best of my knowledge, this should indeed be implemented in the parser.
My justification for this: AiiDA periodically checks the queue to see whether the job is still running. If it disappears from the queue, the job is considered terminated, and the parsing will be triggered. Therefore, there is no direct way for AiiDA to see that a job failed without the parsing (if anyone knows a different way, please correct me).

Coming back to your example, even if all the output files are produced, there should typically be a way to see that DFT-K failed, no? For example, something would be written into the stdout or stderr file of the scheduler.

To turn the question around, how would you detect that your job failed without checking any files?

Technici4n · November 12, 2024, 4:19pm

I was thinking of checking that the exit code of the scheduler is indeed 0, but couldn’t find a way to access it from the parser.

there should typically be a way to see that DFT-K failed, no?

DFTK is designed for interactive usage, so I think that an exception would be the usual failure “message”. An uncaught exception should crash the Julia process and cause it to return with a nonzero exit code.

One thing I could do is add a println("AiidaDFTK computation ran successfully!") at the end of the script and fail if this is not in stdout, but it feels uglier than checking if exit_code == 0.
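
For reference, a minimal sketch of what that marker check could look like on the parser side. The helper name finished_cleanly is hypothetical, and requiring the marker to be the last non-empty line of stdout is an assumption of this sketch, not something DFTK or AiiDA prescribes.

```python
# Marker printed by the (hypothetical) final println in the Julia script.
SUCCESS_MARKER = "AiidaDFTK computation ran successfully!"


def finished_cleanly(stdout: str, marker: str = SUCCESS_MARKER) -> bool:
    """Return True if the last non-empty line of stdout equals the marker."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == marker
```

A parser could read the retrieved stdout file, pass its content to this function, and return a failure exit code when it yields False.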

t-reents · November 12, 2024, 4:39pm

I see. I think there might be a way to inspect it. You could try to access the detailed_job_info in your parser, e.g. similarly to what is done for the parsing of the scheduler output: aiida-core/src/aiida/engine/processes/calcjobs/calcjob.py at dd866ce816e986285f2c5794f431b6e3c68a369b · aiidateam/aiida-core · GitHub

In the case of Slurm, the detailed_job_info is populated from the output of the sacct command (aiida-core/src/aiida/schedulers/plugins/slurm.py at main · aiidateam/aiida-core · GitHub), which should contain some ExitCode information. Note that in that case you would rely on the assumption that every scheduler plugin implements _get_detailed_job_info_command in such a way that the exit code can be accessed.
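
As a rough sketch of what extracting that information could look like: the function below parses pipe-delimited sacct output and collects the ExitCode column, which sacct reports as exit_code:signal. It assumes the text has a header row with JobID and ExitCode columns, as sacct --parsable prints. The function name and the sample text are made up for illustration, and how the text is obtained from detailed_job_info is left out.

```python
from __future__ import annotations


def sacct_exit_codes(stdout: str) -> dict[str, tuple[int, int]]:
    """Map each JobID row of pipe-delimited sacct output to (exit_code, signal)."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        return {}
    # With --parsable every row ends in a trailing '|', so strip it first.
    header = lines[0].rstrip("|").split("|")
    if "JobID" not in header or "ExitCode" not in header:
        return {}
    job_col, exit_col = header.index("JobID"), header.index("ExitCode")
    codes = {}
    for line in lines[1:]:
        fields = line.rstrip("|").split("|")
        if len(fields) <= max(job_col, exit_col):
            continue  # row is too short to contain both columns
        code, _, signal = fields[exit_col].partition(":")
        try:
            codes[fields[job_col]] = (int(code), int(signal or 0))
        except ValueError:
            continue  # skip rows whose ExitCode is not numeric
    return codes


# Made-up sample output, for illustration only.
sample = "JobID|State|ExitCode|\n1234|FAILED|1:0|\n1234.batch|FAILED|1:0|\n"
failed = any(code != 0 or sig != 0 for code, sig in sacct_exit_codes(sample).values())
```

This only works for schedulers whose detailed job info exposes an exit code in this form, which is the caveat mentioned above.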

sphuber · November 12, 2024, 5:59pm

@t-reents is right. AiiDA just checks the status of the job with the scheduler. Once the job has terminated, it will proceed with calling the parser, if one was specified in the inputs. It is up to the Parser implementation to check if the program ran successfully and how to communicate a failure. The reason AiiDA doesn’t check is that not all programs properly use exit codes to communicate success or failure, and that multiple programs (and even arbitrary bash scripts) can be run in a submission script.

You are right that checking the exit code of the program for a nonzero value would indeed make the most sense, and that is exactly what I have done for aiida-shell. The calcjob always adds echo $? > status to the append_text. This writes the exit status of the previously executed command to the status file.

sphuber/aiida-shell/blob/14866d1450aa252ec414373e86e98611e2eae9db/src/aiida_shell/calculations/shell.py#L318


      
          code_info.stdin_name = filename_stdin
          code_info.stdout_name = self.node.get_option('output_filename') or self.FILENAME_STDOUT
          
          if self.node.get_option('redirect_stderr'):
              code_info.join_files = True
          else:
              code_info.stderr_name = self.FILENAME_STDERR
          
          calc_info = CalcInfo()
          calc_info.codes_info = [code_info]
          calc_info.append_text = f'echo $? > {self.FILENAME_STATUS}'
          calc_info.remote_copy_list = remote_copy_list
          calc_info.remote_symlink_list = remote_symlink_list
          calc_info.retrieve_temporary_list = retrieve_list
          calc_info.provenance_exclude_list = [p.name for p in dirpath.iterdir()]
          calc_info.file_copy_operation_order = [
              FileCopyOperation.REMOTE,
              FileCopyOperation.LOCAL,
              FileCopyOperation.SANDBOX,
          ]

This ensures that the line is added to the submission script directly after the main execution line. So it looks something like:

...
./main_program --flag input
echo $? > status

The status file is then added to the retrieve_list so that the Parser can check the value in it. If the value is not 0, the Parser returns an AiiDA exit code to mark the CalcJob as failed. You could use the same approach in your calcjob.
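
A minimal sketch of the parser-side check, kept free of AiiDA imports so it runs on its own. The helper names are hypothetical, and treating a missing or unreadable status file as a failure is an assumption of this sketch rather than what aiida-shell necessarily does.

```python
from __future__ import annotations


def parse_status_file(content: str) -> int | None:
    """Return the exit status stored in the status file, or None if unreadable."""
    try:
        return int(content.strip())
    except ValueError:
        return None


def program_failed(content: str | None) -> bool:
    """Treat a missing, unreadable, or nonzero status as a failure."""
    if content is None:
        return True
    status = parse_status_file(content)
    return status is None or status != 0


# Inside Parser.parse this might be used roughly as follows, where
# ERROR_NONZERO_EXIT_STATUS is a hypothetical exit code that the CalcJob
# would have to define in its spec:
#     content = self.retrieved.base.repository.get_object_content("status")
#     if program_failed(content):
#         return self.exit_codes.ERROR_NONZERO_EXIT_STATUS
```

If the status file was not retrieved at all, reading it from the retrieved folder raises an error, so a real parser would also need to handle that case before calling the helper.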

Technici4n · November 14, 2024, 10:54am

Thanks @sphuber that is a clever trick. However, I wonder what will happen when running the program on multiple nodes? Is there only a single mpirun call that I will be able to capture the output of, or will each node make its own mpirun call? In the latter case, there might be a race condition when trying to write the exit code to the file.

sphuber · November 14, 2024, 3:56pm

The submission script would look something like:

...
mpirun -n 8 main_program --flag input > stdout
echo $? > status

This means that echo $? is not run within the MPI context. It will run after mpirun exits, and the value written to status will be the exit status of the mpirun invocation. So as long as that returns a correct exit status, it should work as expected. There shouldn’t be a race condition, as it is not the MPI processes that are each executing the echo.
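
The ordering can be checked with a small experiment. The sketch below runs a two-line shell script, a command followed by echo $? > status, and reads the file back. It uses a stand-in program instead of mpirun, so no MPI launcher is involved, and it assumes that sh is available. The function names are made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_record_status(command: str, workdir: Path) -> int:
    """Run `command` in a shell, then write its exit status to `status`."""
    # Two lines, like the submission script: the main command, then echo $?.
    # The echo is executed once, by the shell, after the command has returned.
    script = f"{command}\necho $? > status\n"
    subprocess.run(["sh", "-c", script], cwd=workdir, check=False)
    return int((workdir / "status").read_text().strip())


def fake_program(exit_code: int) -> str:
    """Shell command for a stand-in program that exits with `exit_code`."""
    return f'"{sys.executable}" -c "import sys; sys.exit({exit_code})"'


with tempfile.TemporaryDirectory() as tmp:
    ok_status = run_and_record_status(fake_program(0), Path(tmp))
    bad_status = run_and_record_status(fake_program(3), Path(tmp))
```

Here ok_status and bad_status hold whatever the stand-in program returned, written by a single echo. Whether a real MPI launcher propagates a failure of one rank into its own exit status is a separate question that this sketch does not answer.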

Technici4n · November 15, 2024, 10:08am

Yes, I agree that it will be fine when running on a single machine.

Will that also hold when running on different nodes in a cluster? Admittedly I am not too familiar with the specifics there. Is there a single mpirun call that runs on one node, for the entire reservation? Or does each node perform an mpirun call? Say we have 10 nodes, each with 32 cores. Do you know if there will be only one mpirun call for all 320 processes, or if there will be 10 of them for 32 processes each?
