Elapsed time for a crashed job

Is it possible to calculate the elapsed time of a crashed calculation from a CalcJob node, assuming I can parse the start time from the output? Does it make sense to take “retrieved.mtime” of the CalcJob node as the time at which the job crashed?

Thanks!
Hossein

That won’t be very accurate; at best it gives you an upper limit. The retrieval may have happened much later than the actual crash. AiiDA polls the scheduler to find out when the job has finished, which can be well after the fact, and if your daemon was not running at the time, the files would only be retrieved once you start the daemon again.

If you are using the SLURM plugin, that writes some useful additional info provided by SLURM itself. It is written to the attributes of the calculation node. Have a look at the output of verdi node attributes <PK>. If your SLURM supports it, it should have a key last_job_info with a lot of useful information, including timings I believe.

Thanks!
The node attributes show the time the job was submitted (‘submission_time’)
but not the time the job started. Probably SLURM does not support it.
Is it possible to read the last time the output file was modified in the remote folder? Or from the retrieved files?

AiiDA does not store any metadata of the files in its file repository. However, if the remote folder still exists, you can check the timestamps of those files, for example the _scheduler-stderr.txt to get an idea of when it was last written to.

That is what I want to do. But I do not know how to check the timestamps of that file in the remote folder!

Hi,
I agree with Sebastiaan, but depending on your needed precision (do you need second-precision, or an error of up to a minute is OK?) and if your daemon was correctly running at the time the calculations crashed, probably the time of the parsed results node will be ~30-60s later than the end time and could be good enough.

From AiiDA’s point of view the start time is indeed unknown: you have to hope that either the job wrote it somewhere, or that the scheduler has the information.

I’d recommend to double check again the last_job_info (or similarly named) attribute of the calcjob. We introduced it at the very beginning of AiiDA exactly for this purpose. There was a bug fixed several months ago, so it might not be there if you have an “oldish” version of AiiDA, but otherwise you should have it. If you have slurm, the info should be there, the command was working 10 years ago so I doubt you have a SLURM version without the sacct command (unless you have a version with a different command line… in which case, it would be useful for us to know so we can improve the SLURM plugin).

It would be very useful for us to know if you have it or not, if it’s empty, etc.

In any case, if none of the above works/applies, here is some code using AiiDA to simplify connecting via SSH to the computer and get file information, assuming the variable calcjobnode is the CalcJob node:

import os

# get_remote_path() returns only the remote working folder of the calculation
remote_dir = calcjobnode.outputs.remote_folder.get_remote_path()
# Append the name of the file you want to inspect,
# e.g. '_scheduler-stderr.txt' or similar
path = os.path.join(remote_dir, '_scheduler-stderr.txt')
# Entering the with block opens the connection to the computer
with calcjobnode.get_transport() as t:
    attrs = t.get_attribute(path)

print(attrs)

This will return a FileAttribute object, similar to this:

FileAttribute({'st_size': 2596, 'st_uid': 501, 'st_gid': 20, 'st_mode': 33188, 'st_atime': 1702307526.9606802, 'st_mtime': 1702307520.9999373})

where you have UID, GID, file size, file mode, and also atime and mtime. These should follow the output of the stat module, so check the docs there for the meaning of the various times and how to convert to actual times.
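As a minimal sketch of that conversion: st_mtime is a POSIX timestamp (seconds since the epoch), so the standard datetime module can turn it into a readable time and subtract a start time from it. The mtime value is the one from the example above; the start time is a made-up placeholder for whatever you manage to parse from your own output, and I am assuming both are expressed in UTC.

```python
from datetime import datetime, timezone

# st_mtime copied from the FileAttribute example above (seconds since the epoch)
st_mtime = 1702307520.9999373

# Convert the timestamp to a timezone-aware datetime in UTC
last_written = datetime.fromtimestamp(st_mtime, tz=timezone.utc)

# Hypothetical start time, standing in for the value parsed from the job output
start_time = datetime(2023, 12, 11, 14, 0, 0, tzinfo=timezone.utc)

# Difference between last write and start: an estimate of the elapsed time
elapsed = last_written - start_time

print("Last written:", last_written.isoformat())
print(f"Estimated elapsed time: {elapsed.total_seconds():.0f} s")
```

Keep in mind this only tells you when the file was last written to, which is an estimate of the crash time, not an exact value.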

Important note: every with block opens a new SSH connection. So if you have to check multiple files on the same computer, it’s better to create a list of paths, open the connection once, and run all commands in a for loop inside the with block. Otherwise you risk being banned from the supercomputer center!
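A rough sketch of that pattern, not tested against a real cluster: the helper name collect_attributes is made up for illustration, and it only assumes that the transport can be used in a with block and has the get_attribute method used above.

```python
def collect_attributes(transport, paths):
    """Return {path: attributes} for all paths, using a single connection."""
    results = {}
    # The connection is opened once here and closed when the block exits
    with transport as t:
        for path in paths:
            results[path] = t.get_attribute(path)
    return results
```

You would call it as collect_attributes(calcjobnode.get_transport(), paths), where paths is your list of remote file paths, all on the same computer.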


Hi,
I think it has to do with the AiiDA version. I have two: v2.4.2 and v2.5.1. The node attributes are different although both jobs were submitted to the same cluster.
I am attaching two files, one for a VASP job with AiiDA 2.5.1 and the other for a CP2K job with AiiDA 2.4.2.
As you can see, more information is provided by AiiDA 2.4.2. ‘dispatch_time’ and ‘wallclock_time_seconds’ are missing in v2.5.1 (none of the jobs crashed, these are only examples). The CP2K job started at 16:01:15 and ended at 21:40:31. The VASP job started at 14:03:51 and ended 16.75 seconds later.
‘scheduler_lastchecktime’ seems to be close to the time the jobs ended, but there is no such information in the ‘last_job_info’ for v2.5.1.
vasp_aiida2.5.1.txt (2.2 KB)
cp2k_aiida2.4.2.txt (4.5 KB)

Hi, sorry for the slow reply. We were imprecise: what you actually want to look at is the detailed_job_info. The last_job_info is a “cache” of the last check that AiiDA did of the job. The reason the VASP job has less info is that it finished very quickly, so AiiDA probably checked only once (or a few times) while it was still in the queue (state PD, pending), and by the next check the job was no longer in the output of squeue.

Instead, the goal of detailed_job_info is to call the sacct command once the job is done, to fetch and store additional useful information.
The docs are in the docstring, see this PR.

In AiiDA 2.5 actually the detailed_job_info retrieval fails: sacct: error: Invalid field requested: "Reserved"\n

Recently we changed the way this is obtained, I think exactly because the Reserved field was renamed to Planned. Here is the PR: “`SlurmScheduler`: Make detailed job info fields dynamic” by sphuber (Pull Request #6270, aiidateam/aiida-core).

Is it possible that your cluster updated SLURM between the two jobs?

The PR I think is still only in main.

You can anyway check whether it would fix the problem (it would be great if you reported back if it does not, so we can fix it further) by picking the Job ID of a recently terminated job (not too old, or the info might have been deleted) and running
sacct --format=$(sacct --helpformat | tr -s '\n' ' ' | tr ' ' ',') --parsable --jobs={job_id}
where you replace {job_id} with the SLURM Job ID.
And report back if it works or not.
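If the command works, its output is plain text with fields separated by '|' and, with --parsable, a trailing '|' at the end of each line. Here is a minimal sketch of how one could turn that text into dictionaries to read the timings. The function name and the sample string are made up for illustration (JobID, Start, End and Elapsed are real sacct fields, but the values are invented), and I am assuming you have the raw text at hand, for example copied from the terminal.

```python
def parse_sacct_parsable(stdout):
    """Parse `sacct --parsable` output into a list of dicts, one per row."""

    def split_fields(line):
        # --parsable ends every line with '|', which leaves one empty last field
        fields = line.split('|')
        return fields[:-1] if line.endswith('|') else fields

    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        return []
    header = split_fields(lines[0])  # first line holds the field names
    return [dict(zip(header, split_fields(line))) for line in lines[1:]]


# Invented sample output, for illustration only
sample = (
    "JobID|Start|End|Elapsed|\n"
    "12345|2023-12-11T16:01:15|2023-12-11T21:40:31|05:39:16|\n"
)

for row in parse_sacct_parsable(sample):
    print(row["JobID"], row["Start"], row["End"], row["Elapsed"])
```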

Unfortunately, for those calculations, you will have to get some estimate of the running time from somewhere else in the meantime.

Dear Giovanni,
Thanks for your reply.
The ‘sacct’ command works. See the attached file.
I managed to get information from last_job_info for jobs that are not very short.
I will use detailed_job_info when the issue is solved.
out.txt (6.9 KB)