Is the scheduler error out of walltime actually working?

Hi everyone,
I have an implementation of a Parser that should catch the “built-in” CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_WALLTIME and simply return that error. My pytests show that the implementation works, provided the ERROR_SCHEDULER_OUT_OF_WALLTIME is detected.
However, on a real test CalcJob that runs out of walltime on purpose, the scheduler OOW error is not caught.

This is the report of the CalcJob:

*** 3952595: None
*** Scheduler output:
[...]

*** Scheduler errors:
[...]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 5476661 ON gcn2101 CANCELLED AT 2024-01-30T13:34:08 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5476661.2 ON gcn2101 CANCELLED AT 2024-01-30T13:34:08 DUE TO TIME LIMIT ***
[...]

Any idea why this is happening? I am using aiida-core 2.4.0.post0 (a branch from Sebastiaan P. Huber).

It would be helpful to see the code of the parser.

The scheduler errors in the report seem like they are expected. How exactly are you getting the report? What is the exit code of the CalcJob?

This is the parse method of the Parser:

    def parse(self, **kwargs):
        """Parse the contents of the output files retrieved in the `FolderData`."""
        logs = get_logging_container()

        _, parsed_data, logs = self.parse_stdout_from_retrieved(logs)

        base_exit_code = self.check_base_errors(logs)
        if base_exit_code:
            return self.exit(base_exit_code, logs)

        self.add_units(parsed_data)
        self.out('parameters', orm.Dict(parsed_data))

        # First check whether the scheduler already reported an exit code.
        if self.node.exit_status is not None:

            # The following scheduler errors should correspond to cases where we can simply restart the calculation
            # and have a chance that the calculation will succeed as the error can be transient.
            recoverable_scheduler_error = self.node.exit_status in [
                FlareCalculation.exit_codes.ERROR_SCHEDULER_OUT_OF_WALLTIME.status,
                FlareCalculation.exit_codes.ERROR_SCHEDULER_NODE_FAILURE.status,
            ]

            if recoverable_scheduler_error:
                # Now it is unlikely we can provide a more specific exit code so we keep the scheduler one.
                return ExitCode(self.node.exit_status, self.node.exit_message)

        if 'retrieved_temporary_folder' in kwargs:  # might not be there?
            exit_code = self.parse_model(kwargs['retrieved_temporary_folder'])
            if exit_code in logs.error:
                return self.exit(self.exit_codes.get(exit_code), logs)

            if not self.is_fake:
                settings = {}
                if 'settings' in self.node.inputs:
                    settings = _lowercase_dict(self.node.inputs.settings.get_dict(), dict_name='settings')
                units = settings.get('units', FlareCalculation._default_units)
                self.parse_dft_trajectory(kwargs['retrieved_temporary_folder'], units)

        for exit_code in list(self.get_error_map().values()) + ['ERROR_OUTPUT_STDOUT_INCOMPLETE']:
            if exit_code in logs.error:
                return self.exit(self.exit_codes.get(exit_code), logs)

        return self.exit(logs=logs)

This is the full log:

*** 3952595: None
*** Scheduler output:
================================================================================
JobID = 5476661
User = hbiblore, Account = hbiblore
Partition = standard96, Nodelist = gcn2101
================================================================================
============ Job Information ===================================================
Submitted: 2024-01-30T12:33:38
Started: 2024-01-30T12:33:39
Ended: 2024-01-30T13:34:11
Elapsed: 61 min, Limit: 60 min, Difference: -1 min
CPUs: 192, Nodes: 1
Estimated NPL: 14.1087
================================================================================

*** Scheduler errors:
Module for Anaconda3 2020.11 loaded.
Module for Intel Compilers and Libraries 2018.5 loaded.
Module for Intel(R) MPI Library 2018.4 loaded.
Module for hdf5 (Version 1.10.5) loaded.
Module for GCC (version 9.3.0) loaded.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 5476661 ON gcn2101 CANCELLED AT 2024-01-30T13:34:08 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5476661.2 ON gcn2101 CANCELLED AT 2024-01-30T13:34:08 DUE TO TIME LIMIT ***
/scratch/usr/hbiblore/.conda/envs/flare/lib/python3.8/site-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.4) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/scratch/usr/hbiblore/.conda/envs/flare/lib/python3.8/site-packages/ase/md/md.py:48: FutureWarning: Specify the temperature in K using the 'temperature_K' argument
  warnings.warn(FutureWarning(w))

*** 3 LOG MESSAGES:
+-> ERROR at 2024-01-30 13:36:34.332489+01:00
 | ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> ERROR at 2024-01-30 13:36:34.336965+01:00
 | The stdout output file was incomplete probably because the calculation got interrupted.
+-> WARNING at 2024-01-30 13:36:34.339480+01:00
 | output parser returned exit code<312>: The stdout output file was incomplete probably because the calculation got interrupted.

Ah, sorry, my bad, I was testing with a “not-interrupted” output… I just noticed that I emit a base_exit_code first if one is detected, so the interrupted output is caught first. Sorry for wasting your time.

This is indeed a bit of a pitfall. The exit code set by the scheduler parser, if any, will be overwritten by any exit code returned by the “main” Parser. So the Parser implementation has to explicitly check whether an exit code is already set on the node and decide whether it wants to keep it or override it with another one. I agree that this is not immediately obvious, but I wasn’t sure how else to do it, short of allowing multiple exit codes, which I rejected since that would have had a lot of consequences in other parts of the code.

Thanks for the reply. I actually recalled that self.check_base_errors(logs) doesn’t check for the interrupted stdout; hence, it seems that self.node.exit_status is indeed simply None, even though the scheduler ran out of walltime. So it is not clear to me where the pitfall in my code is.

Did the scheduler actually detect the OOW error and set the exit code? The SlurmScheduler plugin checks the State key of the detailed job info and if it is TIMEOUT it will return the ERROR_SCHEDULER_OUT_OF_WALLTIME:

    if data['State'] == 'TIMEOUT':
        return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_WALLTIME

Could you check the attributes of the CalcJobNode that ran out of walltime and report the contents here? I would like to see if the state attribute contains TIMEOUT or not.

This is the result of verdi node attributes <PK>

{
    "append_text": "",
    "custom_scheduler_commands": "",
    "detailed_job_info": {
        "retval": 1,
        "stderr": "sacct: error: Invalid field requested: \"Reserved\"\n",
        "stdout": ""
    },
    "environment_variables": {},
    "environment_variables_double_quotes": false,
    "exit_message": "The stdout output file was incomplete probably because the calculation got interrupted.",
    "exit_status": 312,
    "import_sys_environment": true,
    "input_filename": "aiida.yaml",
    "job_id": "5477677",
    "last_job_info": {
        "allocated_machines_raw": "gcn2841",
        "annotation": "None",
        "dispatch_time": {
            "date": "2024-01-30T19:08:06.000000",
            "timezone": null
        },
        "job_id": "5477677",
        "job_owner": "hbiblore",
        "job_state": "running",
        "num_machines": 1,
        "num_mpiprocs": 192,
        "queue_name": "standard96",
        "raw_data": [
            "5477677",
            "R",
            "None",
            "gcn2841",
            "hbiblore",
            "1",
            "192",
            "gcn2841",
            "standard96",
            "1:00:00",
            "59:59",
            "2024-01-30T19:08:06",
            "aiida-3958716",
            "2024-01-30T19:08:05"
        ],
        "requested_wallclock_time_seconds": 3600,
        "submission_time": {
            "date": "2024-01-30T19:08:05.000000",
            "timezone": null
        },
        "title": "aiida-3958716",
        "wallclock_time_seconds": 3599
    },
    "max_wallclock_seconds": 3600,
    "metadata_inputs": {
        "metadata": {
            "call_link_label": "iteration_01",
            "dry_run": false,
            "options": {
                "append_text": "",
                "custom_scheduler_commands": "",
                "environment_variables_double_quotes": false,
                "import_sys_environment": true,
                "input_filename": "aiida.yaml",
                "max_wallclock_seconds": 3600,
                "mpirun_extra_params": [],
                "output_filename": "aiida.out",
                "parser_name": "learn.flare",
                "prepend_text": "",
                "resources": {
                    "num_cores_per_mpiproc": 96,
                    "num_machines": 1,
                    "num_mpiprocs_per_machine": 1
                },
                "scheduler_stderr": "_scheduler-stderr.txt",
                "scheduler_stdout": "_scheduler-stdout.txt",
                "submit_script_filename": "_aiidasubmit.sh",
                "withmpi": false
            },
            "store_provenance": true
        }
    },
    "mpirun_extra_params": [],
    "output_filename": "aiida.out",
    "parser_name": "learn.flare",
    "prepend_text": "",
    "process_label": "FlareCalculation",
    "process_state": "finished",
    "remote_workdir": "/scratch-emmy/projects/hbi00059/aiida_fs/7c/cb/0a13-e9cd-4d05-8910-ab9570898a20",
    "resources": {
        "num_cores_per_mpiproc": 96,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "aiida.out",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [
        "aiida_flare.json",
        "aiida_dft.xyz"
    ],
    "scheduler_lastchecktime": "2024-01-30T20:10:14.148966+01:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "submit_script_filename": "_aiidasubmit.sh",
    "version": {
        "core": "2.4.0.post0",
        "plugin": "0.1.0a0"
    },
    "withmpi": false
}

There is the problem. Apparently, the call to sacct to get the detailed job info failed. So the parser never set the ERROR_SCHEDULER_OUT_OF_WALLTIME exit code. Not sure why your sacct call is failing. What version of SLURM is installed?

Ahh, I see, thanks! This is what is installed on the cluster I use: slurm 23.02.7

It seems the Reserved field is no longer supported by sacct in the latest version: Slurm Workload Manager - sacct

Unfortunately, the SlurmScheduler hardcodes the list of fields it requests, so you would have to manually update the source code to fix it, or subclass the plugin, create a new entry point, and use that. Removing the Reserved key from the _detailed_job_info_fields list of SlurmScheduler should make it work (unless more fields have been removed), for example as sketched below.
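
If you go the subclass route, a minimal sketch could look like this (the class name and entry point are just illustrative; you would register it in the aiida.schedulers entry point group of your plugin package and select it when configuring the computer):

    from aiida.schedulers.plugins.slurm import SlurmScheduler

    class PatchedSlurmScheduler(SlurmScheduler):
        """SlurmScheduler variant whose detailed job info fields work with newer SLURM releases."""

        # Drop the field that newer sacct versions no longer accept; alternatively it
        # could be renamed to its replacement (see the next reply).
        _detailed_job_info_fields = [
            field for field in SlurmScheduler._detailed_job_info_fields if field != 'Reserved'
        ]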

I think this change is the culprit:

So instead of removing Reserved you could also rename it to Planned.

Ideally, though, the plugin should not fail if an unsupported field is requested. Or it could try to determine the fields dynamically, but that would involve at least a second call to sacct.

Or would it be possible to implement version-dependent fields?

Sure, but that would still require a call to the scheduler to determine the version, and on top of that you would still have to actively update the plugin whenever a new version changes something. The dynamic solution would hopefully work for all versions without requiring updates.

Looking into parse_output, the detailed_job_info had "stdout": "", which means the first try/except is triggered; hence, even if _scheduler-stderr.txt contains useful info, it is ignored.
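
Roughly, what I mean is something like this (a simplified sketch, not the actual aiida-core source; the function and variable names are illustrative):

    def parse_output_sketch(detailed_job_info, fields):
        """Simplified sketch: derive a scheduler exit code from the sacct stdout."""
        try:
            # With "stdout": "" there is no second line to read, so we bail out here
            # and the scheduler stderr, which did contain the TIMEOUT message, is
            # never consulted.
            master = detailed_job_info['stdout'].splitlines()[1]
        except IndexError:
            return None

        data = dict(zip(fields, master.split('|')))

        if data.get('State') == 'TIMEOUT':
            return 'ERROR_SCHEDULER_OUT_OF_WALLTIME'
        return None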

Could a solution be to look into the stderr before raising the error, and only raise it if nothing is found there? At least that would remove this version incompatibility.

Moreover, as a general question, why are all those fields requested via --format, when eventually only the State field is used?

It would be good to implement either the first or the second solution, so that plugins that build on the OOW detection of SLURM do not depend on the SLURM version and do not rely on custom aiida-core versions.

The problem is that I don’t want to be parsing the content of the stderr file, because that would be text parsing. I intentionally want to parse the output of sacct as that is structured.

The problem is that you would have to match the text to an exit code, and here the text could easily vary between versions for example. It is just not as robust.

This is because get_detailed_job_info was not added explicitly for parsing errors of failed jobs. It existed before to give a user more information about a completed job. It was just repurposed later on to parse the State key from it to determine job failures. We assumed that these fields would not change, but that was clearly an incorrect assumption ^^

I agree that this would be ideal, but I don’t think the proposed solutions are the correct ones. I think we should just find a way to determine the fields dynamically. They can be determined with sacct --helpformat, which returns something like:

$ sacct --helpformat
Account             AdminComment        AllocCPUS           AllocNodes         
AllocTRES           AssocID             AveCPU              AveCPUFreq         
AveDiskRead         AveDiskWrite        AvePages            AveRSS             
AveVMSize           BlockID             Cluster             Comment            
Constraints         ConsumedEnergy      ConsumedEnergyRaw   Container          
CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode    
Elapsed             ElapsedRaw          Eligible            End                
ExitCode            Flags               GID                 Group              
JobID               JobIDRaw            JobName             Layout    

The scheduler plugin provides the command to call to get the detailed job info as follows:

    def _get_detailed_job_info_command(self, job_id):
        fields = ','.join(self._detailed_job_info_fields)
        return f'sacct --format={fields} --parsable --jobs={job_id}'

So the fields are specified by concatenating the _detailed_job_info_fields list items with a comma. I can replicate this with

$ sacct --helpformat | tr -s '\n' ' ' | tr ' ' ','
Account,AdminComment,AllocCPUS,AllocNodes,AllocTRES,AssocID,AveCPU,AveCPUFreq,AveDiskRead,AveDiskWrite,AvePages,AveRSS,AveVMSize,BlockID,Cluster,Comment,Constraints,ConsumedEnergy,ConsumedEnergyRaw,Container,CPUTime,CPUTimeRAW,DBIndex,DerivedExitCode,Elapsed,ElapsedRaw,Eligible,End,ExitCode,Flags,GID,Group,JobID,JobIDRaw,JobName,Layout,MaxDiskRead,MaxDiskReadNode,MaxDiskReadTask,MaxDiskWrite,MaxDiskWriteNode,MaxDiskWriteTask,MaxPages,MaxPagesNode,MaxPagesTask,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,McsLabel,MinCPU,MinCPUNode,MinCPUTask,NCPUS,NNodes,NodeList,NTasks,Partition,Priority,QOS,QOSRAW,Reason,ReqCPUFreq,ReqCPUFreqGov,ReqCPUFreqMax,ReqCPUFreqMin,ReqCPUS,ReqMem,ReqNodes,ReqTRES,Reservation,ReservationId,Reserved,ResvCPU,ResvCPURAW,Start,State,Submit,SubmitLine,Suspended,SystemComment,SystemCPU,Timelimit,TimelimitRaw,TotalCPU,TRESUsageInAve,TRESUsageInMax,TRESUsageInMaxNode,TRESUsageInMaxTask,TRESUsageInMin,TRESUsageInMinNode,TRESUsageInMinTask,TRESUsageInTot,TRESUsageOutAve,TRESUsageOutMax,TRESUsageOutMaxNode,TRESUsageOutMaxTask,TRESUsageOutMin,TRESUsageOutMinNode,TRESUsageOutMinTask,TRESUsageOutTot,UID,User,UserCPU,WCKey,WCKeyID,WorkDir,

So now I am thinking we can change _get_detailed_job_info_command to the following:

    def _get_detailed_job_info_command(self, job_id):
        return f"sacct --format=$(sacct --helpformat | tr -s '\n' ' ' | tr ' ' ',') --parsable --jobs={job_id}"

I just tested this for SLURM 22.05.3 and it seems to work. Maybe you could try updating your SlurmScheduler plugin and running a test calculation? This code should always run, even when the job finishes just fine, so you don’t even need to force it to fail.
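
If you just want to eyeball the command string after editing the method in your local checkout, a quick check like this should be enough (the job id is just illustrative):

    from aiida.schedulers.plugins.slurm import SlurmScheduler

    # Print the sacct command that would be sent to the cluster for a given job id.
    scheduler = SlurmScheduler()
    print(scheduler._get_detailed_job_info_command('5484315'))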

Thanks Sebastiaan, very smart solution!

I think we are almost there. The only issue now is that the parse_output logic still relies on the definition of self._detailed_job_info_fields, which would now match the fields incorrectly. I assume it is easy to just parse the field names directly from the stdout and zip them together with the values.

Do you see any flaw?

For example:

    [...]

    lines = detailed_stdout.splitlines()

    try:
        master = lines[1]
    except IndexError:
        raise ValueError('the `detailed_job_info.stdout` contained less than two lines.')

    fields = lines[0].split('|')
    attributes = master.split('|')

    if len(fields) != len(attributes):
        raise ValueError(
            'the second line in `detailed_job_info.stdout` differs in length from its header line.'
        )

    data = dict(zip(fields, attributes))

    [...]
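
For concreteness, a minimal self-contained check of the zip approach above (the header and values are shortened and purely made up, mimicking the detailed job info pasted below):

    # Shortened, made-up stand-in for the first two lines of detailed_job_info['stdout'].
    sample_stdout = (
        'JobID|State|Timelimit|Elapsed|\n'
        '5484315|TIMEOUT|00:06:00|00:06:21|\n'
    )

    lines = sample_stdout.splitlines()
    fields = lines[0].split('|')
    attributes = lines[1].split('|')

    data = dict(zip(fields, attributes))
    assert data['State'] == 'TIMEOUT'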

Here is the detailed job info, which may be useful for testing.

    "detailed_job_info": {
        "retval": 0,
        "stderr": "",
        "stdout": "Account|AdminComment|AllocCPUS|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|Container|CPUTime|CPUTimeRAW|DBIndex|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Extra|FailedNode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|Licenses|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Partition|Planned|PlannedCPU|PlannedCPURAW|Priority|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqGov|ReqCPUFreqMax|ReqCPUFreqMin|ReqCPUS|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Start|State|Submit|SubmitLine|Suspended|SystemComment|SystemCPU|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|\nhbiblore||192|1|billing=192,cpu=192,mem=362000M,node=1|5388|||||||||ghlrn4||||||20:16:00|72960|15072155|0:0|00:06:20|380|2024-01-31T18:14:27|2024-01-31T18:20:51|0:0|||SchedMain,StartRecieved|31606|hbiblore|5484315|5484315|aiida-3970071||||||||||||||||||||||192|1|gcn2020||standard96|00:00:04|00:06:24|384|100023|normal|1|None|Unknown|Unknown|Unknown|Unknown|96|362000M|1|billing=96,cpu=96,mem=362000M,node=1|||2024-01-31T18:14:31|TIMEOUT|2024-01-31T18:14:27|sbatch _aiidasubmit.sh|00:00:00||18:13.054|00:06:00|6|05:36:12|||||||||||||||||31115|hbiblore|05:17:59|*ingenieurwissenschaften|2265|/scratch-emmy/projects/hbi00059/aiida_fs/93/62/f952-d0a0-48f0-beee-d217efb1f44b|\nhbiblore||192|1|cpu=192,mem=362000M,node=1|5388|00:00:00|2.30M|32.03M|0.16M|9|179420K|1387944K||ghlrn4|||218.84K|218839||20:19:12|73152|15072155||00:06:21|381|2024-01-31T18:14:31|2024-01-31T18:20:52|0:15||||||5484315.batch|5484315.batch|batch|Unknown||32.03M|gcn2020|0|0.16M|gcn2020|0|9|gcn2020|0|179420K|gcn2020|0|1387944K|gcn2020|0||00:00:04|gcn2020|0|192|1|gcn2020|1|||||||||0|0|0|0|192||1||||2024-01-31T18:14:31|CANCELLED|2024-01-31T18:14:31||00:00:00||00:00.145|||00:00.535|cpu=00:00:00,energy=218839,fs/disk=33585254,mem=179420K,pages=9,vmem=1387944K|cpu=00:00:04,energy=218839,fs/disk=33585254,mem=179420K,pages=9,vmem=1387944K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:04,energy=218839,fs/disk=33585254,mem=179420K,pages=9,vmem=1387944K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=218839,fs/disk=33585254,mem=179420K,pages=9,vmem=1387944K|energy=84,fs/disk=171967|energy=936,fs/disk=171967|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=936,fs/disk=171967|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=84,fs/disk=171967|||00:00.389||||\nhbiblore||192|1|billing=192,cpu=192,mem=362000M,node=1|5388|00:00:00|2.30M|0.00M|0.00M|0|3364K|178828K||ghlrn4|||219.38K|219377||20:16:00|72960|15072155||00:06:20|380|2024-01-31T18:14:31|2024-01-31T18:20:51|0:0||||||5484315.extern|5484315.extern|extern|Unknown||0.00M|gcn2020|0|0.00M|gcn2020|0|0|gcn2020|0|3364K|gcn2020|0|178828K|gcn2020|0||00:00:00|gcn2020|0|192|1|gcn2020|1|||||||||0|0|0|0|192||1||||2024-01-3
1T18:14:31|COMPLETED|2024-01-31T18:14:31||00:00:00||00:00:00|||00:00.001|cpu=00:00:00,energy=219377,fs/disk=2012,mem=3364K,pages=0,vmem=178828K|cpu=00:00:00,energy=219377,fs/disk=2012,mem=3364K,pages=0,vmem=178828K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=219377,fs/disk=2012,mem=3364K,pages=0,vmem=178828K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=219377,fs/disk=2012,mem=3364K,pages=0,vmem=178828K|energy=84,fs/disk=1|energy=936,fs/disk=1|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=936,fs/disk=1|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=84,fs/disk=1|||00:00:00||||\nhbiblore||192|1|cpu=192,mem=362000M,node=1|5388|05:33:53|156K|2745.53M|2.33M|916|59603200K|229345388K||ghlrn4|||164.59K|164595||11:21:36|40896|15072155||00:03:33|213|2024-01-31T18:17:18|2024-01-31T18:20:51|0:0||||||5484315.0|5484315.0|pmi_proxy|Cyclic||2745.53M|gcn2020|0|2.33M|gcn2020|0|916|gcn2020|0|59603200K|gcn2020|0|229345388K|gcn2020|0||05:33:53|gcn2020|0|192|1|gcn2020|1|||||||||Unknown|Unknown|Unknown|Unknown|192||1||||2024-01-31T18:17:18|COMPLETED|2024-01-31T18:17:18|/usr/local/slurm/slurm/current/install/bin/srun --nodelist gcn2020 -N 1 -n 1 --input none /sw/comm/impi/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/pmi_proxy --control-port gcn2020.usr.hlrn.de:43167 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 317447439 --usize -2 --proxy-id -1|00:00:00||18:12.907|||05:36:11|cpu=05:33:53,energy=164595,fs/disk=2878896622,mem=59603200K,pages=916,vmem=229345388K|cpu=05:33:53,energy=164595,fs/disk=2878896622,mem=59603200K,pages=916,vmem=229345388K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=05:33:53,energy=164595,fs/disk=2878896622,mem=59603200K,pages=916,vmem=229345388K|cpu=gcn2020,energy=gcn2020,fs/disk=gcn2020,mem=gcn2020,pages=gcn2020,vmem=gcn2020|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=05:33:53,energy=164595,fs/disk=2878896622,mem=59603200K,pages=916,vmem=229345388K|energy=84,fs/disk=2446026|energy=853,fs/disk=2446026|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=853,fs/disk=2446026|energy=gcn2020,fs/disk=gcn2020|fs/disk=0|energy=84,fs/disk=2446026|||05:17:58||||\n"
    },

That makes perfect sense. I made a PR: `SlurmScheduler`: Make detailed job info fields dynamic by sphuber · Pull Request #6270 · aiidateam/aiida-core · GitHub

Could you check that out and give it a shot? There are tests, but only for parse_output, so it won’t actually test the new command string.

Awesome, thanks a lot Sebastiaan. The implementation works perfectly now! The stdout is complete, and the TIMEOUT is caught nicely.

Thanks a lot to both! I added some comments in the PR to make it possibly even easier/more robust.
