@ali-khosravi I appreciate the feedback. I'm working on getting approval to post this in an external git repository, which shouldn't take too long. In the meantime I traced the output by printing what I think are the key variables. This is what some of that looks like:
```
job=['f4Ccob2aKLnb', 'R', '{"user": {"uri": "ssh://tioga21/var/tmp/keilbart/flux-jogBe3/local-0"}}', 'keilbart', '1', '64', 'tioga21', 'pdebug', '1800.0', '11.969725608825684', '1744739449.578227', 'aiida-2492', '1744739449.5373938']
thisjob_dict={'job_id': 'f4Ccob2aKLnb', 'state_raw': 'R', 'annotation': '{"user": {"uri": "ssh://tioga21/var/tmp/keilbart/flux-jogBe3/local-0"}}', 'username': 'keilbart', 'number_nodes': '1', 'number_cpus': '64', 'allocated_machines': 'tioga21', 'partition': 'pdebug', 'time_limit': '1800.0', 'time_used': '11.969725608825684', 'dispatch_time': '1744739449.578227', 'job_name': 'aiida-2492', 'submission_time': '1744739449.5373938'}
job_state_string=<JobState.RUNNING: 'running'>
job_list=[JobInfo({'job_id': 'f4Ccob2aKLnb', 'annotation': '{"user": {"uri": "ssh://tioga21/var/tmp/keilbart/flux-jogBe3/local-0"}}', 'job_state': <JobState.RUNNING: 'running'>, 'job_owner': 'keilbart', 'num_machines': 1, 'num_mpiprocs': 64, 'allocated_machines_raw': 'tioga21', 'queue_name': 'pdebug', 'requested_wallclock_time_seconds': '1800.0', 'wallclock_time_seconds': '11.969725608825684', 'dispatch_time': '1744739449.578227', 'submission_time': '1744739449.5373938'})]
```
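For reference, the mapping above from the raw Flux job row to `thisjob_dict` can be sketched as a simple zip over field names. This is my own illustration, not the plugin's actual code, and the field ordering is my assumption from matching the two printouts:

```python
import json

# Field names in the order they appear in the raw `job` row printed above.
# This ordering is an assumption inferred from the two printouts.
JOB_FIELDS = [
    'job_id', 'state_raw', 'annotation', 'username', 'number_nodes',
    'number_cpus', 'allocated_machines', 'partition', 'time_limit',
    'time_used', 'dispatch_time', 'job_name', 'submission_time',
]

def job_row_to_dict(job_row):
    """Zip a raw scheduler output row into a field dict (illustrative)."""
    return dict(zip(JOB_FIELDS, job_row))

# The raw row exactly as printed above.
job = [
    'f4Ccob2aKLnb', 'R',
    '{"user": {"uri": "ssh://tioga21/var/tmp/keilbart/flux-jogBe3/local-0"}}',
    'keilbart', '1', '64', 'tioga21', 'pdebug', '1800.0',
    '11.969725608825684', '1744739449.578227', 'aiida-2492',
    '1744739449.5373938',
]

thisjob_dict = job_row_to_dict(job)
# The annotation field is itself a JSON string holding the Flux URI.
uri = json.loads(thisjob_dict['annotation'])['user']['uri']
```

So the parsing side looks consistent to me; every field lands where it should.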
```
Warning: key 'symmetries' is not present in raw output dictionary
Error: ERROR_OUTPUT_STDOUT_INCOMPLETE
Error: Both the stdout and XML output files could not be read or parsed.
Warning: output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed.
```
From what I understand, the last time get_jobs() is run in scheduler.py, it finds my submitted job in a running state. As far as I can tell it is only called once, which suggests the job_state is being changed somewhere else. I was trying to work my way through the code, but it was taking a while and I thought someone might have other ideas.
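To rule out the state translation itself: the raw 'R' flag clearly maps to RUNNING in the printout above. A minimal sketch of that mapping (my own illustration, not the plugin's actual table) would be:

```python
from enum import Enum

class JobState(Enum):
    # Subset of states relevant here; names mirror the printed JobState values.
    QUEUED = 'queued'
    RUNNING = 'running'
    DONE = 'done'

# Hypothetical raw-state translation table, for illustration only.
_STATE_MAP = {'R': JobState.RUNNING, 'S': JobState.QUEUED, 'CD': JobState.DONE}

def convert_state(state_raw):
    """Translate a raw scheduler state flag into a JobState member."""
    return _STATE_MAP[state_raw]
```

Since `job_state_string=<JobState.RUNNING: 'running'>` in the printout, this step seems to be working correctly in my case.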
If I run this through another cluster running Slurm, I don't have any issues. With AiiDA's logging set to debug on that cluster, I can see it check the job and then schedule another request to update the CalcJob. That isn't happening with the Flux server I'm using: it seems to go straight to submitting a retrieve task right after checking the job, even though it does receive information from the get_jobs() function and has the job stored in the jobs_cache as well.
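For what it's worth, my understanding of the polling logic (a simplified sketch of the idea, not AiiDA's actual code) is that a job is only treated as finished once it stops appearing in the get_jobs() output, which is why I'd expect another poll before retrieval:

```python
def job_is_done(jobs_cache, job_id):
    """Simplified sketch: a job absent from the scheduler's
    get_jobs() output is assumed to have terminated."""
    return job_id not in jobs_cache

# My Flux job is still present and RUNNING in the cache ...
jobs_cache = {'f4Ccob2aKLnb': {'job_state': 'running'}}
assert not job_is_done(jobs_cache, 'f4Ccob2aKLnb')

# ... so retrieval should only start once a later poll no longer lists it.
assert job_is_done({}, 'f4Ccob2aKLnb')
```

Under that logic the retrieve task should not have started, since the job was still in the cache.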
```
Debug: Transport request closing transport for AuthInfo for keilbart1@llnl.gov on tioga
Info: updating CalcJob<2555> successful
Debug: Adding projection of node_1: ['']
Debug: projections have become: [{'': {}}]
Debug: edge_tag chosen: main--node_1
Debug: Adding projection of main--node_1: ['type', 'label']
Debug: projections have become: [{'type': {}}, {'label': {}}]
Debug: projections data: {'main': [], 'node_1': [{'': {}}], 'main--node_1': [{'type': {}}, {'label': {}}]}
Debug: projection for main: []
Debug: projection for node_1: [{'': {}}]
Debug: Checking projections for edges: This is edge main--node_1 from node_1, with_outgoing of main
Debug: projection for main--node_1: [{'type': {}}, {'label': {}}]
Info: Process<2555>: Broadcasting state change: state_changed.waiting.waiting
```
This is a snippet of the debug output I'm getting. Between updating the CalcJob and broadcasting the state change, I can't see anything that would cause the job state to be updated. On the other cluster, it only updates once it sees there are no jobs left, which makes sense to me.
Any other thoughts would be appreciated. Otherwise, I'll check back in once I have this in an external git repository.