A walk through calcjob, from beginning to the end?

ali-khosravi · October 16, 2024, 10:06am

Can somebody tell me what set of functions are being called exactly when a calcjob is being processed? from submission to monitor and retrieve?
what actually happens behind the scenes in plumpy, How are all sorts of tasks being lined up?
It would be perfect to have exact reference to code lines and not just abstract explanation.
Cheers

agoscinski · October 18, 2024, 4:31pm

I will focus on explaining what is happening when using aiida.engine.run with a CalcJob. For submitting it is mostly analogous but one needs to also consider kiwipy+rabbitmq logic that makes it more complicated. In principle the same things happen just with some events being fired so a worker can pick up the process. There is also the persister which I ignore since it is not so essential understanding the running procedure.

Process state logic in plumpy

Since CalcJob is inheriting from plumpy.processes.Process we need to understand this class. Let us first focus on the process states that follow this logic:

                  ___
                 |   v
CREATED (x) --- RUNNING (x) --- FINISHED (o)
                 |   ^          /
                 v   |         /
                WAITING (x) --
                 |   ^
                  ---

You can find the state base class in plumpy.base.state_machine.State. The important method for now is only execute. Its implementation for each state can be found in plumpy.process_states.py (the Create, Runnig, Waiting and Finished classes). Note that the execute method returns another process state. It is a bit obfuscated by this create_state function but it is simply a wrapper to create a new state instance. This wrapper is needed because you can overwrite the default state constructor (e.g when specializing it in the aiida.engine.processes.calcjobs.Waiting). Notice that the plumpy.process_states.Running class accepts a run_fn argument which is executed in execute. If the result of the run_fn is a Command instance, the type of Command determines in the _action_command method the next state. For example if it is a Continue command, it will create a new Running state that executes continue_fn of the Command instance. This is described in the graphic above by the arrow that self references the Running state. For each of the connections in the graphic you find similar corresponding code in plumpy. You can already see that a sequence of states can be produced by just executing one state after the another, as long as the callback is returning states with callbacks that return states, but who executes these states?
It is the plumpy.processes.Process that invokes the execute methods from each process state. The Process class itself implements also a execute member function that does executes the execute member functions of the process states. The traceback is quite nested but not really complex in logic:

plumpy.processes.Process backtrace for execute()
execute()
-> step_until_terminated()
-> step()
-> _run_task(self._state.execute) # <--- execute of process state
-> wrap the provided callback to coroutine and await on it, the callback is the execute() method from a process state

It simply just continues running the execute method of the process states until the Finished state is reached. Furthermore in the step() function the returned state is used to transition to the next state after executing a process. You can find the implementation of the transition_to function in the Process base class the plumpy.base.state_machine.StateMachine.

So what I described is just that this

execute current state -> next_state -> transition_to next_state 
  ^                                           |
   --------------------------------------------

is happening all over again. And the order of states depends on the callback result and the logic of the state.

Connecting it back to the CalcJob

So running a process just means creating an initial state and then execute it asynchronously, return a new state (the type of state depends on the result), execute it asynchronously, return a new state, execute it and so on till the Finished state is reached. This can already be helpful for debugging as you can just put breakpoints to the execute methods of all process states and check what is happening (e.g. checking the run_fn function in the Running class). For example, you can figure out that for the CalcJob the initial state is the Created state, then the Running state with aiida.engine.processes.calcjobs.calcjob.CalcJob.run as run_fn function. If you check this method, you can see that it returns the Waiting state. Be aware that the default Waiting class is overridden by the one in aiida.engine. Now it will just reenter the waiting state for the different commands (upload, submit, update, stash and retrieve). The actual logic of what happens can be found in the function passed to self._launch_task(<function_with_actual_logic>, ...) in the corresponding if cases that check the command. The _lauch_task just adds the passed function to the current event loop so it gets executed. In the end it goes back to running state with the parse function as run_fn and then finishes.

To make this complete, when using aiida.engine.run to run a CalcJob, the Process.execute method is invoked in aiida.engine.runners.Runner._run which starts this whole described process. Thats it, easy right?

Topic		Replies	Views
How to debug `engine.submit` processes? General Usage	2	24	October 4, 2024
Obtain a `CalcJob` instance from a corresponding `CalcJobNode` or builder General Usage	3	86	February 27, 2024
Excepted workchains, due to strange error from kiwipy/plumpy? General Usage	33	367	September 18, 2024
Processes get expected due to memory error from plumpy General Usage	2	31	June 17, 2025
Handling SLURM Job Failures in AiiDA: Detection and Automatic Restart of Stalled Calculations General Usage aiida	2	54	August 9, 2024

A walk through calcjob, from beginning to the end?

Process state logic in plumpy

Connecting it back to the CalcJob

Related topics