Can somebody tell me exactly which functions are called when a CalcJob is processed, from submission to monitoring and retrieval?
What actually happens behind the scenes in plumpy? How are all the different tasks lined up?
It would be perfect to have exact references to code lines and not just an abstract explanation.
Cheers
I will focus on explaining what happens when using aiida.engine.run with a CalcJob. Submitting is mostly analogous, but one also needs to consider the kiwipy + RabbitMQ logic, which makes it more complicated. In principle the same things happen, just with some events being fired so that a worker can pick up the process. There is also the persister, which I ignore since it is not essential for understanding the running procedure.
Process state logic in plumpy
Since CalcJob inherits from plumpy.processes.Process, we need to understand this class. Let us first focus on the process states, which follow this logic:
                  ___
                 |   v
CREATED (x) --- RUNNING (x) --- FINISHED (o)
                 |   ^          /
                 v   |         /
                WAITING (x) --
                 |   ^
                  ---
You can find the state base class in plumpy.base.state_machine.State. The important method for now is only execute. Its implementation for each state can be found in plumpy.process_states.py (the Created, Running, Waiting and Finished classes). Note that the execute method returns another process state. This is a bit obfuscated by the create_state function, but that is simply a wrapper to create a new state instance. The wrapper is needed because you can override the default state constructor (e.g. when specializing it in aiida.engine.processes.calcjobs.Waiting). Notice that the plumpy.process_states.Running class accepts a run_fn argument, which is executed in execute. If the result of run_fn is a Command instance, the type of the Command determines the next state in the _action_command method. For example, if it is a Continue command, it will create a new Running state that executes the continue_fn of the Command instance. This is described in the graphic above by the arrow with which the Running state references itself. For each of the connections in the graphic you find similar corresponding code in plumpy. You can already see that a sequence of states can be produced by just executing one state after another, as long as the callbacks return states with callbacks that return states. But who executes these states?
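To make the chaining of states concrete, here is a minimal toy sketch. It does not import plumpy, and all classes below are hypothetical stand-ins that only borrow the names from the text (Created, Running, Waiting, Finished, Continue, Wait). The real plumpy classes take more arguments and go through create_state. The toy Waiting state also finishes immediately, which is an assumption made to keep the example short. The states are driven by hand with a plain loop.

```python
# Toy model only: these classes mimic the idea, not plumpy's real API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Continue:
    """Toy command: keep running, using continue_fn as the next callback."""
    continue_fn: Callable[[], object]


@dataclass
class Wait:
    """Toy command: move to the waiting state."""
    msg: str = ''


class Finished:
    def __init__(self, result):
        self.result = result

    def execute(self):
        return None  # terminal state, nothing follows


class Waiting:
    def __init__(self, msg):
        self.msg = msg

    def execute(self):
        # Assumption: the toy waiting state finishes right away.
        return Finished(result=f'resumed after: {self.msg}')


class Running:
    def __init__(self, run_fn):
        self.run_fn = run_fn

    def execute(self):
        result = self.run_fn()
        # The type of the returned command decides the next state.
        if isinstance(result, Continue):
            return Running(result.continue_fn)
        if isinstance(result, Wait):
            return Waiting(result.msg)
        return Finished(result)


class Created:
    def __init__(self, run_fn):
        self.run_fn = run_fn

    def execute(self):
        return Running(self.run_fn)


def second_step():
    return 42


def first_step():
    return Continue(second_step)


def trace(state):
    """Execute one state after another and record the class names."""
    names = []
    while state is not None:
        names.append(type(state).__name__)
        state = state.execute()
    return names


visited = trace(Created(first_step))
print(visited)
```

The Continue command is what produces the second Running entry, i.e. the self-referencing arrow in the graphic.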
It is the plumpy.processes.Process that invokes the execute method of each process state. The Process class itself also implements an execute member function, which executes the execute member functions of the process states. The traceback is quite nested but not really complex in logic:
plumpy.processes.Process backtrace for execute()
execute()
  -> step_until_terminated()
    -> step()
      -> _run_task(self._state.execute)  # <--- execute of process state
        -> wrap the provided callback in a coroutine and await it; the callback is the execute() method of a process state
It simply keeps running the execute method of the process states until the Finished state is reached. Furthermore, in the step() function the returned state is used to transition to the next state after executing the current one. You can find the implementation of the transition_to function in the base class of Process, plumpy.base.state_machine.StateMachine.
So what I described is just that this
execute current state -> next_state -> transition_to next_state
          ^                                        |
          ------------------------------------------
is happening over and over again. The order of states depends on the callback result and the logic of the state.
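The loop can be sketched with a few lines of asyncio. This is a toy model, not plumpy code: ToyState and ToyProcess are hypothetical, only the method names (execute, step_until_terminated, step, _run_task, transition_to) are taken from the backtrace above, and using asyncio.run to drive the loop is an assumption made for brevity.

```python
import asyncio
import inspect


class ToyState:
    """Hypothetical state whose execute() returns the next state."""

    def __init__(self, label, next_state=None):
        self.label = label
        self._next_state = next_state

    def execute(self):
        return self._next_state  # None marks the terminal state


class ToyProcess:
    def __init__(self, initial_state):
        self._state = initial_state
        self.history = [initial_state.label]

    async def _run_task(self, callback):
        # Call the callback and await the result if it is awaitable,
        # so sync and async callbacks are handled the same way.
        result = callback()
        if inspect.isawaitable(result):
            result = await result
        return result

    def transition_to(self, new_state):
        self._state = new_state
        self.history.append(new_state.label)

    async def step(self):
        # Execute the current state, then move on to whatever it returned.
        next_state = await self._run_task(self._state.execute)
        if next_state is not None:
            self.transition_to(next_state)
        return next_state

    async def step_until_terminated(self):
        while await self.step() is not None:
            pass

    def execute(self):
        asyncio.run(self.step_until_terminated())
        return self.history


finished = ToyState('finished')
running = ToyState('running', finished)
created = ToyState('created', running)
history = ToyProcess(created).execute()
print(history)
```

The process object never decides which state comes next. It only executes the current state and transitions to whatever that state returned.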
Connecting it back to the CalcJob
So running a process just means creating an initial state and then executing it asynchronously, which returns a new state (the type of state depends on the result), executing that asynchronously, which returns a new state, and so on until the Finished state is reached. This can already be helpful for debugging, as you can just put breakpoints in the execute methods of all process states and check what is happening (e.g. checking the run_fn function in the Running class). For example, you can figure out that for the CalcJob the initial state is the Created state, followed by the Running state with aiida.engine.processes.calcjobs.calcjob.CalcJob.run as the run_fn function. If you check this method, you can see that it leads to the Waiting state. Be aware that the default Waiting class is overridden by the one in aiida.engine. Now it will just re-enter the Waiting state for the different commands (upload, submit, update, stash and retrieve). The actual logic of what happens can be found in the function passed to self._launch_task(<function_with_actual_logic>, ...) in the corresponding if branches that check the command. The _launch_task function just adds the passed function to the current event loop so that it gets executed. In the end it goes back to the Running state with the parse function as run_fn and then finishes.
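The re-entering of the waiting state can be pictured with the following toy sketch. Nothing here is aiida-core code: do_transport_task, waiting_execute and drive are hypothetical helpers, and the strictly linear command order is a simplifying assumption (the real Waiting class decides per command what comes next). Only the command names and the final parse step are taken from the text.

```python
import asyncio

# Command names taken from the text; the linear order is an assumption.
COMMANDS = ['upload', 'submit', 'update', 'stash', 'retrieve']


async def do_transport_task(command, log):
    # Placeholder for the function with the actual logic.
    await asyncio.sleep(0)
    log.append(command)


async def waiting_execute(command, log):
    """One pass through the toy waiting state for a single command."""
    # Schedule the work on the running event loop and wait for it.
    loop = asyncio.get_running_loop()
    await loop.create_task(do_transport_task(command, log))
    index = COMMANDS.index(command)
    if index + 1 < len(COMMANDS):
        return ('waiting', COMMANDS[index + 1])  # re-enter waiting
    return ('running', 'parse')  # back to running, parse as run_fn


async def drive():
    log = []
    state = ('waiting', COMMANDS[0])
    while state[0] == 'waiting':
        state = await waiting_execute(state[1], log)
    log.append(state[1])  # the final running step
    return log


steps = asyncio.run(drive())
print(steps)
```

Each pass handles exactly one command and returns the next state, so the same waiting logic is entered once per command before control goes back to running.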
To make this complete: when using aiida.engine.run to run a CalcJob, the Process.execute method is invoked in aiida.engine.runners.Runner._run, which starts the whole process described above. That's it, easy right?