Trying to understand AiiDA's restart from checkpoints capability

General description of the problem

I am trying to understand AiiDA’s restart-from-checkpoint functionality. From the aiida-core Read the Docs pages it is hard for me to see how this works in practice, so I am trying to create a minimal example. Either I cannot get it to work properly or I am misunderstanding something.

Steps to reproduce

The minimal example looks like this

from aiida.engine import ToContext, WorkChain, calcfunction
from aiida.orm import AbstractCode, Int
from aiida.plugins.factories import CalculationFactory
from aiida.manage.configuration import get_config
from aiida.engine import submit, run
from aiida import orm, load_profile
import time

ArithmeticAddCalculation = CalculationFactory('core.arithmetic.add')

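class AddThreeNumbersWorkChain(WorkChain):
    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input('x', valid_type=Int)
        spec.input('y', valid_type=Int)
        spec.input('z', valid_type=Int)
        spec.input('code', valid_type=AbstractCode)
        spec.output('result', valid_type=Int)
        # Steps run in this order; each one is a method defined below.
        spec.outline(cls.add_xy, cls.add_xyz, cls.result)
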
    def add_xy(self):
        print("Run add_xy")
        inputs = {'x': self.inputs.x, 'y': self.inputs.y,
                  'code': self.inputs.code}
        add_xy_job_node = self.submit(ArithmeticAddCalculation, **inputs) # calc job node
        return ToContext(add_xy_job_node=add_xy_job_node)

    def add_xyz(self):
        print("Run add_xyz")
        # raise ValueError("Some bug")
        inputs = {'x': self.ctx.add_xy_job_node.outputs.sum,
                  'y': self.inputs.z, 'code': self.inputs.code}
        add_xyz_job_node = self.submit(ArithmeticAddCalculation, **inputs)
        return ToContext(add_xyz_job_node=add_xyz_job_node)

    def result(self):
        self.out('result', self.ctx.add_xyz_job_node.outputs.sum)

and the file I use to run

from aiida.engine import ToContext, WorkChain, calcfunction
from aiida.orm import AbstractCode, Int
from aiida.plugins.factories import CalculationFactory
from aiida.manage.configuration import get_config
from aiida.engine import submit, run
from aiida import orm, load_profile

from workchain_minimal import AddThreeNumbersWorkChain

builder = AddThreeNumbersWorkChain.get_builder()
builder.code = orm.load_code(label='add')
builder.x = orm.Int(2)
builder.y = orm.Int(3)
builder.z = orm.Int(5)

result = run(builder)

Now, in the add_xyz function, I uncommented the line raising the ValueError and ran the script with verdi run. Then, following the instructions from the Quantum ESPRESSO tutorial, I restart the process with this script:

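from aiida.engine import run
from aiida.orm import load_node

failed_calculation = load_node(<PK_OF_FAILED_PROCESS>)
# get_builder_restart() returns a new builder pre-filled with the inputs of the old process
restart_builder = failed_calculation.get_builder_restart()
calcjob_node = run(restart_builder)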


From the prints I see that add_xy is executed again, which I assumed would be skipped since it worked the first time. If someone could clarify my misunderstanding of AiiDA's restart-from-checkpoint functionality, or explain what I am doing wrong in my script, I would be very grateful.


 ✔ version:     AiiDA v2.5.1.post0
 ✔ config:      /home/alexgo/code/aiida-core/.aiida
 ✔ profile:     alexgo
 ✔ storage:     SqliteDosStorage[/home/alexgo/code/aiida-core/.aiida/repository/sqlite_dos_f275ff0f10174e8e84c13e82dd5ba452]: open,
 ✔ broker:      RabbitMQ v3.8.14 @ amqp://guest:guest@
 ✔ daemon:      Daemon is running with PID 608674

Hi @agoscinski

To my understanding, the documentation about the checkpoints that you linked refers to the internal handling of the processes. Every time a process changes its state, a representation of the process (in memory) is stored as a node in the database. As stated there, those checkpoints allow AiiDA to continue a process after a reboot of your machine or similar. This being said, those checkpoints are more related to the continuation of an “active” process. Others can probably provide deeper insights into this, in case it’s relevant for you.

If I understand your example correctly, you expected the first CalcJob to be skipped, since a successful run already exists in the database. Please correct me if I’m wrong. If my understanding is correct, you should look at the concept of caching (see "How to run external codes" in the AiiDA 2.5.1.post0 documentation), as this seems to be what you are looking for rather than the checkpoints mentioned above.
Once you enable caching, AiiDA will reuse the results of a previous CalcJob, but only if you use the exact same inputs. You can find further details about how a valid caching source is determined in "Caching and hashing" in the AiiDA 2.5.1.post0 documentation.
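
To illustrate the "exact same inputs" condition, here is a toy sketch in plain Python. It is not AiiDA's hashing code; it only assumes that a calculation can be identified by a hash over its process type and its inputs. The function name `cache_key` and the label "arithmetic.add" are hypothetical.

```python
import hashlib
import json


def cache_key(process_type, inputs):
    """Return a deterministic hash of a process type and its inputs (toy model)."""
    payload = json.dumps({"type": process_type, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


# Identical inputs give the same key, regardless of the order they are written in.
same_a = cache_key("arithmetic.add", {"x": 2, "y": 3})
same_b = cache_key("arithmetic.add", {"y": 3, "x": 2})

# Changing any input gives a different key, so no earlier result would match.
different = cache_key("arithmetic.add", {"x": 2, "y": 4})
```

In this picture, a previous calculation can only serve as a cache source when its key matches the key of the new one.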

Hi @agoscinski. There seems to be a very understandable confusion about what restarting means in AiiDA. There are a couple of concepts that all relate to restarting processes. @t-reents hit the nail on the head: the main difference here is between restarting an active process and restarting a terminated one.

The section in the docs that you linked refers to how AiiDA can restart (or, better put, continue running) an active process. This only really applies to processes that are run by the daemon, when the daemon worker running them gets killed (either the system process is killed or the entire computer is shut down). Restarting in this context means that as soon as the daemon is restarted, it notices that there is an active process, reconstructs the Process instance in memory from the checkpoint stored in the database, and continues where it left off.

In your example, the workchain that raises an exception will actually terminate. Its final state will be excepted. At this point it is no longer an active process. When you call run again, you are creating a new instance, and so it starts from the beginning of the workchain. Notice also how a separate WorkChainNode is generated in the database.
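
The difference between continuing from a checkpoint and launching again can be sketched with a toy model in plain Python. This is not how AiiDA implements processes; `ToyProcess` and its methods are hypothetical names, and the checkpoint here is just the index of the next step.

```python
class ToyProcess:
    """Toy process that runs named steps in order and records which ones ran."""

    STEPS = ("add_xy", "add_xyz", "result")

    def __init__(self, next_step=0):
        self.next_step = next_step  # index of the next step to run
        self.log = []               # names of the steps this instance executed

    def checkpoint(self):
        """Return the minimal state needed to continue later."""
        return {"next_step": self.next_step}

    @classmethod
    def from_checkpoint(cls, checkpoint):
        """Rebuild an instance that continues where the checkpoint left off."""
        return cls(next_step=checkpoint["next_step"])

    def run_one_step(self):
        self.log.append(self.STEPS[self.next_step])
        self.next_step += 1


# An active process is interrupted after its first step (e.g. the worker is killed).
active = ToyProcess()
active.run_one_step()
saved = active.checkpoint()

# Continuing from the checkpoint resumes at the second step.
continued = ToyProcess.from_checkpoint(saved)
continued.run_one_step()

# Launching again creates a fresh instance that starts from the first step.
relaunched = ToyProcess()
relaunched.run_one_step()
```

Here `continued.log` contains only the second step, while `relaunched.log` starts again with `add_xy`, which mirrors what the prints in the original example show.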

Restarting an active process can in principle be done by the user, but it is not intended to be. This is really reserved for the daemon.

Finally, another concept related to restarting is the caching mechanism. As mentioned by @t-reents, this feature (when enabled) will “skip” steps in a workchain if they have already been completed with identical inputs before. This is useful in cases where a workchain completes a number of steps successfully, but then encounters an exception. The workchain is now terminated and the exact same process cannot be restarted. However, when you launch it again, creating a new process instance, the caching mechanism allows it to skip the steps that were completed successfully. This makes for an efficient “restart”, but it is crucial to understand that this is a new iteration of the workchain and not a restart of the old failed one.
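
As a rough sketch of that behaviour, the toy example below relaunches a two-step sum after a failure and reuses the step that already succeeded. It is plain Python under simplifying assumptions, not AiiDA code: the cache is a dictionary keyed on the inputs, and `cached_add`, `add_three_numbers` and `fail_in_second_step` are hypothetical names.

```python
cache = {}     # (x, y) -> result of a previously completed addition
executed = []  # inputs of the additions that actually ran (cache misses)


def cached_add(x, y):
    """Add two numbers, reusing a stored result when the inputs are identical."""
    key = (x, y)
    if key not in cache:
        executed.append(key)
        cache[key] = x + y
    return cache[key]


def add_three_numbers(x, y, z, fail_in_second_step=False):
    """Toy counterpart of the workchain: (x + y) first, then + z."""
    xy = cached_add(x, y)
    if fail_in_second_step:
        raise ValueError("Some bug")
    return cached_add(xy, z)


# First launch: the first step completes, then the workflow fails and terminates.
try:
    add_three_numbers(2, 3, 5, fail_in_second_step=True)
except ValueError:
    pass

# Second launch: a new run from the top, but the 2 + 3 step is served from the cache.
total = add_three_numbers(2, 3, 5)
```

After both launches, `executed` lists each addition only once, even though the first step was requested twice. In AiiDA itself caching is off by default and has to be enabled, for example through the `caching.default_enabled` configuration option.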
