Show stored data in provenance graph

Hi, I am working on implementing a CalcJob for a plugin and I was trying to do the following.
My code (as in the AiiDA Code data type) can take arguments on the command line or in a config file. So when I create the submission script it will run
code --structure "struc.cif" --other-parameter "parameter" --etc "etc"
or it can also be used as
code --config "config.yaml"
When I create the CalcJob I want to support both, but I also want to save some of the info inside the config file for the provenance graph.
So basically what I did is go through the file, convert each parameter to an AiiDA data type, and call parameter.store(). But these nodes will not appear in the provenance graph, since the actual input is just the config file (which I pass as a SinglefileData), so the parameters I stored are kind of lost. I also can't save them as inputs after converting them, because the inputs dictionary cannot be modified (which makes sense: anything I do after passing the inputs is no longer an input).
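
Roughly, what I am doing now looks like this (a simplified sketch of the actual code):

import yaml

from aiida import orm

with open('config.yaml') as handle:
    data = yaml.safe_load(handle)

for key, value in data.items():
    # Convert each config entry to the corresponding AiiDA data type ...
    node = orm.to_aiida_type(value)
    # ... and store it, but nothing links it to the calculation.
    node.store()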

Is there a better way to deal with this than what I am trying to do? Or just a way to show the new data I stored as somewhat connected to the calculation I’m running?

Let me see if I understand the situation correctly. Your CalcJob currently accepts a single input, which is a SinglefileData containing a YAML file with the input parameters for the code the plugin is wrapping. So in the provenance graph you see a single input (forgetting about the Code node input for simplicity now). But you want to “unwrap” the parameter dictionary that is in the YAML into separate input nodes? I.e., assuming the YAML contains this:

parameter_1: 'some_string'
parameter_2: 15

you would like to actually have two additional input nodes, a Str('some_string') and an Int(15), attached to the calculation?

If that is indeed your intention, the question is, why do you want to do that?

Basically yes. I want to do this just because I think it looks nicer.
When I run the calculation with the config, the graph / verdi node show output looks like
Inputs: Config → CalcJob → Outputs
But if I pass the inputs one by one (which is also an option) it will be
Inputs: Input1, Input2, … → CalcJob → Outputs
and I like that better because you can immediately see important info (like what structure or what MLIP model you are using; MLIPs are what the plugin is for). So when I pass the config file I like the idea of being able to see something like
Inputs: Input1, Input2, … ---- Config → CalcJob → Outputs
I also understand that maybe this is redundant, since the information is already in the config file anyway; it’s more for better visualisation than for any practical purpose.
But if this is outside the scope of AiiDA I’ll just leave it like this.

I see. Well, there is no way to create the input nodes inside CalcJob.prepare_for_submission after the process has been created. One alternative you could consider is the following:


from pathlib import Path
from aiida import engine, orm

class SomeCalcJob(engine.CalcJob):

    @classmethod
    def get_builder_from_yaml(cls, filepath: Path):
        import yaml

        with filepath.open() as handle:
            data = yaml.safe_load(handle)

        builder = cls.get_builder()
        # The serializers defined on the spec convert the plain Python values to nodes.
        builder.parameter_1 = data['parameter_1']
        builder.parameter_2 = data['parameter_2']

        return builder


    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input('parameter_1', valid_type=orm.Str, serializer=orm.to_aiida_type)
        spec.input('parameter_2', valid_type=orm.Int, serializer=orm.to_aiida_type)


builder = SomeCalcJob.get_builder_from_yaml(Path('/path/to/yaml'))
results = engine.run(builder)

So essentially you implement the get_builder_from_yaml classmethod that parses the YAML file and converts its contents to input nodes. The SomeCalcJob then only accepts explicit nodes and not a YAML file. But this approach would force you to declare all the potential arguments that the YAML could accept, which could be a lot of work.

An alternative, that I think I myself would go for, is to simply define a single input on the plugin called parameters and have it accept a Dict.


import yaml
from pathlib import Path

from aiida import engine, orm

class SomeCalcJob(engine.CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input('parameters', valid_type=orm.Dict, serializer=orm.to_aiida_type)

with Path('/path/to/yaml').open() as handle:
    parameters = yaml.safe_load(handle)

builder = SomeCalcJob.get_builder()
builder.parameters = parameters
results = engine.run(builder)

This is the simplest solution, and you don’t risk forgetting to declare an input that the code actually supports, which would prevent the user from passing it. The advantage of the Dict over the SinglefileData node is that it is now very easy to show the actual parameters that were used and to query for them. So it would allow you to do

QueryBuilder().append(SomeCalcJob, tag='c').append(Dict, with_outgoing='c', filters={'attributes.parameter_1': 'some_string'}).all(flat=True)
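
For completeness: with the Dict approach, your prepare_for_submission would then write the parameters back to the YAML file that the wrapped code expects. A minimal sketch (assuming the code reads --config config.yaml; adapt the filename to your plugin):

import yaml

from aiida import engine
from aiida.common import datastructures

class SomeCalcJob(engine.CalcJob):

    # define() with the 'parameters' input as above ...

    def prepare_for_submission(self, folder):
        # Dump the stored Dict back to the YAML config the wrapped code reads.
        with folder.open('config.yaml', 'w') as handle:
            yaml.safe_dump(self.inputs.parameters.get_dict(), handle)

        codeinfo = datastructures.CodeInfo()
        codeinfo.code_uuid = self.inputs.code.uuid
        codeinfo.cmdline_params = ['--config', 'config.yaml']

        calcinfo = datastructures.CalcInfo()
        calcinfo.codes_info = [codeinfo]
        calcinfo.retrieve_list = []

        return calcinfo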

Great, thank you! Yeah, I was considering the get_builder approach, but as you said it can get complicated. I’ll probably just go with the dictionary or something like that.

How about using a dynamic input namespace?

import yaml
from pathlib import Path

from aiida import engine, orm

class SomeCalcJob(engine.CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input_namespace('config', dynamic=True)

    @classmethod
    def get_builder_from_yaml(cls, filepath: Path):

        with filepath.open() as handle:
            data = yaml.safe_load(handle)

        builder = cls.get_builder()

        for parameter, value in data.items():
            builder['config'][parameter] = orm.to_aiida_type(value)
        
        return builder

This would allow you to have the content of the YAML file as separate inputs in the provenance. I made a full example below:

import yaml
from pathlib import Path

from aiida import orm, engine, load_profile
from aiida.common import datastructures, NotExistent

load_profile()

def to_python_type(node):
    """Convert a basic AiiDA data node back to its Python equivalent."""
    if isinstance(node, (orm.Str, orm.Int, orm.Float)):
        return node.value
    elif isinstance(node, orm.List):
        return node.get_list()
    elif isinstance(node, orm.Dict):
        return node.get_dict()
    else:
        raise TypeError(f'Unsupported type: {type(node)}')


class SomeCalcJob(engine.CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input_namespace('config', dynamic=True)

    @classmethod
    def get_builder_from_yaml(cls, filepath: Path):

        with filepath.open() as handle:
            data = yaml.safe_load(handle)

        builder = cls.get_builder()

        for parameter, value in data.items():
            builder['config'][parameter] = orm.to_aiida_type(value)
        
        return builder

    def prepare_for_submission(self, folder):

        codeinfo = datastructures.CodeInfo()
        codeinfo.code_uuid = self.inputs.code.uuid

        cmdline_params = []
        for parameter, value in self.inputs.config.items():
            cmdline_params.extend([f'--{parameter}', to_python_type(value)])
        codeinfo.cmdline_params = cmdline_params

        calcinfo = datastructures.CalcInfo()
        calcinfo.codes_info = [codeinfo]
        calcinfo.retrieve_list = []

        return calcinfo

try:
    code = orm.load_code('code@localhost')
except NotExistent:
    code = orm.InstalledCode(
        computer=orm.load_computer('localhost'),
        filepath_executable='/bin/echo',
        label='code'
    )
    code.store()

builder = SomeCalcJob.get_builder_from_yaml(Path('config.yaml'))
builder.code = code
builder.metadata.options.resources = {'num_machines': 1, 'num_mpiprocs_per_machine': 1}

results, node = engine.run_get_node(builder)

print(node.base.repository.get_object_content('_aiidasubmit.sh'))

It only assumes you’ve set up the localhost computer and that there is a config.yaml file in the current directory with e.g. the following contents:

a: 1
b: str
c:
  - 1
  - 2
  - 3
d:
  e: 1
  f: 2
  g: 3
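
(In case the localhost computer is not configured yet, a minimal setup sketch, assuming the local transport and direct scheduler, could be:)

from aiida import orm

computer = orm.Computer(
    label='localhost',
    hostname='localhost',
    transport_type='core.local',
    scheduler_type='core.direct',
    workdir='/tmp/aiida_run',  # hypothetical scratch directory
).store()
computer.configure()  # default settings for the local transport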

The provenance graph will then show each entry of the YAML file as a separate input node of the calculation.

Two notes:

  1. The INPUT_CALC link labels will be config__a (double underscore!) etc. This might be useful to know when building queries based on edge_filters (see the query sketch after these notes).

  2. Note that my automatic conversion with to_python_type probably doesn’t create the most sensible _aiidasubmit.sh file:

    #!/bin/bash
    exec > _scheduler-stdout.txt
    exec 2> _scheduler-stderr.txt
    
    
    '/bin/echo' '--a' '1' '--b' 'str' '--c' '[1, 2, 3]' '--d' '{'"'"'e'"'"': 1, '"'"'f'"'"': 2, '"'"'g'"'"': 3}'
    

    But I mostly wanted to show the effect for all basic Python types.
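
To illustrate the first note, a query filtering on the namespaced link label could look like this (a sketch, assuming the example above has been run):

from aiida.orm import Int, QueryBuilder

qb = QueryBuilder()
qb.append(SomeCalcJob, tag='calc')
qb.append(
    Int,
    with_outgoing='calc',
    edge_filters={'label': 'config__a'},  # note the double underscore
)
print(qb.all(flat=True))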