Automatically storing info about Python environment in nodes for calcfunctions

When I design my calcfunctions I try as much as possible to follow the guidelines in the documentation about reproducibility: Process functions — AiiDA 2.6.1 documentation

However, I often have to import functionality from external modules such as numpy, scipy or sisl. Unlike aiida-core and its plugins, the versions of these external Python packages are currently not recorded by AiiDA.

I have been trying to think of ways to record this information in my calcfunction nodes. There are ways of listing the installed packages in your environment from a script, for example by using pkg_resources (https://www.w3resource.com/python-exercises/basic/python-basic-1-exercise-9.php). This seems to work well for environments created with virtualenv, but I don't know how best to store this information.

One option would be to store it in node.base.extras after the node has been created, similar to how the version attribute stores the versions of aiida-core and plugins. However, I'm not sure whether this would put too much data in your PostgreSQL database if you record this for every CalcFunctionNode and your virtual environment contains a lot of packages. Another idea would be to store the list of packages and versions in the repository and relate it somehow to the CalcFunctionNode, but then I don't know how to ensure that this information also gets exported when you export the CalcFunctionNode. In addition, most of these approaches still require you to manually attach the information to the node after it has been created, instead of it being recorded automatically.

Have other people thought about this and maybe have good solutions?

Hi @ahkole , very good question. This information is indeed not stored, which does reduce reproducibility. Note that even though we store the versions of aiida-core and plugin packages, the plugins themselves can also use such external Python packages, and those are also not recorded. For example, if aiida-quantumespresso uses scipy in a Parser or CalcJob plugin, AiiDA will record the version of aiida-quantumespresso, but if that package does not pin an exact version of scipy (which is almost never done in practice, for good reasons) you still won't know exactly which version of scipy was used.

This is not an easy problem to solve well, which is why, way back when, we decided not to store the entire Python environment in the provenance. That is just some historical context; that being said, it would still be interesting to see if we could add this as optional functionality.

This would indeed be an option. The main caveat is that it could slow down queries if the content you are storing is significant. If you store a full dump of all packages in the environment, that may quickly grow out of hand, since a typical AiiDA-based Python environment contains quite a number of packages. Of course, you could decide to store just a selection of the packages that you think are most relevant.

Besides that, there is always the risk of losing the information, since extras are mutable and do not have the immutability guarantees of attributes. So it wouldn't be the safest solution, but if you are handling your own data and make sure you don't overwrite the extras, it could still be OK.

This would be an interesting alternative, as it would solve the problem of immutability (the repository is immutable once the node is stored, just like the attributes) and you avoid overloading the database. The repository contents are automatically attached to the node just like the attributes, so when exporting/importing that information is perfectly preserved. The real problem, though, is that once the node is stored the repository is immutable, so from the calcfunction's body it is impossible to add information to it.

Even if it were not immutable, with the current public API it is not really possible, because to add to the repository or extras you would need access to the CalcFunctionNode instance inside the function body:

@calcfunction
def perfect_provenance():
    import io
    # how do I get the `CalcFunctionNode` here to be able to do:
    node.base.repository.put_object_from_filelike(
        io.BytesIO(b'scipy==1.2.0,....'),
        'package_list',
    )

There is a hack but it is not part of the public API and I am not sure if it is safe, so I wouldn’t recommend it, but I put it here for informational purposes:

@calcfunction
def perfect_provenance():
    from aiida.engine import Process
    # Retrieve the current process on the stack, which should be this function
    process = Process.current()
    # From that, we can access the associated node
    node = process.node
    # And now we can update the extras
    node.base.extras.set('package', 'scipy==1.2.0')

If we think this is a valuable feature, it would seem to require some functionality in aiida-core. Maybe we could add a metadata input for calcfunctions that, when enabled, would automatically record the environment information on the node. This way it would automatically be available for all calcfunctions and the user wouldn't have to implement it each time. E.g.

@calcfunction
def add(a: int, b: int):
    return a + b

results, node = add.run_get_node(1, 2, metadata={'store_environment_information': True})
print(node.get_environment_info())

and that would then print the env info representation.

It would be an interesting feature request, but I cannot guarantee that it would be implemented. We would have to think hard about how to make this robust such that it works on all platforms and all types of environments, and that we store the right amount of information in the right way.

Hi @sphuber, thank you for the very elaborate reply! I understand that it is a difficult problem to solve. Nevertheless, I still think it would be a valuable feature to have in some form. If I wanted to submit a feature request for this, would I have to do this on the GitHub of aiida-core?

For the time being I think the best option for me would be to manage the information about the actual packages that I import (which is usually a small list) in the extras of the node. That would not give perfect provenance, but I feel it’s already a lot better than storing no information about the packages imported by the function. And it does not fill up the database with too much information. I also feel more comfortable trusting myself to manage the extras properly than to use a (potentially) unsafe hack which you recommended against.
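As a sketch of what I have in mind (the package names below are just placeholders for whatever a given calcfunction actually imports), collecting only the versions of selected packages with the standard-library importlib.metadata could look like:

```python
from importlib import metadata

def selected_versions(packages):
    """Return a {name: version} mapping for the requested packages,
    silently skipping any that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package not present in this environment; skip it.
            pass
    return versions

# Placeholder selection; a real calcfunction would list its own imports.
print(selected_versions(['numpy', 'scipy', 'sisl']))
```

The resulting small dictionary could then be attached with node.base.extras.set without bloating the database.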

@sphuber I tried creating a decorator that I can use to automatically store a dump of the environment info in the node.base.extras of the CalcFunctionNode whenever I call a calcfunction:

from aiida.plugins import DataFactory
from importlib import metadata
import tempfile
import os.path as path


def store_env(calcfunc):
    """Automatically store dump of Python environment
    when calling AiiDA calcfunc"""
    def wrapper(*args, **kwargs):
        # Call calcfunction
        res = calcfunc(*args, **kwargs)

        # Obtain CalcFunctionNode
        if isinstance(res, dict):
            # Result is a dictionary, obtain Data node from this
            node = next(iter(res.values()))
        else:
            # Result is itself a Data node
            node = res
        funcnode = node.creator

        # Get info about Python environment and dump to file
        installed_packages = metadata.distributions()
        installed_packages_list = sorted(
            ["%s==%s" % (i.name, i.version) for i in installed_packages],
            key=str.casefold,
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            fname = path.join(tmpdir, 'requirements.txt')
            with open(fname, 'w') as fh:
                for m in installed_packages_list:
                    fh.write(f'{m}\n')
            SinglefileData = DataFactory('core.singlefile')
            env_file = SinglefileData(fname)

        # Attach file to CalcFunctionNode
        env_file.store()
        funcnode.base.extras.set('env_info_uuid', env_file.uuid)

        # Return result of calcfunc to caller
        return res

    return wrapper

I can then use this to decorate a calcfunction like this:

from aiida.engine import calcfunction
from .utils import store_env


@store_env
@calcfunction
def testfunc(x: int, y: int):
    return {'sum': x+y, 'diff': x-y}

Do you think this decorator would be safe to use with AiiDA? Or am I doing something which would be strongly discouraged?

This method is far from foolproof: the way I'm currently obtaining the instance of the CalcFunctionNode is probably quick to break during AiiDA updates, and I still have to remember to also export all these SinglefileData nodes if I want to export my data to an archive. Nevertheless, it might be better than nothing.

Hi @ahkole , there is nothing intrinsically dangerous about your approach. It is also not likely to break due to changes internal to aiida-core in the future. This is because you are really just adding a wrapper around the calcfunction decorator and using the extras of the CalcFunctionNode to loosely add a SinglefileData node.

The main downsides are, as you already mentioned, that since the "link" with the env data node is just through the extras, it is not automatically discovered when exporting, and this is a manual step you have to perform. Also, as mentioned before, the link being just the UUID in the extras can be fragile, as that extra can be changed or lost.

Finally, I am not 100% sure how this approach would work when users call a calcfunction through the run or run_get_node attributes, i.e.:

@store_env
@calcfunction
def testfunc(x: int, y: int):
    return {'sum': x+y, 'diff': x-y}

results, node = testfunc.run_get_node(1, 2)

In this case, the return value would be a tuple of the result and the CalcFunctionNode, but I am not sure how this works with your store_env decorator. Something you may want to test.

Hi @sphuber, thanks for taking a look at my trial implementation.

The main downsides are, as you already mentioned, that since the "link" with the env data node is just through the extras, it is not automatically discovered when exporting, and this is a manual step you have to perform. Also, as mentioned before, the link being just the UUID in the extras can be fragile, as that extra can be changed or lost.

Yes, I think the only way to make this information robustly and immutably attached to the CalcFunctionNode would be to store it in the repository of the node itself before it gets stored in the database. But as you mentioned before this probably requires a change to aiida-core itself and is not something I can write a utility wrapper for myself.

Finally, I am not 100% sure how this approach would work when users call a calcfunction through the run or run_get_node attributes, i.e.:

Interesting point. I have to admit I did not properly realize that after decorating a function with @calcfunction it is no longer a "regular" Python function but an instance of ProcessFunctionType. I wasn't even aware that you could call it using these run and run_get_node attributes :sweat_smile:. This does give me the idea to use run_get_node inside my decorator to more cleanly get access to the CalcFunctionNode instance after calling the function.

If I understand decorators correctly, after applying my @store_env decorator the function has been turned back into a "regular" Python function, so I don't think it would even be possible to still use those attributes (but I will test). So my approach might have been too simple and leads to a loss of functionality, since you no longer have a ProcessFunctionType instance. Nevertheless, for most of my use cases it is probably sufficient to be able to use it as a regular Python function. In any case, I will have a look to see if I can improve the decorator.

FYI, I just tested, and if I try testfunc.run_get_node(1, 2) I indeed get the expected error: 'function' object has no attribute 'run_get_node'.

Yes, I think you would have to manually re-add those additional attributes that we add to the ProcessFunctionType to the wrapped function that your decorator returns.
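To illustrate the idea in plain Python (FakeProcessFunction below is just a stand-in for AiiDA's ProcessFunctionType, not the real class): a plain wrapper function hides the attributes of the wrapped object, but they can be copied back onto the wrapper. Note that attributes re-added this way bypass the wrapper body itself, so they would need their own wrapping to keep the env-storing behaviour consistent.

```python
import functools

class FakeProcessFunction:
    """Stand-in for the callable returned by @calcfunction."""

    def __call__(self, x):
        return x + 1

    def run_get_node(self, x):
        # The real method returns (results, CalcFunctionNode).
        return self(x), 'node'

def store_env(func):
    """Wrap *func* and re-expose selected attributes of the original
    object, which a plain wrapper function would otherwise hide."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    # Re-add the process-function attributes onto the wrapper. Caveat:
    # calls through these bypass the wrapper body itself.
    for attr in ('run', 'run_get_node'):
        if hasattr(func, attr):
            setattr(wrapper, attr, getattr(func, attr))
    return wrapper

wrapped = store_env(FakeProcessFunction())
print(wrapped(1))               # 2
print(wrapped.run_get_node(1))  # (2, 'node')
```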

Small update. I changed my approach a little bit. Instead of decorating the calcfunction returned by @calcfunction, I now first decorate the regular Python function to add an additional output port with the dump of the Python environment, and then pass that to the @calcfunction decorator. The new approach looks like this:

from aiida.plugins import DataFactory
from importlib import metadata
import io
from functools import wraps


def store_env(calcfunc):
    @wraps(calcfunc)  # Assign calcfunc attributes to wrapper for use in calcfunction decorator
    def wrapper(*args, **kwargs):
        # Call calcfunction
        res = calcfunc(*args, **kwargs)

        # Get info about Python environment and dump to file
        installed_packages = metadata.distributions()
        installed_packages_list = '\n'.join(sorted(
            ["%s==%s" % (i.name, i.version) for i in installed_packages],
            key=str.casefold,
        ))
        FolderData = DataFactory('core.folder')
        fd = FolderData()
        fd.put_object_from_filelike(
            io.BytesIO(installed_packages_list.encode('utf-8')),
            path='requirements.txt',
        )

        # Add env dump as additional output
        if isinstance(res, dict):
            res['python.env_dump'] = fd
        else:
            new_res = {'result': res, 'python.env_dump': fd}
            res = new_res
        return res

    return wrapper

To get it to work properly it was essential to use wraps from functools, to make sure the wrapper returned by @store_env looks like the wrapped function to @calcfunction. My initial tests show that all functionality of the calcfunctions still works (i.e. type validation, automatic input serialization, docstring parsing and variadic arguments), as in the following examples:

from aiida.engine import calcfunction
from .utils import store_env
from aiida.plugins import DataFactory


@calcfunction
@store_env
def testfunc(x: int, y: int):
    """Add and subtract two integers.

    :param x: Left hand operand.
    :param y: Right hand operand.
    """
    return {'sum': x+y, 'diff': x-y}


@calcfunction
@store_env
def testfunc2(x: str):
    return DataFactory('core.str')(f'Hello, {x.value}!')


@calcfunction
@store_env
def testfunc3(*args, **kwargs):
    """Sums all the input.
    """
    res = sum(args)
    for k, v in kwargs.items():
        res += v
    return res

An added benefit of the new approach is that now the dump is stored in the node repository and is therefore immutable and automatically exported with the CalcFunctionNode. I also think this new approach might work with @task.calcfunction from aiida-workgraph, but I have not yet tested that. It’s still not perfect of course because it adds the dump as an output, even though it is not really an output, but it might be the best I can do without modifying aiida-core.
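As an aside, the role of @wraps here can be seen without AiiDA at all: a wrapper built with functools.wraps exposes the original function's name, docstring and annotations, which is exactly the metadata that @calcfunction inspects for docstring parsing and type validation. A minimal illustration:

```python
import functools

def passthrough(func):
    """Minimal version of the store_env pattern: wrap func while
    preserving its metadata for any outer decorator to inspect."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@passthrough
def add(x: int, y: int):
    """Add two integers."""
    return x + y

print(add.__name__)         # add
print(add.__doc__)          # Add two integers.
print(add.__annotations__)  # {'x': <class 'int'>, 'y': <class 'int'>}
print(add(2, 3))            # 5
```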

What do you think @sphuber? Does it look like an okay and safe (i.e. not breaking AiiDA) approach?


Looks like a really clean solution! Think that should be about as good as you can do without changing internals. Good job :+1: