Automatically storing info about Python environment in nodes for calcfunctions

When I design my calcfunctions I try as much as possible to follow the guidelines in the documentation about reproducibility: Process functions — AiiDA 2.6.1 documentation

However, I often have to import functionality from external modules such as numpy, scipy or sisl. Unlike aiida-core and its plugins, the versions of these external Python packages are currently not recorded by AiiDA.

I have been trying to think of ways to record this information in my calcfunction nodes. There are ways of listing the installed packages in your environment from a script, for example by using pkg_resources (https://www.w3resource.com/python-exercises/basic/python-basic-1-exercise-9.php). This seems to work well for environments created with virtualenv, but I don't know how best to store this information.

One option would be to store it in node.base.extras after the node has been created, similar to how the version attribute stores the versions of aiida-core and plugins. However, I'm not sure whether this would put too much data in your PostgreSQL database if you record this for every CalcFunctionNode and your virtual environment contains a lot of packages. Another idea would be to store the list of packages and versions in the repository and relate it somehow to the CalcFunctionNode, but then I don't know how to ensure that this information also gets exported when you export the CalcFunctionNode. In addition, most of these approaches still require you to manually attach the information to the node after it has been created, instead of it being recorded automatically.

Have other people thought about this and maybe have good solutions?

Hi @ahkole , very good question. This information is indeed not stored, which does reduce reproducibility. Note that even though we store the versions of aiida-core and plugin packages, the plugins themselves can also use such external Python packages, and those are also not recorded. For example, if aiida-quantumespresso uses scipy in a Parser or CalcJob plugin, AiiDA will record the version of aiida-quantumespresso, but if that package does not pin an exact version of scipy (which is almost never done in practice, for good reasons) you still won't know exactly which version of scipy was used.

This is not an easy problem to solve well, which is why, way back when, we decided not to store the entire Python environment in the provenance. That is just some historical context; that being said, it would still be interesting to see if we could add this as optional functionality.

This would indeed be an option. The main caveat is that it could slow down queries if the content you are storing is significant. If you store a full dump of all packages in the environment, that may quickly grow out of hand, since a typical AiiDA-based Python environment contains quite a number of packages. Of course, you could decide to store just a selection of the packages that you think are most relevant.

Besides that, there is always the risk of losing the information, since extras are mutable and do not have the immutability guarantees of attributes. So it wouldn't be the safest solution, but if you are handling your own data and make sure you don't overwrite the extras, it could still be OK.

This would be an interesting alternative, as it would solve the problem of immutability (the repository is immutable once the node is stored, just like the attributes) and you avoid overloading the database. The repository contents are automatically attached to the node just like the attributes, so when exporting/importing that information is perfectly preserved. The real problem, though, is that once the node is stored the repository is immutable, so from the calcfunction's body it is impossible to add information to it.

Even if it were not immutable, with the current public API it is not really possible, because to add to the repository or extras you would need access to the CalcFunctionNode instance inside the function body:

@calcfunction
def perfect_provenance():
    import io
    # how do I get the `CalcFunctionNode` here to be able to do:
    node.base.repository.put_object_from_filelike(
        io.BytesIO(b'scipy==1.2.0,....'),
        'package_list',
    )

There is a hack but it is not part of the public API and I am not sure if it is safe, so I wouldn’t recommend it, but I put it here for informational purposes:

@calcfunction
def perfect_provenance():
    from aiida.engine import Process
    # Retrieve the current process on the stack, which should be this function
    process = Process.current()
    # From that, we can access the associated node
    node = process.node
    # And now we can update the extras
    node.base.extras.set('package', 'scipy==1.2.0')

If we think this is a valuable feature, it would seem to require some functionality in aiida-core. Maybe we could add a metadata input for calcfunctions that, when enabled, would automatically record the environment information on the node. This way it would automatically be available for all calcfunctions and the user wouldn't have to implement it each time. E.g.

@calcfunction
def add(a: int, b: int):
    return a + b

results, node = add.run_get_node(1, 2, metadata={'store_environment_information': True})
print(node.get_environment_info())

and that would then print the env info representation.

It would be an interesting feature request, but I cannot guarantee that it would be implemented. We would have to think hard about how to make this robust such that it works on all platforms and all types of environments, and that we store the right amount of information in the right way.

Hi @sphuber, thank you for the very elaborate reply! I understand that it is a difficult problem to solve. Nevertheless, I still think it would be a valuable feature to have in some form. If I wanted to submit a feature request for this, would I have to do this on the GitHub of aiida-core?

For the time being I think the best option for me would be to manage the information about the actual packages that I import (which is usually a small list) in the extras of the node. That would not give perfect provenance, but I feel it’s already a lot better than storing no information about the packages imported by the function. And it does not fill up the database with too much information. I also feel more comfortable trusting myself to manage the extras properly than to use a (potentially) unsafe hack which you recommended against.
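As a sketch of what I have in mind (the package names below are just placeholders for whatever a given calcfunction actually imports), collecting only the versions of selected packages with the standard-library importlib.metadata could look like:

```python
from importlib import metadata

def selected_versions(packages):
    """Return a {name: version} mapping for the requested packages,
    silently skipping any that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package not present in this environment; skip it.
            pass
    return versions

# Placeholder selection; a real calcfunction would list its own imports.
print(selected_versions(['numpy', 'scipy', 'sisl']))
```

The resulting small dictionary could then be attached with node.base.extras.set without bloating the database.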

@sphuber I tried creating a decorator that I can use to automatically store a dump of the environment info in the node.base.extras of the CalcFunctionNode whenever I call a calcfunction:

from aiida.plugins import DataFactory
from importlib import metadata
import tempfile
import os.path as path


def store_env(calcfunc):
    """Automatically store dump of Python environment
    when calling AiiDA calcfunc"""
    def wrapper(*args, **kwargs):
        # Call calcfunction
        res = calcfunc(*args, **kwargs)

        # Obtain CalcFunctionNode
        if isinstance(res, dict):
            # Result is a dictionary, obtain Data node from this
            node = next(iter(res.values()))
        else:
            # Result is itself a Data node
            node = res
        funcnode = node.creator

        # Get info about Python environment and dump to file
        installed_packages = metadata.distributions()
        installed_packages_list = sorted(
            ["%s==%s" % (i.name, i.version) for i in installed_packages],
            key=str.casefold,
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            fname = path.join(tmpdir, 'requirements.txt')
            with open(fname, 'w') as fh:
                for m in installed_packages_list:
                    fh.write(f'{m}\n')
            SinglefileData = DataFactory('core.singlefile')
            env_file = SinglefileData(fname)

        # Attach file to CalcFunctionNode
        env_file.store()
        funcnode.base.extras.set('env_info_uuid', env_file.uuid)

        # Return result of calcfunc to caller
        return res

    return wrapper

I can then use this to decorate a calcfunction like this:

from aiida.engine import calcfunction
from .utils import store_env


@store_env
@calcfunction
def testfunc(x: int, y: int):
    return {'sum': x+y, 'diff': x-y}

Do you think this decorator would be safe to use with AiiDA? Or am I doing something which would be strongly discouraged?

This method is far from foolproof: the way I'm currently obtaining the instance of the CalcFunctionNode is probably quick to break during AiiDA updates, and I still have to remember to also export all these SinglefileData nodes if I want to export my data to an archive. Nevertheless, it might be better than nothing.

Hi @ahkole , there is nothing intrinsically dangerous about your approach. It is also not likely to break due to changes internal to aiida-core in the future. This is because you are really just adding a wrapper around the calcfunction decorator and using the extras of the CalcFunctionNode to loosely add a SinglefileData node.

The main downsides are, as you already mentioned, that since the "link" with the env data node is just through the extras, it is not automatically discovered when exporting, and this is a manual step you have to perform. Also, as mentioned before, the link being just the UUID in the extras can be fragile, as that extra can be changed or lost.

Finally, I am not 100% sure how this approach would work when users call a calcfunction through the run or run_get_node attributes, i.e.:

@store_env
@calcfunction
def testfunc(x: int, y: int):
    return {'sum': x+y, 'diff': x-y}

results, node = testfunc.run_get_node(1, 2)

In this case, the return value would be a tuple of the result and the CalcFunctionNode, but I am not sure how this works with your store_env decorator. Something you may want to test.

Hi @sphuber, thanks for taking a look at my trial implementation.

The main downsides are, as you already mentioned, that since the "link" with the env data node is just through the extras, it is not automatically discovered when exporting, and this is a manual step you have to perform. Also, as mentioned before, the link being just the UUID in the extras can be fragile, as that extra can be changed or lost.

Yes, I think the only way to make this information robustly and immutably attached to the CalcFunctionNode would be to store it in the repository of the node itself before it gets stored in the database. But as you mentioned before this probably requires a change to aiida-core itself and is not something I can write a utility wrapper for myself.

Finally, I am not 100% sure how this approach would work when users call a calcfunction through the run or run_get_node attributes, i.e.:

Interesting point. I have to admit I did not properly realize that after decorating a function with @calcfunction it is no longer a "regular" Python function but an instance of ProcessFunctionType. I wasn't even aware that you could call it using these run and run_get_node attributes :sweat_smile:. This does give me the idea to use run_get_node inside my decorator to more cleanly get access to the CalcFunctionNode instance after calling the function.

If I understand decorators correctly, after applying my @store_env decorator the function has been turned back into a "regular" Python function, so I don't think it would even be possible to still use those attributes (but I will test). So my approach might have been too simple and leads to a loss of functionality, since you no longer have a ProcessFunctionType instance. Nevertheless, for most of my use cases it is probably sufficient to be able to use it as a regular Python function. In any case, I will have a look to see if I can improve the decorator.

FYI, I just tested, and if I try testfunc.run_get_node(1, 2) I indeed get the expected error: 'function' object has no attribute 'run_get_node'.

Yes, I think you would have to manually re-add those additional attributes that we add to the ProcessFunctionType to the wrapped function that your decorator returns.
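To illustrate the idea in plain Python (FakeProcessFunction below is just a stand-in for AiiDA's ProcessFunctionType, not the real class): a plain wrapper function hides the attributes of the wrapped object, but they can be copied back onto the wrapper. Note that attributes re-added this way bypass the wrapper body itself, so they would need their own wrapping to keep the env-storing behaviour consistent.

```python
import functools

class FakeProcessFunction:
    """Stand-in for the callable returned by @calcfunction."""

    def __call__(self, x):
        return x + 1

    def run_get_node(self, x):
        # The real method returns (results, CalcFunctionNode).
        return self(x), 'node'

def store_env(func):
    """Wrap *func* and re-expose selected attributes of the original
    object, which a plain wrapper function would otherwise hide."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    # Re-add the process-function attributes onto the wrapper. Caveat:
    # calls through these bypass the wrapper body itself.
    for attr in ('run', 'run_get_node'):
        if hasattr(func, attr):
            setattr(wrapper, attr, getattr(func, attr))
    return wrapper

wrapped = store_env(FakeProcessFunction())
print(wrapped(1))               # 2
print(wrapped.run_get_node(1))  # (2, 'node')
```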

Small update. I changed my approach a little bit. Instead of decorating the calcfunction returned by @calcfunction, I now first decorate the regular Python function to add an additional output port with the dump of the Python environment, and then pass that to the @calcfunction decorator. The new approach looks like this:

from aiida.plugins import DataFactory
from importlib import metadata
import io
from functools import wraps


def store_env(calcfunc):
    @wraps(calcfunc)  # Assign calcfunc attributes to wrapper for use in calcfunction decorator
    def wrapper(*args, **kwargs):
        # Call calcfunction
        res = calcfunc(*args, **kwargs)

        # Get info about Python environment and dump to file
        installed_packages = metadata.distributions()
        installed_packages_list = '\n'.join(sorted(
            ["%s==%s" % (i.name, i.version) for i in installed_packages],
            key=str.casefold,
        ))
        FolderData = DataFactory('core.folder')
        fd = FolderData()
        fd.put_object_from_filelike(
            io.BytesIO(installed_packages_list.encode('utf-8')),
            path='requirements.txt',
        )

        # Add env dump as additional output
        if isinstance(res, dict):
            res['python.env_dump'] = fd
        else:
            new_res = {'result': res, 'python.env_dump': fd}
            res = new_res
        return res

    return wrapper

To get it to work properly it was essential to use wraps from functools, to make sure the wrapper returned by @store_env looks like the wrapped function to @calcfunction. My initial tests show that all functionality of the calcfunctions still works (i.e. type validation, automatic input serialization, docstring parsing and variadic arguments), as in the following examples:

from aiida.engine import calcfunction
from .utils import store_env
from aiida.plugins import DataFactory


@calcfunction
@store_env
def testfunc(x: int, y: int):
    """Add and subtract two integers.

    :param x: Left hand operand.
    :param y: Right hand operand.
    """
    return {'sum': x+y, 'diff': x-y}


@calcfunction
@store_env
def testfunc2(x: str):
    return DataFactory('core.str')(f'Hello, {x.value}!')


@calcfunction
@store_env
def testfunc3(*args, **kwargs):
    """Sums all the input.
    """
    res = sum(args)
    for k, v in kwargs.items():
        res += v
    return res

An added benefit of the new approach is that now the dump is stored in the node repository and is therefore immutable and automatically exported with the CalcFunctionNode. I also think this new approach might work with @task.calcfunction from aiida-workgraph, but I have not yet tested that. It’s still not perfect of course because it adds the dump as an output, even though it is not really an output, but it might be the best I can do without modifying aiida-core.
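As an aside, the role of @wraps here can be seen without AiiDA at all: a wrapper built with functools.wraps exposes the original function's name, docstring and annotations, which is exactly the metadata that @calcfunction inspects for docstring parsing and type validation. A minimal illustration:

```python
import functools

def passthrough(func):
    """Minimal version of the store_env pattern: wrap func while
    preserving its metadata for any outer decorator to inspect."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@passthrough
def add(x: int, y: int):
    """Add two integers."""
    return x + y

print(add.__name__)         # add
print(add.__doc__)          # Add two integers.
print(add.__annotations__)  # {'x': <class 'int'>, 'y': <class 'int'>}
print(add(2, 3))            # 5
```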

What do you think @sphuber? Does it look like an okay and safe (i.e. not breaking AiiDA) approach?


Looks like a really clean solution! Think that should be about as good as you can do without changing internals. Good job :+1: