Hello, I have some questions about reproducibility for a provenance graph that contains locally defined calcfunctions, workfunctions or WorkChains. Since these components are probably not published anywhere, is enough information stored in the database for someone else, who doesn’t have access to the local files where they were defined, to fully reproduce all the calculations and workflows? I know that calcfunction and workfunction automatically record the full source code of the function in the database, but what about WorkChains (i.e. workflows that are defined somewhere in your PYTHONPATH but are not part of any published package)? Is something similar stored for those?
Hi, not by default. We have discussed whether we should record more by default, and if so what, and/or enforce it on users. We came to the conclusion that AiiDA already comes with certain requirements, and adding more would discourage most users, whose main reason for using AiiDA is not perfect provenance tracking (but instead, e.g., the use of existing powerful workflows).
So we accept that what we track is not perfect, but we try to reach a reasonable compromise (in terms of performance, complexity/simplicity for users, and actual reproducibility guarantees) and give people who care about it ways to get to (almost) perfect reproducibility.
Specifically: for calcfunctions/workfunctions, as you say, we store the source code. Note that this by itself is not sufficient (the function might import from other Python files, the versions of the dependencies are not stored, …). CalcJobs and WorkChains come with a Python package (you have to define the entry points in the pyproject.toml), so we store the version of that package in the attributes. Check verdi node attributes <PK> and you will find something like:
"version": {
    "core": "2.7.1.post0",
    "plugin": "0.1.0"
}
where core is the version of AiiDA-core, and plugin the version of the CalcJob/WorkChain package.
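As a rough sketch of how these recorded versions could be used, the snippet below compares a node's `version` attribute against the versions one expects to rerun with. The helper name `version_mismatches` and the sample dictionaries are hypothetical. In a real session the attributes would come from the node itself (for example via `load_node(<PK>).base.attributes.all` in AiiDA 2.x); that step is left out here so the example runs without a profile.

```python
from typing import Dict, Mapping, Optional, Tuple


def version_mismatches(
    attributes: Mapping[str, dict],
    expected: Mapping[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {key: (recorded, expected)} for every version that differs.

    `attributes` mimics the attribute dictionary of a process node; only its
    "version" entry is inspected. A missing entry is reported as None.
    """
    recorded = attributes.get("version") or {}
    mismatches = {}
    for key, wanted in expected.items():
        found = recorded.get(key)
        if found != wanted:
            mismatches[key] = (found, wanted)
    return mismatches


# Hypothetical attributes, shaped like the output shown above.
node_attributes = {"version": {"core": "2.7.1.post0", "plugin": "0.1.0"}}

# The environment we are about to rerun in (made-up values for illustration).
current = {"core": "2.7.1.post0", "plugin": "0.2.0"}

print(version_mismatches(node_attributes, current))
# prints {'plugin': ('0.1.0', '0.2.0')}
```

An empty result would mean the recorded core and plugin versions match the current environment. It says nothing about other dependencies, which are not recorded.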
If a user cares about reproducibility, they should version their package and save the source code with appropriate tags for the versions (e.g. in git). Note that this does not require publishing the code (but still requires some diligence).
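One possible way to enforce that diligence is to check, before launching anything, that the package is being run from a clean, tagged git checkout. The sketch below assumes the package lives in a git repository and that `git` is on the PATH. The function names are made up for this example, and the tag detection is a heuristic based on the usual `git describe` output.

```python
import re
import subprocess
from typing import Optional

# `git describe --tags --dirty` typically prints one of:
#   v0.1.0               (checkout is exactly at tag v0.1.0)
#   v0.1.0-3-g1a2b3c4    (3 commits after the tag)
#   v0.1.0-dirty         (uncommitted changes on top of the tag)
_AFTER_TAG = re.compile(r"-\d+-g[0-9a-f]+$")


def git_describe(repo_path: str) -> Optional[str]:
    """Return the `git describe` string for repo_path, or None on failure."""
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--dirty"],
            cwd=repo_path,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the path is not a repository, or there are no tags.
        return None
    return result.stdout.strip()


def is_clean_tagged(describe: Optional[str]) -> bool:
    """True only if the checkout sits exactly on a tag with no local changes."""
    if not describe:
        return False
    if describe.endswith("-dirty"):
        return False
    # Heuristic: a "-<n>-g<hash>" suffix means commits were added after the tag.
    return _AFTER_TAG.search(describe) is None
```

Calling `is_clean_tagged(git_describe(path_to_package))` before submitting, and refusing to run when it returns `False`, would help tie the `plugin` version stored in the attributes to a recoverable state of the source.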
Other approaches are possible (and we are open to hearing about easy and better ways to increase the reproducibility guarantees). One I would suggest investigating, if this is important for you (at least for CalcJobs), is to containerize the code and use AiiDA’s support for containerized codes to run it remotely.
For WorkChains, you could investigate having a custom step that stores the source code (or anything else you feel is appropriate) when the WorkChain is executed. This can probably be achieved quite easily “by hand” in a first step of the WorkChain outline, or one could have a simple subclass of WorkChain that does it by default (I haven’t yet thought about how exactly).
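A minimal sketch of the “by hand” variant, under stated assumptions: the helper below collects the source of a class with the standard `inspect` module and fingerprints it with SHA-256. `capture_source` and `Demo` are hypothetical names, and the commented lines showing where this might be called from an outline step (storing the result as an extra on the process node) are an untested suggestion, not an established AiiDA recipe.

```python
import hashlib
import inspect
from typing import Dict, Optional


def capture_source(cls: type) -> Dict[str, Optional[str]]:
    """Collect the source code of cls and a SHA-256 fingerprint of it."""
    try:
        source: Optional[str] = inspect.getsource(cls)
    except (OSError, TypeError):
        # Source not retrievable (e.g. the class was defined interactively).
        source = None
    digest = None
    if source is not None:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return {
        "module": cls.__module__,
        "name": cls.__name__,
        "source": source,
        "sha256": digest,
    }


class Demo:
    """Stand-in for a locally defined WorkChain."""

    def run(self):
        return 42


info = capture_source(Demo)
print(info["name"], info["sha256"])

# Assumed usage inside a WorkChain (not verified here):
#
#     def record_source(self):  # first step of the outline
#         self.node.base.extras.set("source_info", capture_source(type(self)))
```

As with calcfunctions, this only captures the class itself: anything it imports from other local files, and the versions of its dependencies, would still have to be recorded separately.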
Thanks for the explanation! It sounds like this was well thought out, and from my own experience I’m pretty happy with the current implementation and the freedom it gives me to tune the degree of reproducibility to my own needs. Are these considerations/compromises documented somewhere already? I know there is a section on provenance for process functions (the “Process functions” page of the AiiDA 2.7.1 documentation). Is there something similar for CalcJob/WorkChain?
> If a user cares about reproducibility, they should version their package and save the source code with appropriate tags for the versions (e.g. in git). Note that this does not require publishing the code (but still requires some diligence).
This might be a good approach for me. At the very least, the source code would still have to be publicly accessible somewhere (e.g. in a git repo), right?
> Other approaches are possible (and we are open to hearing about easy and better ways to increase the reproducibility guarantees). One I would suggest investigating, if this is important for you (at least for CalcJobs), is to containerize the code and use AiiDA’s support for containerized codes to run it remotely.
That’s an interesting option that I had not considered yet. If I understand the docs correctly, the container image is not stored in the AiiDA repository, right? So the container image would have to be publicly available if you want the work to be reproducible without someone having to contact you for the image? I was also thinking about increasing the reproducibility of installed codes. One of my ideas was to have a subclass for codes installed with EasyBuild or Spack, since those tools store quite a lot of build reproducibility information that could be recorded automatically by AiiDA. I haven’t done anything concrete with that idea yet, though.
> For WorkChains, you could investigate having a custom step that stores the source code (or anything else you feel is appropriate) when the WorkChain is executed. This can probably be achieved quite easily “by hand” in a first step of the WorkChain outline, or one could have a simple subclass of WorkChain that does it by default (I haven’t yet thought about how exactly).
Could be an interesting idea, although I might find it easier to just package my WorkChains in a public git repo (or similar) with proper versioning.