Hi @ahkole, very good question. This information is indeed not stored, and that does reduce reproducibility. Note that even though we store the versions of aiida-core and of the plugin packages, the plugin implementations can themselves use other Python packages, and the versions of those are not recorded either. For example, if aiida-quantumespresso uses scipy in a Parser or CalcJob plugin, the node will record the version of aiida-quantumespresso, but unless that package pins the exact version of scipy (which is almost never done in practice, for good reasons) you still won’t know exactly which version of scipy it was run with.
This is not an easy problem to solve well, which is why, way back, we decided not to store the entire Python environment in the provenance. That is just the historical context; it is still worth looking at whether we could add this as optional functionality.
Storing this in the extras would indeed be an option. The downside is that it could slow down queries if the content you are storing is significant. If you instead store a full dump of all packages in the environment, that may quickly get out of hand, since a typical AiiDA-based Python environment has quite a number of packages. Of course, you could decide to store just a selection of the packages that you think are most relevant.
Besides that, there is always the risk of losing the information, since extras are mutable and can still be changed or deleted after the node is stored. So it wouldn’t be the safest solution, but if you are handling your own data and make sure you don’t overwrite the extras, it could still be OK.
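To get a feeling for the difference between a full dump and a selection, here is a minimal sketch using only the standard library's `importlib.metadata`. The function name `collect_package_versions` is made up for this example and is not part of aiida-core.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def collect_package_versions(selection: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Hypothetical helper: map installed package names to their versions.

    With ``selection=None`` every distribution in the environment is included,
    otherwise only the named packages that are actually installed.
    """
    versions: Dict[str, str] = {}

    if selection is None:
        # Full dump: walk over everything installed in the current environment.
        for dist in metadata.distributions():
            name = dist.metadata.get('Name')
            if name:  # skip distributions with missing or broken metadata
                versions[name] = dist.version
        return versions

    for name in selection:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment, so there is nothing to record.
            continue
    return versions


if __name__ == '__main__':
    full = collect_package_versions()
    selected = collect_package_versions(['scipy', 'numpy'])
    print(f'{len(full)} packages in the full dump, {len(selected)} in the selection')
```

The resulting dictionary is plain JSON-serializable data, so on a node you own it could be set as an extra with `node.base.extras.set(...)`, with the mutability caveat described above.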
Storing it in the node’s repository would be an interesting alternative, as it would solve the mutability problem (the repo is immutable once the node is stored, just like the attributes) and you avoid overloading the database. The repository contents are automatically attached to the node, just like the attributes, so when exporting/importing that information is perfectly preserved. The real problem here, though, is that the repo is immutable once the node is stored, so from within the calcfunction’s body it is impossible to add information to it.
Even if it were not immutable, with the current public API it is not really possible anyway, because to add to the repository or the extras you would need access to the CalcFunctionNode instance inside the function body:
import io
from aiida.engine import calcfunction

@calcfunction
def perfect_provenance():
    # How do I get the `CalcFunctionNode` here, to be able to do:
    node.base.repository.put_object_from_filelike(
        io.BytesIO(b'scipy==1.2.0,....'),
        'package_list',
    )
There is a hack, but it is not part of the public API and I am not sure whether it is safe, so I wouldn’t recommend it. I am putting it here for informational purposes only:
from aiida.engine import calcfunction

@calcfunction
def perfect_provenance():
    from aiida.engine import Process
    # Retrieve the current process on the stack, which should be this function
    process = Process.current()
    # From that, we can access the associated node
    node = process.node
    # And now we can update the extras
    node.base.extras.set('package', 'scipy==1.2.0')
If we think this is a valuable feature, it would seem to require some functionality in aiida-core. Maybe we could add a metadata input for calcfunctions that, when enabled, would automatically record the environment information on the node. This way it would be available for all calcfunctions and the user wouldn’t have to implement it each time. E.g.
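from aiida.engine import calcfunction

@calcfunction
def add(a: int, b: int):
    return a + b

# Proposal only: this metadata option and `get_environment_info` are the suggested feature
results, node = add.run_get_node(1, 2, metadata={'store_environment_information': True})
print(node.get_environment_info)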
and that would then print the env info representation.
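Just to make the idea concrete, this is a rough sketch of what such an environment info representation might contain. Everything here is an assumption for illustration: the function name `environment_info`, the keys of the record and the default package selection are all made up, and nothing like this exists in aiida-core.

```python
import json
import platform
from importlib import metadata


def environment_info(packages=('aiida-core',)):
    """Hypothetical helper: build a small, JSON-serializable environment record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Requested but not installed: record that explicitly as None.
            versions[name] = None

    return {
        'python': platform.python_version(),
        'implementation': platform.python_implementation(),
        'platform': platform.platform(),
        'packages': versions,
    }


if __name__ == '__main__':
    print(json.dumps(environment_info(), indent=2, sort_keys=True))
```

Keeping the record to a fixed set of keys and a short package selection is one way to limit how much ends up stored per node, but which keys and packages are "the right amount" is exactly the open design question.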
It would be an interesting feature request, but I cannot guarantee that it would be implemented. We would have to think hard about how to make this robust, such that it works on all platforms and in all types of environments, and that we store the right amount of information in the right way.