When writing a workchain, especially when using the context, it is easy to create data inside the workchain and lose its provenance.
For example, in aiida-quantumespresso, the PdosWorkChain and PwBandsWorkChain contain code that generates new input parameters using the number of bands from a previous relax calculation.
In these cases, without looking at the source code, we cannot know where these input parameters came from, and the provenance logic is lost.
The AiiDA documentation has an example telling people not to create results (outputs) inside a workchain.
Thus, we also need clear documentation on not creating data (inputs) inside the workchain.
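To make the pattern concrete, here is a sketch in plain Python of the kind of parameter construction being described. The function name, the dictionary keys, and the 1.2 factor are all hypothetical; in a real workchain this logic would operate on AiiDA `Dict` nodes inside a step method, not on plain dicts.

```python
# Hypothetical sketch of in-workchain input creation (plain Python
# stand-in for AiiDA Dict nodes; names and the factor are made up).

def build_nscf_parameters(relax_output_parameters, factor=1.2):
    """Derive new input parameters from a previous relax calculation.

    When this runs inside a workchain step, the returned dictionary
    enters the graph as a fresh input with no recorded origin: only
    the workchain source code explains where `nbnd` came from.
    """
    nbands = relax_output_parameters["number_of_bands"]
    return {
        "SYSTEM": {
            # increase the number of bands for the subsequent nscf run
            "nbnd": int(nbands * factor),
        }
    }

relax_results = {"number_of_bands": 10}
print(build_nscf_parameters(relax_results))
# {'SYSTEM': {'nbnd': 12}}
```

On the calculation level of the graph, the `nbnd` value appears out of nowhere, which is exactly the provenance loss being discussed.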
I have to say I have often struggled with this; it can easily become very tedious, especially if you need to make minor modifications to the input files (e.g. adding a WF guess from a previous calculation). So I am very interested in what people think.
I had also pondered this question when I had just joined the AiiDA team. At the time, the very first workflows (this was in the pre v1.0 era, with a completely different system) would have a lot of calcfunctions to mutate Dict nodes in order to update workflow input parameters. This was a nightmare for maintenance, because every specific type of change required its own calcfunction implementation, so we ended up with a great number of them. I tried to generalize some of them with a generic Dict-update calcfunction, but this soon also became too complex or incomplete for various use cases, such as nested dictionaries. A provenance-purist mindset had taken over, and we were trying to plug all the holes where provenance was leaking.
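For illustration, a minimal sketch of what such a generic Dict-update helper boils down to, written in plain Python (in AiiDA it would be wrapped in a `@calcfunction` taking and returning `Dict` nodes; all names here are hypothetical):

```python
def recursive_update(base, changes):
    """Recursively merge `changes` into a copy of `base`.

    Even this simple merge already forces the design decisions that
    kept making the real calcfunctions more complex or incomplete:
    how to delete a key, how to replace (rather than merge) a nested
    dict, how to handle lists, and so on.
    """
    result = dict(base)  # shallow copy; nested dicts are rebuilt below
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = recursive_update(result[key], value)
        else:
            result[key] = value
    return result

params = {"SYSTEM": {"ecutwfc": 30.0, "nbnd": 10}, "CONTROL": {"calculation": "scf"}}
updated = recursive_update(params, {"SYSTEM": {"nbnd": 12}})
print(updated["SYSTEM"])
# {'ecutwfc': 30.0, 'nbnd': 12}
```

Wrapping every such variation in its own calcfunction, just so the mutation shows up as a node in the graph, is the maintenance burden described above.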
At that point, I took a step back and asked myself: what is the real reason we are doing this? The point of tracking provenance is to enable retracing the origins of data generated by certain processes. Here we are mostly interested in the outputs of processes. Of course, outputs can become inputs to other processes, so in that sense tracking the inputs to processes is also important. However, some inputs do not have any real history; they are simply “created out of thin air” at the start of a chain of processes. What is the real value created by tracking the modification of input parameters in nested workflows? Is any information really lost by not explicitly tracking the modification of input parameter dictionaries?
At this point, a historical note is important. In the very first workflow system of AiiDA (the one I described above), the workflows did not live in the same graph as the calculations (as they do now). In fact, the workflows were not part of the provenance graph whatsoever. This means that the only provenance that was tracked was that of calculations and their input and output data. This is what motivated the original workflow authors to track input parameter mutations explicitly. With the current provenance paradigm of AiiDA, where workflows are fully part of the provenance graph, I would argue that no information is really lost. Although the modification of the parameters may not be explicitly represented through a calcfunction, the workflow records the context in which the modification took place. Since the version of the package from which the workflow comes is recorded, which indirectly captures the source code, the logic of the modification is also (albeit indirectly) recorded. When we look just at the level of calculation provenance, the input may appear to simply pop into existence, but the same goes for the input parameters of the top-level workflow.
In the end, I approach this problem pragmatically: what is the problem that we are trying to solve? Is the explicit capturing of input parameter modification through a calcfunction really giving us valuable information that would otherwise be missing? My answer is no. We are currently not losing any valuable information. And the alternative actually imposes a significant burden: it makes the code more complex and more difficult to maintain, and, arguably, it makes the provenance graph unnecessarily complex as well.
It is currently not possible to capture perfect provenance, and it might not even be necessary or useful. To me, AiiDA should provide users with the tools to track provenance as easily as possible and as completely as necessary. We should simply make it as clear as possible to users when they run the risk of losing provenance and what the implications are. It is then up to them to decide what level of provenance is important for their use case.
@sphuber this is gold, thanks so much for writing it up! I definitely agree with the general sentiment. I wonder if some of this wisdom should be captured in the docs (if it isn’t already).
Think it might indeed be useful to have this somewhere in the documentation. Would have been even better as a sort of blog post. We started the idea for this at some point, but it never crystallized. I will have a look if it can be revived. Otherwise I will add it to the docs, but I think it will require some work to add illustrations and code examples to make the concepts clear.
Indeed, it wouldn’t be bad to revive the blog! We still have the old text (in an obscure branch…) on how to explore MC2D in Quantum Mobile: https://github.com/aiidateam/aiida-blog/blob/exploredb/source/stories/browse_discover.rst (and I see there is also a branch by Chris, to check).
Sebastiaan, if you want to check where we could host this (I’d try to do the simplest thing possible, even as a new section of the docs, or anything easy to find anyway) we can then discuss how to proceed at the next AiiDA meeting?
@sphuber Thanks for the detailed explanation. I agree that we should focus on the goal of the calculation instead of on perfect provenance.
For the moment, the “level of provenance” is not made clear in the docs, so developers may worry about unknown risks when they modify input data. A blog post showing examples of each “level of provenance” along with its risks would be very useful.