It sometimes happens that I discover during the analysis of a simulation that I would like to include an output file in the results that was not part of the original retrieve list of the calculation. My only solution to do this so far is to write a small calcfunction that takes the remote folder of the simulation as an input and retrieves and stores the additional files in the AiiDA repository. This works and ensures there is some link between the files and the calculation that generated them, but it feels a bit hacky. Is there any recommended way to retrieve additional files for a calculation if you discover that the original retrieve list was missing some important files? I suspect it isn't possible to modify the files stored in the node of the calculation because of immutability after it has been stored?
You are correct: after the retrieval is done, the retrieved output is stored and becomes immutable. Your solution of writing a calcfunction is perfectly fine and not really a hack, I would say. Another solution would simply be to update the `CalcJob` plugin to retrieve that file, or to use the `metadata.options.additional_retrieve_list` input to specify the additional files that should be retrieved, and run the calculation again. This does require running the actual calculation again, unfortunately, which is not ideal if it is costly, but it is necessary.
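For reference, a minimal sketch of what such inputs could look like, written as a plain dict (the file names here are made up; `additional_retrieve_list` is a standard `CalcJob` metadata option):

```python
# Sketch of CalcJob inputs with extra files added to the retrieve list.
# The option takes a list of file paths relative to the calculation's
# working directory; the file names below are only examples.
inputs = {
    # ... code, structure, parameters, etc. ...
    'metadata': {
        'options': {
            'additional_retrieve_list': ['extra_output.dat', 'debug.log'],
        },
    },
}
```

When the calculation is then (re)run with these inputs, the listed files end up in the `retrieved` output alongside the plugin's default retrieve list.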
@sphuber maybe the `CalcJobImporter` could also be used for this purpose? E.g. for `pw.x`, once we finally merge, wouldn't you be able to just add the `additional_retrieve_list` to the `metadata` and regenerate a `PwCalculation` node that has the desired output files without actually rerunning the calculation?
Very good point @mbercx. I think you are right. I would have to look at the specifics to see if the `additional_retrieve_list` is respected (it has been a while), but that makes sense, yeah.
One challenging aspect is that these `CalcJob`s are likely to be run by a `WorkChain`, and so even if they were all reimported with the output files, they wouldn't be part of the provenance as intended. I was thinking whether caching could somehow be used to then restore the `WorkChain` provenance, but I suppose that wouldn't work since there is now a `remote_folder` input node.
Thanks for thinking along @mbercx and @sphuber. Ideally it would be possible to create a new node that is identical to the first `CalcJob`, just with the additional files in the `retrieved` output (and without having to rerun the underlying calculation). Using a `CalcJobImporter` for this sounds promising, despite the caveat of missing `WorkChain` provenance. Maybe I should decide on a case-by-case basis whether it's more suitable to use a `CalcJobImporter` or to create a calcfunction that retrieves the additional files.
@ahkole Could you share your calcfunction here? I also need to retrieve some additional files. Thanks!
I usually use this:
```python
from aiida.engine import calcfunction
import aiida.orm as orm
import tempfile
import os.path as path


@calcfunction
def retrieve_files(remote_folder, retrieve_list):
    with tempfile.TemporaryDirectory() as tmpdirname:
        for fname in retrieve_list:
            remote_folder.getfile(fname, path.join(tmpdirname, fname))
        # Create the FolderData while the temporary directory still exists
        retrieved_files = orm.FolderData(tree=tmpdirname)
    return retrieved_files
```
You could then use it for example like this:
```python
additional_files = retrieve_files(node.outputs.remote_folder, orm.List(['file1.txt', 'file2.txt']))
```
Where `node` is some workflow or `CalcJob` node from which you want to retrieve additional files. You do need to make sure that the files you want to retrieve exist in the remote folder, or it will fail. It can also only retrieve single files, not directories. Maybe there is a better way of retrieving files from a `RemoteData`, but this is the best one I have found so far.
As a note, this will open a new SSH connection for every call to `getfile`. So if you run it once in your shell or Jupyter notebook (and you don't have too many files) it's OK, but one should not put this inside a workflow; otherwise you risk opening many parallel connections (and being banned from the supercomputer).
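To illustrate the scaling (a toy example, no SSH involved): a per-file helper that opens its own connection fetches N files with N connections, while moving the loop inside a single open connection needs only one.

```python
# Toy stand-in for a transport: it only counts how often a "connection" opens.
class FakeTransport:
    opened = 0

    def __enter__(self):
        FakeTransport.opened += 1
        return self

    def __exit__(self, *exc):
        return False

    def getfile(self, src, dest):
        pass  # a real transport would copy the file here


def fetch_one_by_one(files):
    # One connection per file: this is what calling getfile() repeatedly
    # on a RemoteData amounts to.
    for fname in files:
        with FakeTransport() as transport:
            transport.getfile(fname, fname)


def fetch_batched(files):
    # A single connection for all files: the loop sits inside the `with`.
    with FakeTransport() as transport:
        for fname in files:
            transport.getfile(fname, fname)


fetch_one_by_one(['a', 'b', 'c'])
print(FakeTransport.opened)  # 3 connections

FakeTransport.opened = 0
fetch_batched(['a', 'b', 'c'])
print(FakeTransport.opened)  # 1 connection
```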
@giovannipizzi Thanks for the warning! I think for most of my use cases this should be fine. But just in case, is there also an easy way to copy all files from a list with a single SSH connection from a RemoteData
?
I fear we don't have a ready-made method, but you can take inspiration from the implementation of `getfile` and do something like the following (just an untested example; I removed the try/except error handling for simplicity, check the `getfile` code for a slightly more robust version):
```python
import os

files_to_copy = [
    ['src1', 'dest1'],
    ['src2', 'dest2'],
]

authinfo = remotedata.get_authinfo()
with authinfo.get_transport() as transport:
    # Note that this for loop is *inside* the with, which opens the connection
    for relpath, destpath in files_to_copy:
        full_path = os.path.join(remotedata.get_remote_path(), relpath)
        transport.getfile(full_path, destpath)
```
However, while this opens a single connection, one should still not use this directly in a workchain: if you open a transport yourself and then submit many workflows, a lot of connections will be opened and AiiDA has no way to limit them. When submitting to the daemon, AiiDA instead makes sure to only open a limited number of connections to each remote computer. The best and most general approach is probably to submit a `TransferCalculation` (a special calculation that does nothing except copying, with what to copy specified in the inputs of the `TransferCalculation`):
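For reference, a sketch of what the copy specification for a `TransferCalculation` might look like, written as plain dicts (key names from memory of the `core.transfer` plugin, so double-check them against its documentation; the file names and the `'folder'` label are made up):

```python
# 'retrieve_files': True means copy from the remote into a local FolderData.
# Each entry is (source node label, source relative path, target relative
# path), where 'folder' refers to the key used in the source_nodes input.
instructions = {
    'retrieve_files': True,
    'local_files': [
        ('folder', 'file1.txt', 'file1.txt'),
        ('folder', 'file2.txt', 'file2.txt'),
    ],
}

# The full inputs would then pair these instructions (wrapped in an
# orm.Dict) with the RemoteData under the matching label, roughly:
# inputs = {
#     'instructions': orm.Dict(dict=instructions),
#     'source_nodes': {'folder': node.outputs.remote_folder},
# }
```

The advantage over opening a transport yourself is that the daemon schedules the copy like any other calculation, so the connection limits AiiDA enforces still apply.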
Thanks! I should be able to figure it out with this.