Recommended way of retrieving additional files of a calculation

It sometimes happens that, during the analysis of a simulation, I discover that I would like to include an output file in the results that was not part of the original retrieve list of the calculation. My only solution so far is to write a small calcfunction that takes the remote folder of the simulation as input, retrieves the additional files, and stores them in the AiiDA repository. This works and ensures there is some link between the files and the calculation that generated them, but it feels a bit hacky. Is there a recommended way to retrieve additional files for a calculation when you discover that the original retrieve list was missing some important files? I suspect it isn’t possible to modify the files stored in the calculation node, because of immutability after it has been stored?

You are correct: once the retrieval is done, the retrieved output is stored and becomes immutable. Your solution of writing a calcfunction is perfectly fine and not really a hack, I would say. Another solution would be to update the CalcJob plugin to retrieve that file, or simply to use the metadata.options.additional_retrieve_list input to specify the additional files that should be retrieved, and run the calculation again. Unfortunately this does require running the actual calculation again, which is not ideal if it is costly, but in that approach it is necessary.
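To make the second option concrete, here is a minimal sketch of the relevant part of the inputs dictionary one would pass when relaunching the CalcJob; the file names are hypothetical placeholders, and the rest of the inputs (code, structure, parameters, etc.) would be the same as in the original run.

```python
# Hedged sketch: only the metadata.options part is shown; the file names
# 'extra_output.dat' and 'timings.log' are hypothetical placeholders.
inputs = {
    # ... code, structure, parameters, etc. as in the original run ...
    'metadata': {
        'options': {
            # Files retrieved *in addition* to the plugin's default retrieve list
            'additional_retrieve_list': ['extra_output.dat', 'timings.log'],
        },
    },
}
```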


@sphuber maybe the CalcJobImporter could also be used for this purpose? E.g. for pw.x, once we finally merge

wouldn’t you be able to just add the additional_retrieve_list to the metadata and regenerate a PwCalculation node that has the desired output files without actually rerunning the calculation?

Very good point @mbercx . I think you are right. I would have to look at the specifics to see whether the additional_retrieve_list is respected (it has been a while), but that makes sense, yeah.

One challenging aspect is that these CalcJobs are likely to be run by a WorkChain, so even if they were all reimported with the output files, they would not be part of the provenance as intended.

I was wondering whether caching could then be used to restore the WorkChain provenance, but I suppose that wouldn’t work, since there is now a remote_folder input node. :thinking:

Thanks for thinking along @mbercx and @sphuber . Ideally it would be possible to create a new node that is identical to the first CalcJob, just with the additional files in the retrieved output (and without having to rerun the underlying calculation). Using a CalcJobImporter for this sounds promising, despite the caveat of missing WorkChain provenance. Maybe I should decide on a case-by-case basis whether it is more suitable to use a CalcJobImporter or to create a calcfunction that retrieves the additional files.

@ahkole Could you share your calcfunction here? I also need to retrieve some additional files. Thanks!

I usually use this:

from aiida.engine import calcfunction
import aiida.orm as orm
import tempfile
import os.path as path


@calcfunction
def retrieve_files(remote_folder, retrieve_list):
    """Retrieve the given files from a RemoteData and store them in a FolderData."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        # Copy each requested file from the remote machine to a local scratch directory
        for fname in retrieve_list.get_list():
            remote_folder.getfile(fname, path.join(tmpdirname, fname))
        # Store the local copies in the AiiDA repository as a FolderData node
        retrieved_files = orm.FolderData(tree=tmpdirname)
    return retrieved_files

You could then use it for example like this:

additional_files = retrieve_files(node.outputs.remote_folder, orm.List(['file1.txt', 'file2.txt']))

Here node is some workflow or CalcJob node from whose remote_folder output you want to retrieve additional files. You do need to make sure that the files you want to retrieve actually exist in the remote folder, or the function will fail. It can also only retrieve single files, not directories. Maybe there is a better way of retrieving files from a RemoteData, but this is the best one I have found so far.
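To fail a bit more gracefully, one could first compare the wanted file names against a single listing of the remote directory (RemoteData does have a listdir method). The helper below is purely illustrative and not part of AiiDA; the AiiDA call in the comment is indicative only.

```python
# Illustrative helper (not part of AiiDA): compare a wanted list of file
# names against a directory listing that was obtained once.
def find_missing(wanted, available):
    """Return the entries of `wanted` that do not appear in `available`."""
    available = set(available)
    return [name for name in wanted if name not in available]

# With AiiDA (not executed here), the listing would come from a single call:
#   missing = find_missing(['file1.txt', 'file2.txt'],
#                          node.outputs.remote_folder.listdir())
```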


As a note, this will open a new SSH connection for every call to getfile. So if you run it once in your shell or Jupyter notebook (and you don’t have too many files) it’s OK, but one should not put this inside a workflow; otherwise you risk opening many parallel connections (and being banned from the supercomputer).

@giovannipizzi Thanks for the warning! I think for most of my use cases this should be fine. But just in case, is there also an easy way to copy all files from a list over a single SSH connection from a RemoteData?

I fear we don’t have a ready-made method, but you can take inspiration from the implementation of getfile and do something like the following (just an untested example; I removed the try/except error handling for simplicity, so check the implementation of getfile for slightly more robust code):

import os

files_to_copy = [
    ['src1', 'dest1'],
    ['src2', 'dest2'],
]

authinfo = remotedata.get_authinfo()
with authinfo.get_transport() as transport:
    # Note that this for loop is *inside* the with, which opens a single connection
    for relpath, destpath in files_to_copy:
        full_path = os.path.join(remotedata.get_remote_path(), relpath)
        transport.getfile(full_path, destpath)

However, while this opens a single connection, one should still not use it directly in a workchain: if you open a transport yourself and then submit many workflows, a lot of connections will be opened, and AiiDA has no way to limit them.

When submitting to the daemon, AiiDA instead makes sure to open only a limited number of connections to the remote computers. The best and most general approach is probably to submit a TransferCalculation, a special calculation that does nothing except copy files, with what to copy specified in its inputs.
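As a hedged sketch of what the TransferCalculation inputs could look like, following the pattern in the AiiDA documentation on transferring data: the node label 'folder' and the file name are placeholders, and the AiiDA calls in the comments are indicative only, not tested here.

```python
# Hedged sketch of TransferCalculation instructions; 'folder' and
# 'file1.txt' are placeholders for your source node label and file.
instructions = {
    'retrieve_files': True,  # copy from the remote into the AiiDA repository
    # Each entry: (source node label, path relative to the source node,
    #              path in the retrieved folder)
    'symlink_files': [
        ('folder', 'file1.txt', 'file1.txt'),
    ],
}

# With AiiDA (not executed here), this would be wrapped and submitted
# roughly as:
#   inputs = {
#       'instructions': orm.Dict(dict=instructions),
#       'source_nodes': {'folder': node.outputs.remote_folder},
#       'metadata': {'computer': node.computer},
#   }
#   submit(CalculationFactory('core.transfer'), **inputs)
```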


Thanks! I should be able to figure it out with this.