How should I think about managing file trees with Aiida?

I have previously designed high throughput DFT workflows by generating file trees that encode information about each calculation. For example, surface calculations go in a surfs/ directory and are further divided by their material IDs. I’ve gone about analyzing data by navigating the file tree and generating a summary data file by parsing the outputs in each calculation directory. This is convenient because I can dump additional output files and include them later into my summary data file if needed. Querying is also easy because the file tree describes the relationship between calculations.

Aiida seems to work completely differently. Going through the tutorials it seems like all relevant data is stored in data nodes that are included in the provenance database. I’ve noticed from using the pyscf plugin that output files are kept, but that they don’t have any of the DFT code outputs and the directory names are odd.

Does it make sense to code in file tree creation with Aiida? Or should I abandon my comfortable file tree structure and instead embrace the provenance model, storing all data I need with parsers and data nodes?

Furthermore, how should I store large data? For example, if I want to store charge densities for thousands of calculations, does the provenance database handle that scale? If I’m communicating to a remote computer, will I be bottle-necked by the data transfer of large files to the computer running aiida?

Hi Cooper, lots of great questions. You are right that AiiDA stores data in a way that does not map directly to a traditional tree structure on a file system. Data is stored both in a database and in a file repository, and neither is meant to be accessed directly the way you are used to with file-based systems.

Writing and launching workflows by creating a file tree first is definitely not a supported approach in AiiDA. For analyzing workflows, there is currently some functionality to inspect the inputs and outputs of calculations that ran on a computer: verdi calcjob gotocomputer takes you to the working directory where the calculation ran, and verdi node repo dump can be used to dump the input/output files to a folder on the local disk in tree form. Both of these, however, operate on single calculations. There is an active PR (Add CLI command to dump inputs/outputs of `CalcJob`/`WorkChain` by qiaojunfeng · Pull Request #6276 · aiidateam/aiida-core · GitHub), expected to be merged in the near future, that will allow dumping (most of) the input and output data for an entire workflow, with a tree structure that represents the call hierarchy of the workflow. You could then use your normal approach to analyze that data outside of AiiDA.

But in the end, AiiDA’s API is mostly geared towards querying the provenance graph with the QueryBuilder and then extracting the data you need through the Python API in order to analyze it.
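To make that concrete, here is a minimal sketch of such a query. It assumes a configured AiiDA profile, that the calculations of interest have been collected in a group (the label 'surfaces' is hypothetical), and that each calculation has a Dict output linked as 'output_parameters' containing an 'energy' key, as plugins like aiida-quantumespresso provide; adapt the filters and projections to your own plugin's outputs.

```python
# Sketch: querying the provenance graph instead of walking a file tree.
# Assumes a configured AiiDA profile; the group label 'surfaces' and the
# 'energy' attribute key are hypothetical placeholders.
from aiida import load_profile, orm

load_profile()

qb = orm.QueryBuilder()
qb.append(orm.Group, filters={'label': 'surfaces'}, tag='group')
qb.append(orm.CalcJobNode, with_group='group', tag='calc')
qb.append(
    orm.Dict,
    with_incoming='calc',
    edge_filters={'label': 'output_parameters'},
    project=['attributes.energy'],
)

for (energy,) in qb.iterall():
    print(energy)
```

This replaces the "navigate the tree and parse each directory" step: the relationships that your directory layout used to encode (surface X belongs to material Y) live in the graph and in groups instead.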

> Furthermore, how should I store large data?

This depends on the use case, the requirements, etc.; AiiDA essentially provides the tools to let you choose. As an example, the aiida-quantumespresso plugin does not retrieve big files like charge densities but leaves them on the remote computer, exactly for the performance reasons you mention. These files are therefore not permanently stored in AiiDA’s provenance graph; only the location of the folder in which they reside is stored, as a sort of symlink. As long as the data still exists, AiiDA can easily copy the files remotely to the directory of the next calculation, or use symlinks, to prevent copying the data.
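As a sketch of what that looks like in practice: a finished CalcJob exposes a RemoteData output node pointing at its working directory on the remote machine, and plugins such as aiida-quantumespresso accept it as a parent_folder input, so the charge density is copied (or symlinked) remote-to-remote at submit time rather than travelling through the machine running AiiDA. The node pk 1234 below is a hypothetical placeholder for one of your finished SCF calculations.

```python
# Sketch: reusing remote files without retrieving them locally.
# Requires aiida-quantumespresso; pk 1234 is a hypothetical placeholder.
from aiida import load_profile, orm
from aiida.plugins import CalculationFactory

load_profile()

PwCalculation = CalculationFactory('quantumespresso.pw')

scf = orm.load_node(1234)                       # a finished SCF calculation
remote = scf.outputs.remote_folder              # RemoteData: a pointer, not a copy

builder = PwCalculation.get_builder()
builder.parent_folder = remote                  # charge density is reused remotely
```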

If you decide you want to store this data more long term, there are still a few options.

  1. You could use the stashing functionality: this essentially instructs AiiDA to copy certain output files from the calculation’s working directory to some permanent storage space on the remote computer for safekeeping. The location is again stored as a symlink, so AiiDA won’t store an actual copy in the provenance graph.
  2. You can actually retrieve the data and store it in the provenance graph, either in the database or in the file repository; for large files, the file repository is recommended. This can deal with large amounts of data as long as you have the disk space. Of course, if the data is created on a remote computer, it has to be retrieved over the connection, which, depending on the connection quality and data size, could be a bottleneck as you say.
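For option 1, stashing is configured through the metadata options of the calculation you submit. The fragment below is a sketch assuming you already have a CalcJob builder; the filename and target path are hypothetical placeholders for your own setup.

```python
# Sketch: stash configuration on a CalcJob builder (config fragment).
# 'charge-density.dat' and the target path are hypothetical placeholders.
from aiida.common.datastructures import StashMode

builder.metadata.options.stash = {
    'source_list': ['charge-density.dat'],    # files to copy out of the working directory
    'target_base': '/scratch/archive/stash',  # permanent storage space on the remote
    'stash_mode': StashMode.COPY.value,       # plain copy on the remote file system
}
```

After the calculation finishes, the stashed location is recorded as an output node in the provenance graph, again as a pointer rather than a copy of the data itself.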

I hope that helps answer some of your questions. If not, feel free to ask for clarification.