Strategy for freeing up space from local AiiDA repository

I have been actively using AiiDA for 2.5 years now. My .aiida repository has grown very large (350 GB, 500 000 nodes) and I am running out of disk space (500 GB hard drive). My repository contains various types of DFT and MD calculations, some of which I need frequently and therefore must keep locally, and others that I keep only just in case.

I need an effective way to back up (archive?) all data, and then manually delete nodes based on certain criteria (e.g. all LAMMPS calculations with more than 1000 atoms that are older than 6 months). I should also be able to easily re-import this data without creating duplicates. Furthermore, it would be a bonus if I could append to an existing archive, so that new local data would merge with older data that is no longer stored locally. Even better: would it be possible to automate backing up, freeing up space locally, and retrieving data from the cloud when needed, in a cloud-sync setup?

What is the best way to achieve this in AiiDA? I realize that I probably should have used different profiles for calculations with different purposes - is there a way to do this separation retrospectively? Any advice on good data management practices in AiiDA is appreciated.

Hi @adamg!

Sorry for the slow response. I think you raise some important but somewhat tricky questions, which is probably why no one has replied yet. Below are some suggestions/ideas.

First, regarding the backup: doing a full backup before you start playing around with the data is probably a good idea. You can use `verdi storage backup` to back up your data directly to a remote machine or an external hard drive:
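A minimal invocation could look like the following (the destination paths here are placeholders; check `verdi storage backup --help` on your AiiDA version for the exact options):

```console
# Back up the current profile to an external drive
verdi storage backup /mnt/external_drive/aiida-backup

# Or back up to a remote machine over SSH
verdi storage backup user@remote-server:/scratch/aiida-backup
```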

Some notes here:

  1. This will create a complete backup of your SQL database and file repository, which you would need to “restore” in order to interact with it. So it doesn’t really allow you to re-import data in the way you describe.
  2. You can run the backup command in a cron job to regularly back up your data; the script is smart enough to do incremental backups (see the example cron entry after this list).
  3. I’m not 100% sure, but I believe the backup is designed to be fully in sync with your profile, so in case you delete data it will also be removed from the backup. Pinging @eimrek to confirm this.
  4. Normally it would be a good idea to run `verdi storage maintain` on your storage to pack your loose repository files before the backup (also shown below), but considering the size of your database, and the fact that for safety reasons AiiDA only deletes the loose files at the end of the maintenance, you probably won’t be able to run it at the moment (see this comment and potential fix).
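Regarding point 2, a crontab entry could look something like this (the schedule and path are just examples, and you may need to set up the environment so that `verdi` is on cron’s PATH); the maintenance command from point 4 is shown below it:

```console
# Example crontab entry: incremental backup every night at 03:00
0 3 * * * verdi storage backup /mnt/external_drive/aiida-backup

# Pack loose repository files before backing up; a full maintain
# requires that nothing else is using the profile
verdi storage maintain --full
```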

So, this backup command was designed primarily to let you recover your profile/data in case e.g. your hard drive crashes. It doesn’t give you a lot of flexibility, but at least you can start cleaning up/reorganising your data with peace of mind.

To answer your other questions: we’ve considered/discussed ways of pushing/pulling to a cloud instance in the past, but never had the manpower to implement it. Appending to an archive, or partially importing from one, is also not possible at the moment, AFAIK. I can see the use case, though.

If possible, you could create another AiiDA instance on a remote server. You could even restore your backup there, so you know you have all your data readily available on this remote. You can then safely delete data locally, and in case you need certain data again, you can create an archive from your remote instance and import it locally. AiiDA will recognise which nodes are already in your local database, avoid creating duplicates, and link the nodes correctly. I suppose you could write a script to automatically create an archive of older nodes, move it to the remote, and import it there, and then perhaps run some code to check that all the UUIDs are stored there before deleting them locally. A rough sketch of what the export side of such a script could look like is below.
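This is a minimal, untested sketch using the `QueryBuilder` and the Python archive API. The `process_label` filter, the 6-month cutoff, and the filenames are placeholders for whatever criteria you actually want; filtering on the number of atoms would additionally require joining to the input structure, which I’ve left out here:

```python
"""Sketch: export calculations older than ~6 months to an AiiDA archive."""
from datetime import datetime, timedelta, timezone

from aiida import load_profile, orm
from aiida.tools.archive import create_archive

load_profile()

cutoff = datetime.now(timezone.utc) - timedelta(days=182)  # roughly 6 months

# Find all CalcJobNodes created before the cutoff whose process label
# looks like a LAMMPS calculation (placeholder filter -- adapt as needed).
qb = orm.QueryBuilder()
qb.append(
    orm.CalcJobNode,
    filters={
        'ctime': {'<': cutoff},
        'attributes.process_label': {'like': '%Lammps%'},
    },
)
old_calcs = [calc for (calc,) in qb.iterall()]
print(f'Exporting {len(old_calcs)} calculations to old_lammps.aiida')

# create_archive traverses the provenance graph, so the inputs/outputs
# of these calculations are included in the archive as well.
create_archive(old_calcs, filename='old_lammps.aiida')

# Keep the UUIDs so you can verify the import on the remote
# before deleting anything locally.
with open('exported_uuids.txt', 'w') as handle:
    handle.write('\n'.join(calc.uuid for calc in old_calcs))
```

After creating the archive, you could copy it to the remote (e.g. with `rsync`), import it there with `verdi archive import`, query the remote profile for the saved UUIDs to confirm they all arrived, and only then delete the nodes locally (e.g. with `verdi node delete`).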

I’m afraid I don’t have a better answer at the moment, but maybe others have some ideas as well. :slight_smile:
