Clean AiiDA storage after deleting the nodes

Hello!

I recently ran into what looks like a bug in the way AiiDA maintains the database. Essentially, I finished working on a particular part of my calculations, archived the corresponding group of nodes, and then deleted the nodes from AiiDA.

However, it seems like the disk space occupied by these nodes was not freed. Is there a way, perhaps, to update the database state in this situation to free the disk space?


Thanks for reposting here @abmazitov! As mentioned, I think running

verdi storage maintain

would already help, but maybe others have a better idea of how this works internally. I’m assuming the PostgreSQL database was actually reduced in size, but the repository part of the storage was perhaps not updated because the maintenance had not been run.


If you are referring to the space that is occupied by the database (and not to the file repository), I think verdi storage maintain should indeed solve it. If you delete rows from a PostgreSQL database, the space can be reused by new nodes/rows but it is still reserved, so you don’t see a change in disk usage (again, assuming that you mean the disk space occupied by the database). This is the typical behavior of a PostgreSQL database; see the VACUUM page of the PostgreSQL 16 documentation.
This vacuum operation is called by verdi storage maintain.
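
To make the "deleted but still reserved" behaviour described above concrete, here is a minimal sketch using SQLite instead of PostgreSQL, only because SQLite ships with Python. The internals of the two databases differ, so treat this as an analogy and not as a description of what PostgreSQL or AiiDA does. The function name `measure_sizes` and the `node` table are made up for the example.

```python
import os
import sqlite3
import tempfile


def measure_sizes(n_rows: int = 2000) -> dict:
    """Return the database file size after insert, after delete, and after VACUUM."""
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        conn = sqlite3.connect(path)
        # Make sure freed pages are kept in the file until an explicit VACUUM.
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.executemany(
            "INSERT INTO node (payload) VALUES (?)",
            [("x" * 1000,) for _ in range(n_rows)],
        )
        conn.commit()
        sizes["after_insert"] = os.path.getsize(path)

        # Deleting rows marks their pages as free for reuse,
        # but the file on disk does not shrink.
        conn.execute("DELETE FROM node")
        conn.commit()
        sizes["after_delete"] = os.path.getsize(path)

        # VACUUM rewrites the file and returns the free pages to the OS.
        conn.execute("VACUUM")
        sizes["after_vacuum"] = os.path.getsize(path)
        conn.close()
    return sizes


if __name__ == "__main__":
    for step, size in measure_sizes().items():
        print(f"{step}: {size} bytes")
```

Running it should show that the file is no smaller after the delete than after the insert, and only shrinks once the vacuum has run.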

Edit: Even if you are referring to the repository, verdi storage maintain would also take care of the files that don’t belong to a database entry anymore.


The solution has already been mentioned: verdi storage maintain will reclaim space that has been freed. For optimal effect, stop the daemon and run verdi storage maintain --full.

Then a slight correction to the provided explanation. verdi storage maintain does not actually trigger a vacuum on the PostgreSQL database. There is currently no API in AiiDA to force this, and it is left to PostgreSQL itself to decide automatically when to do it. Instead, verdi storage maintain forces a cleanup of the repository. Since the repository has automatic deduplication, when a node gets deleted we cannot simply go ahead and delete its files from the repository, because they may be referenced by another node. This is why, when you delete nodes, you do not see a reduction in the size of the repository on disk. When you call verdi storage maintain, AiiDA checks which files in the repository are no longer referenced by any node and fully deletes them, which is what actually frees up the space on disk. There is a vacuum option for storage maintain, but this is to vacuum the SQLite database that is used by the repository. It does not affect the PostgreSQL database.
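
The deduplication argument above can be sketched with a toy in-memory model. This is not AiiDA's repository implementation; the class `ToyRepository` and its methods are hypothetical names chosen for illustration. It only shows why deleting a node cannot immediately delete its file, and why a separate maintenance pass is what frees the space.

```python
import hashlib


class ToyRepository:
    """Toy content-addressed store: identical content is stored once under its hash."""

    def __init__(self):
        self.objects = {}     # content hash -> bytes (the "files on disk")
        self.node_files = {}  # node pk -> set of content hashes it references

    def add_node(self, pk: int, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        # Deduplication: content that is already present is not stored again.
        self.objects.setdefault(key, content)
        self.node_files.setdefault(pk, set()).add(key)
        return key

    def delete_node(self, pk: int) -> None:
        # Only the reference is dropped. The object itself stays,
        # because another node may still point to the same content.
        self.node_files.pop(pk, None)

    def maintain(self) -> int:
        """Delete objects no node references any more; return how many were removed."""
        referenced = set().union(*self.node_files.values())
        orphans = [key for key in self.objects if key not in referenced]
        for key in orphans:
            del self.objects[key]
        return len(orphans)

    def size(self) -> int:
        return sum(len(content) for content in self.objects.values())


if __name__ == "__main__":
    repo = ToyRepository()
    repo.add_node(1, b"same")
    repo.add_node(2, b"same")    # deduplicated with node 1
    repo.add_node(3, b"unique")
    repo.delete_node(1)
    repo.delete_node(3)
    print("size after deleting nodes:", repo.size())
    print("objects removed by maintain:", repo.maintain())
    print("size after maintain:", repo.size())
```

In this model the content shared by nodes 1 and 2 survives the maintenance pass because node 2 still references it, while the content that only node 3 used is removed.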


@sphuber Thanks a lot for the clarification and sorry for the confusion!

Not to worry @t-reents! Thanks for joining the discussion, much appreciated :+1:


What if in addition, one wanted to also reset the pks? This is often the case when developing. I run a few jobs, test the output, etc. etc., then clean periodically or when satisfied. But my pks are out of control. I’m at 15k+ :sob:

Well, this is a different thing from the original post. How important is it, apart from peace of mind? :slight_smile:

I think you should just ignore the big numbers (and just create a new profile from scratch if it really bothers you). You could look into some low-level PostgreSQL commands, as e.g. discussed here, but I’d be very careful as you risk doing serious damage (e.g. codes have a PK, so unless you really want to delete everything, in which case it’s easier to make a new profile, you risk data loss).
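
As a small illustration of why PKs keep growing after a cleanup, here is a sketch using an auto-incrementing key in SQLite. It is only an analogy: AiiDA's PostgreSQL backend uses sequences, which, as far as I know, likewise do not rewind when rows are deleted. The table and the function name `next_pk_after_cleanup` are made up for the example.

```python
import sqlite3


def next_pk_after_cleanup(n_rows: int = 5) -> int:
    """Insert n_rows rows, delete them all, and return the pk of the next new row."""
    conn = sqlite3.connect(":memory:")
    # AUTOINCREMENT guarantees that a pk is never reused, even after deletion.
    conn.execute(
        "CREATE TABLE node (pk INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO node (label) VALUES (?)",
        [(f"node-{i}",) for i in range(n_rows)],
    )
    conn.execute("DELETE FROM node")  # "clean up" every row
    cursor = conn.execute("INSERT INTO node (label) VALUES ('fresh')")
    pk = cursor.lastrowid
    conn.close()
    return pk


if __name__ == "__main__":
    # The counter continues from where it left off instead of restarting at 1.
    print(next_pk_after_cleanup(5))
```

A fresh database (or, in AiiDA terms, a fresh profile) starts its counter from scratch, which is why creating a new profile gives low PKs again.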

To facilitate this, we could discuss a way to recreate a new empty profile but with some default things already set up (e.g. computers and codes). Some kind of default data stored in YAML that gets filled right after you create the profile.

@mbercx had worked on something like this, I think, to create a new project? (aiida-project?). Good to check it and discuss with him

Not important, just annoying. Not enough to dig into PostgreSQL commands, but definitely enough to consider implementing a verdi command to handle it. I have not played with profiles much. If creating one allocates fresh DB space (blank slate w.r.t. pks), then perhaps this is one valid approach. The verdi command could take care of it then for automation/ease-of-use. Will check with @mbercx :slight_smile:

Haha, I sort of understand your pain, this also used to bother me. But I think after working with databases with 10m+ nodes for so long I’ve become desensitized.

AiiDA-project is more about quickly setting up properly separated (looking at you, $AIIDA_PATH) Python environments and quickly changing between them/nuking them once you’re done (although this is still not complete).

It’s already quite easy to set up new profiles and then delete them:

❯ verdi quicksetup -n --profile tmp --db-name tmp
Success: created new profile `tmp`.
Report: initialising the profile storage.
Report: initialising empty storage schema
Report: Migrating to the head of the main branch
Success: storage initialisation completed.
❯ verdi profile setdefault tmp
Success: tmp set as default profile
❯ verdi shell
Python 3.10.13 (main, Aug 24 2023, 22:43:20) [Clang 14.0.0 (clang-1400.0.29.202)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: Int(2).store()
Out[1]: <Int: uuid: c1b04942-88b8-4b84-a16f-474adf3298dd (pk: 1) value: 2>

In [2]: exit()
❯ verdi profile delete -f tmp
Warning: deleting profile `tmp` excluding: database user.
Warning: this operation cannot be undone, Success: profile `tmp` was deleted excluding: database user..
❯ psql -l
                          List of databases
    Name    | Owner  | Encoding | Collate | Ctype | Access privileges
------------+--------+----------+---------+-------+-------------------
 core-dev   | mbercx | UTF8     | C       | C     |
 cwf-dev    | mbercx | UTF8     | C       | C     |
 postgres   | mbercx | UTF8     | C       | C     |
 pseudo-dev | mbercx | UTF8     | C       | C     |
 qe-dev     | mbercx | UTF8     | C       | C     |
 subcon-dev | mbercx | UTF8     | C       | C     |
 super-dev  | mbercx | UTF8     | C       | C     |
 template0  | mbercx | UTF8     | C       | C     | =c/mbercx        +
            |        |          |         |       | mbercx=CTc/mbercx
 template1  | mbercx | UTF8     | C       | C     | =c/mbercx        +
            |        |          |         |       | mbercx=CTc/mbercx
 test-dev   | mbercx | UTF8     | C       | C     |
(10 rows)

But the tedious part is setting up all the computers/codes etc afterward which you might want in your dev environment. I typically have a Makefile with all the commands to do this, but that’s not really practical. I think what we’re looking for here would be more something along the lines of:

This could then also be integrated with AiiDA-project.

The issue you link for verdi --config is not really about that though. Rather, it tries to address the fact that verdi commands with dynamic subcommands, such as verdi code create, no longer take YAML files that are completely self-contained. For example, if you have a YAML file to set up a code, that file doesn’t specify which subcommand of verdi code create should be called, so the user has to know that. The idea would be to add --config to the verdi code create base group, so that the config file can contain the actual subcommand in addition to the actual command options.

Then, to get back on topic. What we really need is a command that can easily export codes (and their computers). There is the verdi archive create command, but archives are not the ideal vehicle for this. Rather, we would ideally have them exported to a YAML file. Then there would be a single command that can recreate the codes and computers from said YAML. The latter is actually quite straightforward, especially with the recent improvements using pydantic for these models, and I already have a working version that we use internally at Microsoft. The trickier part is the former though. That is to say, an inflexible, hard-coded, ad-hoc solution for each code/computer type would be easy, but writing a solution that is applicable to any plugin is what is tricky.

Now that we have pydantic for storage configurations, I will migrate the Code implementations as well. This will make it easier to write such a generic solution. Then we just need to do the Transports as well, and we are almost there.
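
To give a rough idea of the export/recreate round trip, here is a minimal sketch. It is not the internal version mentioned above and not an AiiDA API: it uses stdlib dataclasses and JSON as stand-ins for pydantic and YAML so that it runs without extra dependencies, and the classes `ComputerSpec` and `CodeSpec`, their fields, and the functions `export_setup` and `load_setup` are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ComputerSpec:
    """Hypothetical, hard-coded description of a computer."""
    label: str
    hostname: str
    transport: str
    scheduler: str


@dataclass
class CodeSpec:
    """Hypothetical, hard-coded description of a code installed on a computer."""
    label: str
    computer: str  # label of the ComputerSpec this code runs on
    filepath_executable: str


def export_setup(computers, codes) -> str:
    """Serialise computers and codes to a single text document."""
    data = {
        "computers": [asdict(computer) for computer in computers],
        "codes": [asdict(code) for code in codes],
    }
    return json.dumps(data, indent=2)


def load_setup(text: str):
    """Recreate the specs from text and check that each code has its computer."""
    data = json.loads(text)
    computers = [ComputerSpec(**entry) for entry in data["computers"]]
    codes = [CodeSpec(**entry) for entry in data["codes"]]
    known = {computer.label for computer in computers}
    for code in codes:
        if code.computer not in known:
            raise ValueError(
                f"code {code.label!r} refers to unknown computer {code.computer!r}"
            )
    return computers, codes


if __name__ == "__main__":
    computers = [ComputerSpec("localhost", "localhost", "core.local", "core.direct")]
    codes = [CodeSpec("pw", "localhost", "/usr/bin/pw.x")]
    print(export_setup(computers, codes))
```

This is exactly the inflexible, fixed-schema kind of solution: it works for one known set of fields, whereas a generic version would have to derive the fields from whatever model each plugin defines.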

Ah yes, it seems I projected my desires on that issue quite a bit, thanks for correcting @sphuber.

I think we may be hijacking this topic for a discussion that is somewhat unrelated though, so I will resist the urge to respond here and open a new discussion in the Developer category later today, summarizing the notes above.
