Performance issues with verdi commands `verdi process` and `verdi storage`

Hello all,

@npaulish has been having performance issues with some verdi commands and we are trying to figure out what they relate to. We first thought this was due to the massive number of loose files that are not packed, but we also see issues with commands like verdi process list, which should not access the file repository at all and yet are similarly slow. In comparison, verdi profile list did not show any speed issues, so I am a bit puzzled. @mbercx mentioned that it might be because of slow Python code that is run during these commands.

Here are the top 5 functions sorted by percall and tottime using cProfile (see the attachments for the full logs).
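For anyone who wants to reproduce these numbers, here is a rough sketch of how such a profile could be collected and summarized in Python. The import path follows aiida-core's verdi console-script entry point, and note that pstats offers no percall sort key, so the percall ordering shown below was presumably produced by separate post-processing; this is only an illustration, not the exact procedure used for the attachments.

```python
# Rough sketch: profile `verdi process list` programmatically and print the
# top functions. Assumes aiida-core's click entry point `verdi`; adjust the
# args for other commands such as ['storage', 'info'].
import cProfile
import pstats

from aiida.cmdline.commands.cmd_verdi import verdi

profiler = cProfile.Profile()
profiler.enable()
try:
    verdi.main(args=['process', 'list'], standalone_mode=False)
finally:
    profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('tottime').print_stats(5)     # top 5 by time spent in the function itself
stats.sort_stats('cumulative').print_stats(5)  # top 5 by cumulative time (callees included)
```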

verdi process list percall

         8456574 function calls (8421386 primitive calls) in 36.534 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   12.884    6.442   13.091    6.545 pathlib.py:532(_select_from)
        3   15.762    5.254   15.767    5.256 {method 'fetchmany' of 'psycopg2.extensions.cursor' objects}
       24    3.810    0.159    3.810    0.159 {method 'execute' of 'psycopg2.extensions.cursor' objects}
        2    0.095    0.047   13.186    6.593 pathlib.py:1130(glob)
        1    0.013    0.013    0.013    0.013 context.py:64(__init__)

verdi process list tottime

         8456574 function calls (8421386 primitive calls) in 36.534 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   15.762    5.254   15.767    5.256 {method 'fetchmany' of 'psycopg2.extensions.cursor' objects}
        2   12.884    6.442   13.091    6.545 pathlib.py:532(_select_from)
       24    3.810    0.159    3.810    0.159 {method 'execute' of 'psycopg2.extensions.cursor' objects}
   747601    0.435    0.000    1.242    0.000 _py39compat.py:25(ep_matches)
   747601    0.388    0.000    0.748    0.000 __init__.py:241(matches)

verdi storage info percall

         6587071 function calls (6555618 primitive calls) in 17.590 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   13.256    6.628   13.449    6.724 pathlib.py:532(_select_from)
       21    0.896    0.043    0.896    0.043 {method 'execute' of 'psycopg2.extensions.cursor' objects}
        2    0.085    0.043   13.535    6.767 pathlib.py:1130(glob)
        2    0.019    0.010    0.020    0.010 {built-in method psycopg2._psycopg._connect}
    46/45    0.032    0.001    0.033    0.001 {built-in method _imp.create_dynamic}

verdi storage info tottime

         6587071 function calls (6555618 primitive calls) in 17.590 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   13.256    6.628   13.449    6.724 pathlib.py:532(_select_from)
       21    0.896    0.043    0.896    0.043 {method 'execute' of 'psycopg2.extensions.cursor' objects}
   541061    0.323    0.000    0.914    0.000 _py39compat.py:25(ep_matches)
   541061    0.283    0.000    0.548    0.000 __init__.py:241(matches)
     1250    0.239    0.000    0.239    0.000 {built-in method io.open_code}

Environment and other information

$ verdi status
 ✔ version:     AiiDA v2.3.1
 ✔ config:      /home/aiida/envs/aiida-fermisurf/.aiida
 ✔ profile:     main
 ✔ storage:     Storage for 'main' [open] @ postgresql://aiida:***@localhost:5432/aiida-fermisurf / DiskObjectStoreRepository: 580f68c6eb6347b7a203030615ef1d40 | /hith/aiida-fermisurf/repositories/main/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.8.2 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 1193148
$ verdi storage info
entities:
  Users:
    count: 10
  Computers:
    count: 31
  Nodes:
    count: 4734923
  Groups:
    count: 331
  Comments:
    count: 0
  Logs:
    count: 1203422
  Links:
    count: 8936853
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1

Sizes in the repository folder /hith/aiida-fermisurf/repositories/main/container. Note the 2 TB loose folder.

$ du -h --max-depth=1
4.0K	./duplicates
4.0K	./sandbox
2.0T	./loose
4.0K	./packs
2.0T	.

cProfile-verdi_storage_info-20top_tottime.txt (1.7 KB)
cProfile-verdi_storage_info-20top_percall.txt (1.7 KB)
cProfile-verdi_process_list-20top_tottime.txt (1.7 KB)
cProfile-verdi_process_list-20top_percall.txt (1.7 KB)

raw-cProfile-verdi_storage_info.txt (652.3 KB)
raw-cProfile-verdi_process_list.txt (706.6 KB)

Thanks a lot @Alex for the detailed report.
Stupid question: verdi storage maintain didn’t help?

We have not tried it yet, because it might take a whole night to finish, and the fact that commands which do not actually access the files are also slow does not suggest that the packing done during maintain would help. I think the storage info is read directly from the SQL database (at least I can rm all my repository files and still get this information). I am not sure whether maintain does anything other than packing that might help with performance.

EDIT: replaced migrate with maintain, as I had confused the two.

My suggestion is based on your cProfile results, which indicate that pathlib.py is going crazy searching for paths. In theory, if the data were properly stored as objects, the code should not need to search for paths so intensively. But maybe it’s just a naive thought…

I am really confused as to why pathlib.py shows up as a major contributor in the profiling. But then again, I have always found that output hard to parse and hard to interpret in terms of what is actually going on.

The verdi process list profile reports 36 seconds. Is that also roughly the actual runtime when you don’t profile? Could you please give that number? Even if it were only half of that, it would still be way too slow.

That being said, we are dealing with quite a large DB at almost 5 million nodes. What machine is this running on, and on what type of disk is the postgres database stored? Do you experience any slowness with other tools on the machine that could indicate a problem with the file system, perhaps? Note that here I am not thinking of the file repository, as this should not be accessed whatsoever when doing verdi process list. I also don’t expect verdi storage maintain to make much of a difference, even though it may actually do a VACUUM of the psql database (not sure anymore).

As for the phenomenology, the verdi process list command essentially has two parts: 1) list the matched processes, and 2) check the number of active processes and compare that with the number of daemon slots. This last part is done after printing the table of matched processes and can be slow. Can you describe after how much time the table appears, and how much longer the command takes to display the final message of "X out of Y slots of the daemon occupied", or something like that?
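For illustration, the second part boils down to a count of "active" processes, roughly along these lines. This is a sketch using the public QueryBuilder API, not the actual implementation behind verdi process list:

```python
# Sketch of the kind of query implied by part 2): count processes whose
# process_state is still active. Not the exact code used by `verdi process
# list`, just an illustration with the public AiiDA API.
from aiida import load_profile, orm

load_profile()

qb = orm.QueryBuilder()
qb.append(
    orm.ProcessNode,
    filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
)
print('active processes:', qb.count())
```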

Final thing: she seems to have a lot of “dead” processes that are over 70 days old. She probably doesn’t care about these anymore, and they can slow things down a bit (although that won’t explain everything). To get rid of them, she could simply delete those nodes. A neat little trick to do this easily is:

verdi node delete $(verdi process list --raw -P pk | tr '\n' ' ')

It essentially fetches all active processes, prints them as “raw” output (so no headers or other log info) and projects just the pk. The pks are then joined into a single space-separated line and fed to verdi node delete.
WARNING: this will delete all processes that are “active”. So only do this when you have no actual processes running that should be kept.

We reran the commands with time and got the same timings as with cProfile.

Here is a local backtrace from my machine. The call goes into glob, which in turn calls _select_from. It is roughly the same for both commands.

-> return ctx.invoke(self.callback, **ctx.params)
  /home/alexgo/micromamba/envs/aiida-dev/lib/python3.11/site-packages/click/core.py(783)invoke()
-> return __callback(*args, **kwargs)
  /home/alexgo/micromamba/envs/aiida-dev/lib/python3.11/site-packages/click/decorators.py(33)new_func()
-> return f(get_current_context(), *args, **kwargs)
  /home/alexgo/code/aiida-core/src/aiida/cmdline/utils/decorators.py(98)wrapper()
-> load_backend_if_not_loaded()
  /home/alexgo/code/aiida-core/src/aiida/cmdline/utils/decorators.py(76)load_backend_if_not_loaded()
-> manager.get_profile_storage()  # This will load the backend of the loaded profile, if not already loaded
  /home/alexgo/code/aiida-core/src/aiida/manage/manager.py(262)get_profile_storage()
-> ProfileAccessManager(profile).request_access()
  /home/alexgo/code/aiida-core/src/aiida/manage/profile_access.py(67)request_access()
-> self._raise_if_locked(error_message)
  /home/alexgo/code/aiida-core/src/aiida/manage/profile_access.py(200)_raise_if_locked()
-> list_of_files = self._get_tracking_files('.lock', exclude_self=True)
  /home/alexgo/code/aiida-core/src/aiida/manage/profile_access.py(187)_get_tracking_files()
-> list_of_files = [filepath for filepath in path_iterator if filepath != filepath_self]
  /home/alexgo/code/aiida-core/src/aiida/manage/profile_access.py(187)<listcomp>()
-> list_of_files = [filepath for filepath in path_iterator if filepath != filepath_self]
> /home/alexgo/micromamba/envs/aiida-dev/lib/python3.11/pathlib.py(954)glob()
-> for p in selector.select_from(self):
> /home/alexgo/micromamba/envs/aiida-dev/lib/python3.11/pathlib.py(357)_select_from()

It might be what @sphuber mentions, i.e. that the dead processes are somehow making the _get_tracking_files operation quite slow. I will check whether killing the dead processes helps (I dimly remember that $AIIDA_PATH/.aiida/access was 4 GB large).

backtrack-verdi_process_list.txt (2.7 KB)
backtrack-verdi_storage_info.txt (2.5 KB)

Ah, the globs come from the profile access code. A .pid file is created in .aiida/access/{profile_name} each time the profile is loaded. These are not always cleaned up, for technical reasons; they are cleaned when verdi storage maintain is run. Since that hasn’t been run, according to your report, there is probably a large accumulation of stale pid files. If you are not running any AiiDA processes, it is safe to delete them. Otherwise you can simply run verdi storage maintain, but that will also run the repository optimization, which can take a long time.
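For instance, something along these lines could be used to check how many tracking files have accumulated before deciding whether to remove them. The path layout follows the traceback above, but the AIIDA_PATH and profile name here are assumptions; adjust them to your setup, and only delete the files when no daemon or other AiiDA process is using the profile:

```python
# Sketch: count the tracking files in the profile access directory. The path
# assumes the default AIIDA_PATH (~/.aiida) and a profile named 'main'.
from pathlib import Path

access_dir = Path.home() / '.aiida' / 'access' / 'main'
tracking_files = sorted(access_dir.glob('*'))
print(f'{len(tracking_files)} tracking files in {access_dir}')

# Uncomment to actually remove them (only when nothing is using the profile):
# for path in tracking_files:
#     path.unlink()
```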

I am pretty sure that if you clean that folder, you will see a reduction in run time. I have never seen a case where performance was affected this badly. We may need a separate “hook” to run the cleanup without having to run the full storage maintenance.

Indeed, we should add a few hooks to clean up those files (when it’s safe to do so), even if it takes some time.
For me, one natural place is on verdi daemon (re)start, since it is fine if that takes a few extra seconds.
That is still not enough (people might not restart the daemon for many days or weeks), but it is already something.

I would also think about some other interactive commands that at least check and warn the user that they should run some maintenance (not necessarily a full storage maintain; I would add a simple command just for this). Some good candidates: verdi shell, verdi process list. Opinions?

Also, it would be good to compare the cost of cleaning up against the (apparently already slow) cost of checking the profile lock files, and to consider whether it is OK to perform the cleanup in more places. The only thing to be careful about is that we did put some thought into the logic of those pid files: we need to avoid deleting a file that should in fact not be deleted, in case of race conditions (e.g. one of the cleaning commands running while a pid file is being created by the opening of another shell, the execution of another verdi run command, profile loading, etc.).

I’m not saying that there is a problem (maybe things are done properly and there is no problem), but I’d invest some time to think about possible unwanted effects.
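As food for thought, one ingredient of such a conservative cleanup could be to only remove a tracking file when the process it refers to is no longer alive. Here is a rough sketch: it is not AiiDA’s actual cleanup logic, it assumes the file stem encodes the PID (as the profile access code in the traceback above suggests), and it does not by itself solve the creation race discussed above:

```python
# Sketch of a conservative cleanup: remove a tracking file only if the process
# it refers to no longer exists. Assumes the file stem is the PID (e.g.
# '12345.pid'); illustration only, not AiiDA's own logic.
import os
from pathlib import Path


def clean_stale_tracking_files(access_dir: Path) -> int:
    """Remove tracking files whose PID no longer belongs to a live process."""
    removed = 0
    for path in access_dir.glob('*'):
        try:
            pid = int(path.stem)
        except ValueError:
            continue  # unexpected file name, leave it alone
        try:
            os.kill(pid, 0)  # signal 0 only checks whether the process exists
        except ProcessLookupError:
            path.unlink(missing_ok=True)  # process is gone, the file is stale
            removed += 1
        except PermissionError:
            pass  # process exists but belongs to another user, keep the file
    return removed
```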