Hi,
I experienced a host crash and forced shutdown, during which the running AiiDA daemons were forcibly stopped. I then migrated the surviving database to a new machine (verdi status is OK), but I see the following error when running verdi process list -a:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected
[SQL: SELECT count(*) AS count_1
FROM (SELECT db_dbnode_1.id AS db_dbnode_1_id, db_dbnode_1.uuid AS db_dbnode_1_uuid, db_dbnode_1.node_type AS db_dbnode_1_node_type, db_dbnode_1.process_type AS db_dbnode_1_process_type, db_dbnode_1.label AS db_dbnode_1_label, db_dbnode_1.description AS db_dbnode_1_description, db_dbnode_1.ctime AS db_dbnode_1_ctime, db_dbnode_1.mtime AS db_dbnode_1_mtime, db_dbnode_1.attributes AS db_dbnode_1_attributes, db_dbnode_1.extras AS db_dbnode_1_extras, db_dbnode_1.repository_metadata AS db_dbnode_1_repository_metadata, db_dbnode_1.dbcomputer_id AS db_dbnode_1_dbcomputer_id, db_dbnode_1.user_id AS db_dbnode_1_user_id
FROM db_dbnode AS db_dbnode_1
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE %(param_1)s AND CASE WHEN (jsonb_typeof((db_dbnode_1.attributes #> %(attributes_1)s)) = %(jsonb_typeof_1)s) THEN (db_dbnode_1.attributes #>> %(attributes_1)s) IN (%(param_2_1)s, %(param_2_2)s, %(param_2_3)s) ELSE %(param_3)s END) AS anon_1]
[parameters: {'param_1': 'process.%', 'attributes_1': '{process_state}', 'jsonb_typeof_1': 'string', 'param_3': False, 'param_2_1': 'created', 'param_2_2': 'waiting', 'param_2_3': 'running'}]
(Background on this error at: Error Messages — SQLAlchemy 2.0 Documentation)
And I couldn’t extract the inputs and outputs of many nodes through eg: verdi calcjob inputcat 1442507
Critical: Could not open output path "aiida.in". Exception: object with key edf0f6365d101d7518a412d9910ed96ced2b60dd6fa39f286c2cb3501f4d9dfd does not exist.
Some PostgreSQL server logs include:
ERROR: invalid page in block 2075 of relation base/16849/16946
ERROR: invalid page in block 2084 of relation base/16849/16946
Any ideas on fixing this database issue from the above error messages?
Many thanks,
Binbin
Hi, admittedly I never saw these errors, but most of them seem to be related to a corrupted DB.
I have no idea about the SSL SYSCALL error, but the "invalid page in block" messages look like a DB corruption issue; see e.g. here (first link I found).
Since I'm not sure what you mean by "you migrated the remaining database to a new machine" or how exactly you did it, I can't help much. Did you move the pg_data folder? It would be better to dump to a file and reload from that dump, knowing that you might have lost some data.
For the "object with key edf0f6365d101d7518a412d9910ed96ced2b60dd6fa39f286c2cb3501f4d9dfd does not exist." error, this is a bit different: it means the DB points to a file in the disk-objectstore repository, but the file is not there. Does it happen with any calculation, or only with the very last ones you ran? In the first case, it might be a misconfiguration. You say you migrated to a different machine. Did you also move the disk-objectstore repository pointed to in the config.json file, or did you just move the DB? You need to move both together (files are stored in the repository, not inside the DB).
Hi Giovanni,
It happens not only for the very last runs but also for many previous calculations. However, for some calculation nodes, one of verdi calcjob inputcat/outputcat does work. In fact, if I run verdi storage maintain --full, it aborts with "RuntimeError: There are objects referenced in the database that are not present in the repository".
As for pg_data, it is on the SSD (I use PostgreSQL), so I made an image copy of that SSD and restored it. For the AiiDA repository, we recovered the .aiida folder and migrated it to the new machine under the same path.
I also checked the problematic node 1442507 in the database. See below:
id      | uuid                                 | node_type                                | repository_metadata
--------+--------------------------------------+------------------------------------------+--------------------------------------------
1442507 | fae2e690-7da8-4553-a811-fdf481704c4f | process.calculation.calcjob.CalcJobNode. | {"o": {".aiida": {"o": {"calcinfo.json": {"k": "f90c39e1fab8e18ab0088711d402b85c5b63f79a15ae05b9042dcd68f6aa12af"}, "job_tmpl.json": {"k": "6cdee62417eed945511f77af9f2a97656471b560bf8ee28b6b9eb6ddfb915b40"}}}, "aiida.in": {"k": "edf0f6365d101d7518a412d9910ed96ced2b60dd6fa39f286c2cb3501f4d9dfd"}, "_aiidasubmit.sh": {"k": "a76c8fa89962fc67820b9e16c535ac362d5c3dccf94c530c9436765c3b222345"}}}
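For reference, the sha256 object keys a node references can be extracted from its repository_metadata with a small recursive helper. This is a minimal sketch in plain Python (the helper name collect_hashes is mine; the nested "o"/"k" layout matches the JSON shown above, where "o" entries are directories and "k" entries are object keys):

```python
def collect_hashes(repo_metadata):
    """Recursively collect all sha256 object keys ("k" entries)
    from a node's repository_metadata dictionary."""
    hashes = []
    for name, entry in repo_metadata.get("o", {}).items():
        if "k" in entry:  # a file: "k" holds the object key in the repository
            hashes.append(entry["k"])
        if "o" in entry:  # a directory: recurse into its contents
            hashes.extend(collect_hashes(entry))
    return hashes

# The repository_metadata of node 1442507 shown above:
meta = {
    "o": {
        ".aiida": {"o": {
            "calcinfo.json": {"k": "f90c39e1fab8e18ab0088711d402b85c5b63f79a15ae05b9042dcd68f6aa12af"},
            "job_tmpl.json": {"k": "6cdee62417eed945511f77af9f2a97656471b560bf8ee28b6b9eb6ddfb915b40"},
        }},
        "aiida.in": {"k": "edf0f6365d101d7518a412d9910ed96ced2b60dd6fa39f286c2cb3501f4d9dfd"},
        "_aiidasubmit.sh": {"k": "a76c8fa89962fc67820b9e16c535ac362d5c3dccf94c530c9436765c3b222345"},
    }
}
print(collect_hashes(meta))  # 4 hashes, including the missing aiida.in key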
What would you suggest to check further if it is the database issue or the repository issue?
Thanks,
Binbin
Mmm, I see. I think then that if the disks broke, the states of the database and of the disk-objectstore are probably inconsistent. This should normally not happen, but I guess it could if the computer shut down incorrectly and some files for the repository were not yet stored on disk, or e.g. if a database transaction didn't complete.
Were the database (pg_data folder) and the disk-objectstore repo on the same partition, or on different ones? If on different ones, this could also explain the issue (probably only one of the two failed, so there is a discrepancy in the time of the last "save"). From the description of your case (DB entries with pointers to non-existing objects in the repo), it looks like the partition with the disk-objectstore is the one that had issues.
What you could do:

- Go into the .aiida/config.json file and get the filepath of the repository (the relevant entry in the JSON should be under storage → config → filepath), then go into that folder. There should be a container subfolder. From the command line, run: dostore -p container validate. At the end, it should tell you whether there are inconsistencies.

- If you want to get a sense of how many files are missing, loop over the nodes, check the repository metadata (recursively) and, for every sha256 hash you find, check whether the object is in the disk-objectstore or not. This is most easily done from a verdi shell, which you should open from the same folder. Run:

  import disk_objectstore as dos
  container = dos.Container('container')

  Depending on what is easiest and how many objects you have, you can get the full list of objects with container.list_all_objects(), or use e.g.:

  for meta in container.get_objects_meta(LIST_OF_HASHES, skip_if_missing=True):
      print(meta)

  which is more efficient, as it does not load everything into memory, and will return the metadata of each existing object (skipping hashes that are not found). Use whichever you find easier (the first is simpler and in practice fine if you don't have tens of millions of files).
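If the disk_objectstore package is not at hand, a rough existence check on the loose objects can also be done directly on the filesystem. This is only a sketch under an assumption: by default, disk-objectstore keeps loose objects under container/loose/, sharded by the first two hex characters of the hash, and loose_object_exists is a hypothetical helper based on that default layout (packed objects would not be found this way):

```python
import os

def loose_object_exists(container_path, hashkey):
    """Check whether a loose object exists on disk, assuming the
    default layout: <container>/loose/<first 2 hex chars>/<rest>."""
    return os.path.isfile(
        os.path.join(container_path, "loose", hashkey[:2], hashkey[2:])
    )

# Hypothetical usage: report which of a node's hashes are missing.
hashes = [
    "edf0f6365d101d7518a412d9910ed96ced2b60dd6fa39f286c2cb3501f4d9dfd",
]
missing = [h for h in hashes if not loose_object_exists("container", h)]
print(f"{len(missing)} of {len(hashes)} objects missing")
```

Prefer the container API above when possible, since it also checks the packs; this filesystem check only covers loose objects.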
Thanks Giovanni!
In my setup the PostgreSQL database directory (pg_data) and the AiiDA disk-objectstore repository are on two different partitions (/root on SSD for the DB and /home on HDD for the repo).
I validated the recovered repository with: dostore -p container validate
Loose objects: 100%| 795252/795252 [3:57:01<00:00, 55.92it/s]
Error! 6049 objects with error 'invalid_hashes_loose'
So for 0.76% of the loose objects the content hash doesn't match the filename (sha256), which indicates corruption/incomplete copies. I also checked for missing files/hashes: 110303 missing files for 99257 scanned nodes… And my recovered container/packs folder is empty; is that normal?
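For context, 'invalid_hashes_loose' means that re-hashing a loose object's content no longer reproduces the filename it is stored under. That check can be reproduced with a few lines of plain Python (a sketch, again assuming the default loose/<first 2 hex chars>/<rest> layout; loose_hash_is_valid is my own helper, not part of the dostore CLI):

```python
import hashlib
import os

def loose_hash_is_valid(container_path, hashkey):
    """Re-hash a loose object's content and compare it with its
    filename-derived key (essentially the 'invalid_hashes_loose' check)."""
    path = os.path.join(container_path, "loose", hashkey[:2], hashkey[2:])
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest() == hashkey
```

An object failing this check was either truncated during the copy or corrupted on the failing disk, so its original content cannot be recovered from the repository alone.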
Thanks!
Binbin