Dear all,
Over the last few weeks I have been struggling a lot with calculations that get stuck in the created state, sometimes for days. Sometimes they start again after restarting the daemon and/or repairing processes; many other times nothing changes and I have to delete the calculation and launch it again. I was using aiida-core 2.6.1 and tried updating to 2.6.2, but the problem persists.
Do you have suggestions?
Thank you.
Hi @Davide_Bidoggia, could you please provide some more information? What is the output of `verdi status`?
Hi @sphuber, this is the output:
✔ version: AiiDA v2.6.2
✔ config: /home/bidoggia/.aiida
✔ profile: bidoggia
✔ storage: Storage for 'bidoggia' [open] @ postgresql://aiida_qs_bidoggia_1a762e6ea970b7891456fe9e1265d632:***@localhost:5432/bidoggia_bidoggia_1a762e6ea970b7891456fe9e1265d632 / DiskObjectStoreRepository: caa1c1ad4ce44ccfb08f0ca4f0b3cdda | /home/bidoggia/.aiida/repository/bidoggia/container
✔ broker: RabbitMQ v3.9.13 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
✔ daemon: Daemon is running with PID 1137937
Since you are using RabbitMQ v3.9.13, did you follow the instructions to configure the `consumer_timeout`?
This is important for AiiDA to function properly, although it should not really explain the behavior you are describing. What would be very useful is some more information the next time a process you launch gets stuck in created. In that case, please check the daemon log, which should be at `~/.aiida/daemon/log/aiida-{profile-name}.log`, and search it for the pk of the stuck process. Also check the output of `verdi process report` and `verdi node attributes`. Finally, stop the daemon and run `verdi process repair --dry-run` just to confirm the task is actually missing.
Also, maybe there is just a problem with your daemon. Please check `~/.aiida/daemon/log/circus-{profile-name}.log`. Maybe your daemon workers are getting killed often and restarted?
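Searching the daemon log for a pk can be done with `grep`, but a plain substring match also hits longer numbers that merely contain the pk. A minimal Python sketch of a stricter search follows; the helper names `find_pk_lines` and `search_daemon_log` are hypothetical (not part of AiiDA), and the log path is simply the one quoted above.

```python
import re
from pathlib import Path


def find_pk_lines(log_text: str, pk: int) -> list[str]:
    """Return the log lines that mention `pk` as a standalone number."""
    # The lookarounds stop pk 328202 from matching inside 1328202 or 3282020.
    pattern = re.compile(rf"(?<!\d){pk}(?!\d)")
    return [line for line in log_text.splitlines() if pattern.search(line)]


def search_daemon_log(profile: str, pk: int) -> list[str]:
    """Search the daemon log of `profile` for `pk` (path as quoted in the thread)."""
    log_path = Path.home() / ".aiida" / "daemon" / "log" / f"aiida-{profile}.log"
    # errors="replace" keeps the search going if the log has odd bytes in it.
    return find_pk_lines(log_path.read_text(errors="replace"), pk)
```

For example, `search_daemon_log("bidoggia", 328202)` would return every line of that profile's daemon log that mentions the process. An empty list would suggest that no daemon worker ever logged anything about it.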
Thank you @sphuber. Lately I have had even more difficulties: calculations no longer get stuck in the created state, but mostly in the running state, between one method and the next of those specified in the spec.outline of my workflow (a workflow that I ran several times without problems with both aiida-core 2.5.1 and 2.6.0). The previous method seems to finish properly but the next one does not start; if I restart the daemon, the previous method is run again and then the process gets stuck again.
Furthermore, I have also started encountering problems with exposing outputs (not in every run, but over many runs it has become more and more frequent). For example, a PwCalculation (it also happens with other plugins) finishes properly with all its outputs, but PwBaseWorkChain complains:
2024-09-15 12:27:36 [199230 | REPORT]: [326964|PwBaseWorkChain|_attach_outputs]: required output `output_parameters` was not an output of PwCalculation<326973> (or an incorrect class/output is being exposed).
Looking at the log files you suggested, I could not find anything interesting.
I thought my installation was somehow corrupted so, after a backup, I uninstalled AiiDA, PostgreSQL and RabbitMQ and installed them again. I tried both aiida-core 2.5.1 and 2.6.2 and I still have the same problem with exposed outputs. In the tests I have done so far, processes no longer get stuck in created or running.
This is the current output of verdi status:
✔ version: AiiDA v2.6.2
✔ config: /home/bidoggia/.aiida
✔ profile: bidoggia
✔ storage: Storage for 'bidoggia' [open] @ postgresql://bidoggia:***@localhost:5432/aiida_db_bidoggia2 / DiskObjectStoreRepository: caa1c1ad4ce44ccfb08f0ca4f0b3cdda | /media/bidoggia/aiida/repository/bidoggia/container
✔ broker: RabbitMQ v3.13.7 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
/home/bidoggia/py_envs/aiida/lib/python3.10/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
"cipher": algorithms.TripleDES,
/home/bidoggia/py_envs/aiida/lib/python3.10/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
"class": algorithms.TripleDES,
✔ daemon: Daemon is running with PID 10120
Thank you for your help!
Thanks @Davide_Bidoggia. Sorry to hear you are still experiencing issues. So to summarize, after your reinstall you:
- no longer have the original problem with calculations that stall
- now only experience problems with the `BaseRestartWorkChain` functionality
Correct?
Is it possible that the workchains that have problems with the attaching of outputs were launched with an older version of aiida-core (i.e. before you reinstalled everything) and continued after the clean install? Does it happen at all for any new workchains that you run now?
To try and diagnose further, could you share the `verdi process report` of a workchain that failed with that error?
Hi @sphuber. Yes, the summary is correct.
I have just launched, with the new installation, a simple `PwBaseWorkChain` without using my workflow (I’m using aiida-quantumespresso==4.5.0). Below are the outputs of both `verdi process show` and `verdi process report` for both the `PwBaseWorkChain` and the called `PwCalculation`.
Property     Value
-----------  -------------------------------------------------------------
type         PwBaseWorkChain
state        Finished [11] The process did not register a required output.
pk           328199
uuid         f0de629c-7c8f-46ad-acbd-19c1e31d390f
label
description
ctime        2024-09-17 09:48:02.776823+02:00
mtime        2024-09-17 09:49:58.290027+02:00

Inputs                PK      Type
--------------------  ------  -------------
pw
    pseudos
        Te            107     UpfData
        W             68      UpfData
    code              252     InstalledCode
    structure         328190  StructureData
    parameters        328193  Dict
    parallelization   328194  Dict
clean_workdir         328195  Bool
kpoints               328192  KpointsData
kpoints_distance      328196  Float
kpoints_force_parity  328197  Bool
max_iterations        328198  Int

Outputs        PK      Type
-------------  ------  ----------
remote_folder  328203  RemoteData
retrieved      328204  FolderData

Called        PK      Type
------------  ------  -------------
iteration_01  328202  PwCalculation

Log messages
---------------------------------------------
There are 5 log messages for this calculation
Run 'verdi process report 328199' to see them
2024-09-17 09:48:03 [199704 | REPORT]: [328199|PwBaseWorkChain|run_process]: launching PwCalculation<328202> iteration #1
2024-09-17 09:49:58 [199705 | REPORT]: [328199|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<328202> does not have `output_band` output, skipping sanity check.
2024-09-17 09:49:58 [199706 | REPORT]: [328199|PwBaseWorkChain|results]: work chain completed after 1 iterations
2024-09-17 09:49:58 [199707 | REPORT]: [328199|PwBaseWorkChain|_attach_outputs]: required output `output_parameters` was not an output of PwCalculation<328202> (or an incorrect class/output is being exposed).
2024-09-17 09:49:58 [199708 | REPORT]: [328199|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
Property     Value
-----------  ------------------------------------
type         PwCalculation
state        Finished [0]
pk           328202
uuid         a69d5816-37fb-4f4b-87c8-73cf8c93c2f7
label
description  wte2
ctime        2024-09-17 09:48:03.558797+02:00
mtime        2024-09-17 09:49:59.121887+02:00
computer     [3] leo1_scratch_bind

Inputs           PK      Type
---------------  ------  -------------
pseudos
    Te           107     UpfData
    W            68      UpfData
code             252     InstalledCode
kpoints          328192  KpointsData
parallelization  328194  Dict
parameters       328200  Dict
settings         328201  Dict
structure        328190  StructureData

Outputs            PK      Type
-----------------  ------  --------------
output_band        328205  BandsData
output_parameters  328208  Dict
output_structure   328207  StructureData
output_trajectory  328206  TrajectoryData
remote_folder      328203  RemoteData
retrieved          328204  FolderData

Caller        PK      Type
------------  ------  ---------------
iteration_01  328199  PwBaseWorkChain
*** 328202: None
*** (empty scheduler output file)
*** Scheduler errors:
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
*** 0 LOG MESSAGES
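The mismatch between the two listings above can be checked mechanically. The following is only a sketch: the regular expression is written against the exact `_attach_outputs` report line shown above, the helper name `parse_missing_output` is hypothetical, and the set of output labels is copied by hand from the `verdi process show` output of PwCalculation<328202> rather than queried from AiiDA.

```python
import re

# Pattern modelled on the report line quoted in this thread (an assumption:
# other AiiDA versions may word the message differently).
MISSING_OUTPUT_RE = re.compile(
    r"required output `(?P<label>\w+)` was not an output of "
    r"(?P<process>\w+)<(?P<pk>\d+)>"
)


def parse_missing_output(report_line: str):
    """Return (output label, process class, pk) or None if the line does not match."""
    match = MISSING_OUTPUT_RE.search(report_line)
    if match is None:
        return None
    return match["label"], match["process"], int(match["pk"])


REPORT_LINE = (
    "2024-09-17 09:49:58 [199707 | REPORT]: "
    "[328199|PwBaseWorkChain|_attach_outputs]: required output "
    "`output_parameters` was not an output of PwCalculation<328202> "
    "(or an incorrect class/output is being exposed)."
)

# Output labels of PwCalculation<328202>, copied from the listing above.
CALC_OUTPUTS = {
    "output_band", "output_parameters", "output_structure",
    "output_trajectory", "remote_folder", "retrieved",
}

label, process, pk = parse_missing_output(REPORT_LINE)
# Prints True: the output reported as missing is present on the finished calculation.
print(label in CALC_OUTPUTS)
```

In other words, the output the work chain reported as missing is listed on the calculation once it has finished, which is the inconsistency described in this post.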