Calculations get stuck in "created" state

Dear all,
Over the last few weeks I have been struggling a lot with calculations that get stuck in the created state, sometimes for days. Sometimes they start again after I restart the daemon and/or repair the processes; many other times nothing changes and I have to delete the calculation and launch it again. I was using aiida-core 2.6.1 and tried updating to 2.6.2, but the problem persists.
Do you have suggestions?
Thank you.

Hi @Davide_Bidoggia, could you please provide some more information? What is the output of verdi status?

Hi @sphuber, this is the output:

 ✔ version:     AiiDA v2.6.2
 ✔ config:      /home/bidoggia/.aiida
 ✔ profile:     bidoggia
 ✔ storage:     Storage for 'bidoggia' [open] @ postgresql://aiida_qs_bidoggia_1a762e6ea970b7891456fe9e1265d632:***@localhost:5432/bidoggia_bidoggia_1a762e6ea970b7891456fe9e1265d632 / DiskObjectStoreRepository: caa1c1ad4ce44ccfb08f0ca4f0b3cdda | /home/bidoggia/.aiida/repository/bidoggia/container
 ✔ broker:      RabbitMQ v3.9.13 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 1137937

Since you are using RabbitMQ v3.9.13, did you follow the instructions to configure the consumer_timeout?

This is important for AiiDA to function properly, although it should not really explain the behavior you are describing. What would be very useful is some more information the next time you launch a process and it gets stuck in created. In that case, please check the daemon logs. They should be in ~/.aiida/daemon/log/aiida-{profile-name}.log. Search the log for the pk of the stuck process. Also check the output of verdi process report and verdi node attributes. Finally, stop the daemon and run verdi process repair --dry-run, just to confirm the task is actually missing.

Also, maybe there is just a problem with your daemon. Please check the ~/.aiida/daemon/log/circus-{profile-name}.log. Maybe your daemon workers are getting killed often and are restarted?
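
The log search suggested above can be sketched in a few lines of Python. This is only an illustration, not part of AiiDA: the function name `find_pk_lines` is hypothetical, and it simply scans a text file for lines that mention a given pk as a whole number.

```python
import re
from pathlib import Path


def find_pk_lines(log_path, pk):
    """Return (line_number, text) pairs for log lines that mention `pk`."""
    # Word boundaries stop pk 1234 from also matching inside 12345.
    pattern = re.compile(rf"\b{int(pk)}\b")
    matches = []
    path = Path(log_path).expanduser()
    with path.open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if pattern.search(line):
                matches.append((number, line.rstrip("\n")))
    return matches
```

It could be called with the path of the daemon log for your profile (following the ~/.aiida/daemon/log/aiida-{profile-name}.log pattern) and the pk of the stuck process; the same function works on the circus log.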

Thank you @sphuber. Lately I have had even more difficulties: calculations no longer get stuck in the created state, but mostly in the running state, between one method and the next of those specified in the spec.outline of my workflow (a workflow that I have run several times without problems with both aiida-core 2.5.1 and 2.6.0). The previous method seems to finish properly but the next one does not start; if I restart the daemon, the previous method is run again and then the process gets stuck.
Furthermore, I have also started encountering problems with exposing outputs (not in every run, but over many runs it has become more and more frequent). For example, a PwCalculation (it also happens with other plugins) finishes properly with all its outputs, but PwBaseWorkChain complains:

2024-09-15 12:27:36 [199230 | REPORT]: [326964|PwBaseWorkChain|_attach_outputs]: required output `output_parameters` was not an output of PwCalculation<326973> (or an incorrect class/output is being exposed).

Looking at the log files you suggested I could not find anything interesting.

I thought my installation was somehow corrupted so, after making a backup, I uninstalled aiida, postgresql and rabbitmq and installed them again. I tried both aiida-core 2.5.1 and 2.6.2 and I still have the same problem with exposed outputs. In the tests I have done so far, processes no longer get stuck in created or running.

This is the current output of verdi status:

 ✔ version:     AiiDA v2.6.2
 ✔ config:      /home/bidoggia/.aiida
 ✔ profile:     bidoggia
 ✔ storage:     Storage for 'bidoggia' [open] @ postgresql://bidoggia:***@localhost:5432/aiida_db_bidoggia2 / DiskObjectStoreRepository: caa1c1ad4ce44ccfb08f0ca4f0b3cdda | /media/bidoggia/aiida/repository/bidoggia/container
 ✔ broker:      RabbitMQ v3.13.7 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
/home/bidoggia/py_envs/aiida/lib/python3.10/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/home/bidoggia/py_envs/aiida/lib/python3.10/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
 ✔ daemon:      Daemon is running with PID 10120

Thank you for your help!

Thanks @Davide_Bidoggia . Sorry to hear you are still experiencing issues. So to summarize, after your reinstall you:

  • no longer have the original problem with calculations that stall
  • now only experience problems with the BaseRestartWorkChain functionality

Correct?

Is it possible that the workchains that have problems with the attaching of outputs were launched with an older version of aiida-core (i.e. before you reinstalled everything) and continued after the clean install? Does it happen at all for any new workchains that you run now?

To try and diagnose further, could you share the verdi process report of a workchain that failed with that error?

Hi @sphuber. Yes, the summary is correct.
I have just launched, with the new installation, a simple PwBaseWorkChain without using my workflow (I’m using aiida-quantumespresso==4.5.0). Below are the outputs of verdi process show and verdi process report for both the PwBaseWorkChain and the called PwCalculation.

Property     Value
-----------  -------------------------------------------------------------
type         PwBaseWorkChain
state        Finished [11] The process did not register a required output.
pk           328199
uuid         f0de629c-7c8f-46ad-acbd-19c1e31d390f
label
description
ctime        2024-09-17 09:48:02.776823+02:00
mtime        2024-09-17 09:49:58.290027+02:00

Inputs                PK      Type
--------------------  ------  -------------
pw
    pseudos
        Te            107     UpfData
        W             68      UpfData
    code              252     InstalledCode
    structure         328190  StructureData
    parameters        328193  Dict
    parallelization   328194  Dict
clean_workdir         328195  Bool
kpoints               328192  KpointsData
kpoints_distance      328196  Float
kpoints_force_parity  328197  Bool
max_iterations        328198  Int

Outputs            PK  Type
-------------  ------  ----------
remote_folder  328203  RemoteData
retrieved      328204  FolderData

Called            PK  Type
------------  ------  -------------
iteration_01  328202  PwCalculation

Log messages
---------------------------------------------
There are 5 log messages for this calculation
Run 'verdi process report 328199' to see them

2024-09-17 09:48:03 [199704 | REPORT]: [328199|PwBaseWorkChain|run_process]: launching PwCalculation<328202> iteration #1
2024-09-17 09:49:58 [199705 | REPORT]: [328199|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<328202> does not have `output_band` output, skipping sanity check.
2024-09-17 09:49:58 [199706 | REPORT]: [328199|PwBaseWorkChain|results]: work chain completed after 1 iterations
2024-09-17 09:49:58 [199707 | REPORT]: [328199|PwBaseWorkChain|_attach_outputs]: required output `output_parameters` was not an output of PwCalculation<328202> (or an incorrect class/output is being exposed).
2024-09-17 09:49:58 [199708 | REPORT]: [328199|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned

Property     Value
-----------  ------------------------------------
type         PwCalculation
state        Finished [0]
pk           328202
uuid         a69d5816-37fb-4f4b-87c8-73cf8c93c2f7
label
description  wte2
ctime        2024-09-17 09:48:03.558797+02:00
mtime        2024-09-17 09:49:59.121887+02:00
computer     [3] leo1_scratch_bind

Inputs           PK      Type
---------------  ------  -------------
pseudos
    Te           107     UpfData
    W            68      UpfData
code             252     InstalledCode
kpoints          328192  KpointsData
parallelization  328194  Dict
parameters       328200  Dict
settings         328201  Dict
structure        328190  StructureData

Outputs                PK  Type
-----------------  ------  --------------
output_band        328205  BandsData
output_parameters  328208  Dict
output_structure   328207  StructureData
output_trajectory  328206  TrajectoryData
remote_folder      328203  RemoteData
retrieved          328204  FolderData

Caller            PK  Type
------------  ------  ---------------
iteration_01  328199  PwBaseWorkChain

*** 328202: None
*** (empty scheduler output file)
*** Scheduler errors:
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP
Warning: ieee_inexact is signaling
FORTRAN STOP

*** 0 LOG MESSAGES
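
To make the mismatch in the tables above explicit, here is a minimal sketch in plain Python that compares the output labels of the two nodes. It makes no AiiDA calls; the label sets are copied by hand from the two verdi process show tables, and the function name `unattached_outputs` is hypothetical.

```python
# Output labels copied from the two `verdi process show` tables above.
workchain_outputs = {"remote_folder", "retrieved"}  # PwBaseWorkChain<328199>
calculation_outputs = {  # PwCalculation<328202>
    "output_band",
    "output_parameters",
    "output_structure",
    "output_trajectory",
    "remote_folder",
    "retrieved",
}


def unattached_outputs(calculation, workchain):
    """Return, sorted, the labels present on the calculation but not on the work chain."""
    return sorted(set(calculation) - set(workchain))


missing = unattached_outputs(calculation_outputs, workchain_outputs)
print(missing)
# prints ['output_band', 'output_parameters', 'output_structure', 'output_trajectory']
```

So `output_parameters`, which the work chain report calls missing, is listed among the outputs of the calculation, and so is `output_band`, which the sanity check reported as absent.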