Excepted workchains, due to strange error from kiwipy/plumpy?

We think we have a solution, but we are still testing it, so it hasn’t been released yet. If you want to try it as well and report whether it solves your problem, that would be great.
You would have to do the following (check out the branch from my fork and install it):

git clone https://github.com/sphuber/aiida-core
cd aiida-core
git checkout fix/bump-engine-dependencies
pip install -e .

Then restart the daemon and try to run your workchains again.

And importantly, if not done already by the branch:

pip install aio-pika~=9.3

I tried the above steps, restarted the daemon and ran my workchains again. Now the issue is that verdi process list shows the Process status QUEUED and Process state Waiting even though I see that the jobs have finished running on the remote machine and all the outputs have already been written to the expected files and folders.

I double-checked, and pip install -e . also installs aio-pika automatically (9.4.0 in my case).
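For anyone who wants to check this from Python rather than by reading the pip output, here is a minimal sketch. It assumes plain numeric version strings such as 9.4.0 (pre-release suffixes are not handled), and the helper names `parse_release`, `is_compatible_release` and `installed_version` are made up for this example; they are not part of aiida-core or pip.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn '9.4.0' into (9, 4, 0); only plain numeric releases are handled."""
    return tuple(int(part) for part in text.split("."))


def is_compatible_release(installed, spec):
    """Approximate pip's '~=' operator for plain numeric versions.

    '~=9.3' means '>=9.3' and '==9.*': every component of the spec except
    the last must match exactly, and the installed version must not be
    older than the spec.
    """
    have, want = parse_release(installed), parse_release(spec)
    prefix = want[:-1]
    return have[: len(prefix)] == prefix and have >= want


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("aio-pika")
    if found is None:
        print("aio-pika is not installed in this environment")
    else:
        ok = is_compatible_release(found, "9.3")
        print(f"aio-pika {found}: {'satisfies' if ok else 'does not satisfy'} ~=9.3")
```

With this approximation, 9.4.0 satisfies ~=9.3, whereas an older release such as 6.8.1 does not.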

Are these processes that were launched with the previous versions? I am not sure that this is supported, because there may be some changes in the internal state of the processes. Or are these processes that were started after you installed the new version?

I killed all my processes launched with the previous versions and launched the workchains again after I installed the new version. So these are processes that were started after I installed the new version.

Did you make sure to run verdi daemon restart --reset after installation and before launching the processes?

Can you try running that now and then, after waiting a minute or so, verdi process play --all? The processes might be paused (and not shown correctly by verdi process list), and that might spring them back into action.

I ran verdi daemon restart --reset after installation and before launching the processes. I tried the same now and ran verdi process play --all, but the state remains unchanged at QUEUED.

Ok, could you please run

verdi daemon stop
verdi process repair
verdi daemon start

and report the output.
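If it is easier to capture everything in one go, the three commands above could be wrapped in a small script. This is only a sketch: `run_sequence` and `REPAIR_SEQUENCE` are hypothetical names invented here, and it assumes `verdi` is on the PATH of the active environment. It stops at the first failing command, so a failed repair leaves the daemon stopped for inspection.

```python
import shutil
import subprocess

# The three commands suggested above, in the order they should run.
REPAIR_SEQUENCE = [
    ["verdi", "daemon", "stop"],
    ["verdi", "process", "repair"],
    ["verdi", "daemon", "start"],
]


def run_sequence(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns a list of (command, returncode, combined output) tuples.
    The `runner` argument exists so the function can be tried with a
    stand-in instead of really calling verdi.
    """
    results = []
    for command in commands:
        done = runner(command, capture_output=True, text=True)
        results.append((" ".join(command), done.returncode, done.stdout + done.stderr))
        if done.returncode != 0:
            break
    return results


if __name__ == "__main__":
    if shutil.which("verdi") is None:
        print("verdi was not found on PATH; activate the AiiDA environment first")
    else:
        for command, code, output in run_sequence(REPAIR_SEQUENCE):
            print(f"$ {command}  (exit code {code})")
            print(output)
```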

How many active processes do you have and how many daemon workers are running (verdi daemon status)?

If they remain stuck, can you please run verdi process report on one of the stuck calcjobs and post the output?

verdi daemon stop
verdi process repair
verdi daemon start

Output is

Profile: default
Stopping the daemon... OK
Warning: There are active processes without process task: {5120, 5380, 5894, 5639, 4872, 5896, 5642, 5898, 5900, 5134, 5902, 5648, 5904, 5906, 3350, 4886, 5401, 5658, 5148, 5664, 4900, 5671, 5927, 5416, 5162, 5679, 4914, 5430, 5687, 5176, 5690, 4928, 5444, 5190, 4942, 5458, 5715, 5204, 4956, 5472, 5218, 4970, 5232, 5744, 5746, 5748, 4984, 5754, 5246, 5763, 5623, 5765, 4998, 5260, 5772, 5522, 5779, 5012, 5274, 5789, 5534, 5791, 5536, 5027, 5288, 5548, 5556, 4789, 5302, 5048, 5567, 4802, 5570, 5316, 5572, 5574, 5064, 5576, 4816, 5330, 5078, 5849, 4830, 5344, 5092, 5863, 4844, 5874, 5106, 5876, 5618, 5366, 5620, 4858, 5883, 5885, 5630}
Warning: Inconsistencies detected between database and RabbitMQ.
Report: Attempting to fix inconsistencies
Report: Revived process `5120`
Report: Revived process `5639`
Report: Revived process `5642`
Report: Revived process `5134`
Report: Revived process `5648`
Report: Revived process `5658`
Report: Revived process `5148`
Report: Revived process `5664`
Report: Revived process `5671`
Report: Revived process `5162`
Report: Revived process `5679`
Report: Revived process `5687`
Report: Revived process `5176`
Report: Revived process `5690`
Report: Revived process `5190`
Report: Revived process `5715`
Report: Revived process `5204`
Report: Revived process `5218`
Report: Revived process `5232`
Report: Revived process `5744`
Report: Revived process `5746`
Report: Revived process `5748`
Report: Revived process `5754`
Report: Revived process `5246`
Report: Revived process `5763`
Report: Revived process `5765`
Report: Revived process `5260`
Report: Revived process `5772`
Report: Revived process `5779`
Report: Revived process `5274`
Report: Revived process `5789`
Report: Revived process `5791`
Report: Revived process `5288`
Report: Revived process `4789`
Report: Revived process `5302`
Report: Revived process `4802`
Report: Revived process `5316`
Report: Revived process `4816`
Report: Revived process `5330`
Report: Revived process `5849`
Report: Revived process `4830`
Report: Revived process `5344`
Report: Revived process `5863`
Report: Revived process `4844`
Report: Revived process `5874`
Report: Revived process `5876`
Report: Revived process `5366`
Report: Revived process `4858`
Report: Revived process `5883`
Report: Revived process `5885`
Report: Revived process `5380`
Report: Revived process `5894`
Report: Revived process `4872`
Report: Revived process `5896`
Report: Revived process `5898`
Report: Revived process `5900`
Report: Revived process `5902`
Report: Revived process `5904`
Report: Revived process `5906`
Report: Revived process `3350`
Report: Revived process `4886`
Report: Revived process `5401`
Report: Revived process `4900`
Report: Revived process `5927`
Report: Revived process `5416`
Report: Revived process `4914`
Report: Revived process `5430`
Report: Revived process `4928`
Report: Revived process `5444`
Report: Revived process `4942`
Report: Revived process `5458`
Report: Revived process `4956`
Report: Revived process `5472`
Report: Revived process `4970`
Report: Revived process `4984`
Report: Revived process `4998`
Report: Revived process `5522`
Report: Revived process `5012`
Report: Revived process `5534`
Report: Revived process `5536`
Report: Revived process `5027`
Report: Revived process `5548`
Report: Revived process `5556`
Report: Revived process `5048`
Report: Revived process `5567`
Report: Revived process `5570`
Report: Revived process `5572`
Report: Revived process `5574`
Report: Revived process `5064`
Report: Revived process `5576`
Report: Revived process `5078`
Report: Revived process `5092`
Report: Revived process `5106`
Report: Revived process `5618`
Report: Revived process `5620`
Report: Revived process `5623`
Report: Revived process `5630`
Starting the daemon with 1 workers... OK
verdi daemon status
Profile: default
Daemon is running as PID 158564 since 2024-02-27 14:43:12
Active workers [1]:
   PID    MEM %    CPU %  started
------  -------  -------  -------------------
158580    0.815        0  2024-02-27 14:43:12

And that now moved the Process status to retrieve :slightly_smiling_face:
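With this many processes it is hard to tell by eye whether every PK listed in the warning was also revived. A minimal sketch for checking that from the pasted text follows; it assumes the message format shown in the output above, and `parse_repair_output` is a name made up for this example, not an AiiDA function.

```python
import re


def parse_repair_output(text):
    """Return (flagged, revived) sets of process PKs from the repair output.

    `flagged` comes from the "active processes without process task" warning,
    `revived` from the individual "Revived process" report lines.
    """
    flagged = set()
    match = re.search(r"without process task: \{([^}]*)\}", text)
    if match:
        flagged = {int(pk) for pk in re.findall(r"\d+", match.group(1))}
    revived = {int(pk) for pk in re.findall(r"Revived process `(\d+)`", text)}
    return flagged, revived


if __name__ == "__main__":
    # Shortened, made-up excerpt in the same format as the log above.
    sample = (
        "Warning: There are active processes without process task: {5120, 5380, 5894}\n"
        "Report: Revived process `5120`\n"
        "Report: Revived process `5380`\n"
    )
    flagged, revived = parse_repair_output(sample)
    print("flagged but not revived:", sorted(flagged - revived))
```

Any PK that shows up as flagged but not revived would be a good candidate for verdi process report.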


Once I did verdi daemon stop, verdi process repair, verdi daemon start, all the processes called “HybridizationCalculation” which showed “QUEUED” in verdi but had actually finished running on the remote machine showed:
[image]

The ones with error code [100] show:

verdi process show 5906
Property     Value
-----------  ------------------------------------------------------------------------
type         HybridizationCalculation
state        Finished [100] The process did not have the required `retrieved` output.
pk           5906
uuid         839ed07e-7a6c-40e4-9c9d-613138ed4d38
label
description
ctime        2024-02-26 12:14:53.607658+00:00
mtime        2024-02-27 14:52:44.119861+00:00
computer     [5] tigu

Inputs                      PK    Type
--------------------------  ----  -------------
greens_function
    remote_results_folder   5850  RemoteData
los
    remote_results_folder   1019  RemoteData
code                        103   InstalledCode
energy_grid_parameters      5409  Dict
greens_function_parameters  5408  Dict
matsubara_grid_size         5411  Int
parameters                  5905  Dict
temperature                 5410  Float

Outputs          PK  Type
-------------  ----  ----------
remote_folder  5924  RemoteData

Caller      PK  Type
--------  ----  ------------------------
CALL      5416  CoulombDiamondsWorkChain

I double checked and all the HybridizationCalculations that failed have the required outputs just like the ones that finished successfully. I am not sure what causes some of the calculations to pass while some of them don’t.
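Note that exit status 100 refers to the `retrieved` output node, and the verdi process show output above lists only `remote_folder` under Outputs, so the message appears to be about what AiiDA recorded rather than about which files exist on the remote machine. To check many processes the same way, the text printed by verdi process show can be scanned with a small helper. This is a sketch that assumes the table layout shown above; `output_labels` is a hypothetical name, and it simply takes the first word of each row under Outputs.

```python
def output_labels(show_text):
    """Collect the link labels listed under 'Outputs' in the text printed
    by `verdi process show`."""
    labels = []
    in_outputs = False
    for line in show_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Outputs"):
            in_outputs = True  # header row of the Outputs table
            continue
        if not in_outputs:
            continue
        if not stripped:
            break  # a blank line ends the table
        if set(stripped) <= set("- "):
            continue  # the dashed separator row
        labels.append(stripped.split()[0])
    return labels


if __name__ == "__main__":
    # The Outputs table of process 5906, copied from the output above.
    sample = (
        "Outputs          PK  Type\n"
        "-------------  ----  ----------\n"
        "remote_folder  5924  RemoteData\n"
        "\n"
        "Caller      PK  Type\n"
    )
    labels = output_labels(sample)
    print("outputs:", labels)
    print("has retrieved output:", "retrieved" in labels)
```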

Could it be because I was using 4 daemon workers, while the verdi daemon stop, repair, start procedure starts only 1 (the value I set for daemon.default_workers), and the HybridizationCalculation processes that finished successfully were somehow tied to that one worker?

That is really weird. I think this may have been an artifact of the code upgrade while the processes were alive. There may have been subtle changes in the internal state. We cannot support compatibility for these unreleased changes unfortunately. Apologies for the inconvenience.

Could you please run new calculations and report back if the problem remains the same? My guess is that you won’t see this problem for calculations that are launched now, after the update is done. If the problem persists, please open a new thread, because that would be a real bug.

I am writing here after a while because I would like to know whether this has been merged into the most recent version of aiida-core. I just installed it via pip, but I see that aio-pika v6.8.1 is used instead of ~9.x.

The new engine fixes have not yet been released. Since they are quite impactful, we wanted to have ample time to test them on master before releasing.