Dear all, I am seeing some weird behaviour with a calcfunction I wrote to extract structures. It can generate ~10^3 – 10^4 output nodes, and in the provenance graph something very strange happens:
...
└── FirstWorkChain<5065414> Waiting [2:if_(...)]
    ├── SecondWorkChain<5065417> Finished [0] [2:run_...]
    │   ├── ThirdWorkChain<5065418> Finished [0] [4:if_(...)]
    │   │   ...
    │   ├── get_structures<5065474> Running [0]
    │   └── get_structures<5066148> Finished [0]
    ...
In the SecondWorkChain, where get_structures is called, there is no loop that could call the calcfunction a second time. It seems that the first calcfunction call finishes successfully but somehow does not lock completely (I guess it is still creating the output links). At this point a second call is somehow made to the same function, and it finishes correctly. This is very weird, and I have seen it multiple times, sometimes with more than 2 calls (even 10 at times).
Is there a problem or a known limitation with having so many output nodes? How can I work around it?
Thanks a lot for any pointer.
Hi Lorenzo! I don’t think there should be a limitation. Which version of AiiDA-core and main dependencies are you using? Which backend? (PSQL+dostore?) Would you be able to make a reproducible mock example so we can debug more easily? Pinging @agoscinski @geiger_j @jusong.yu
Thanks Giovanni for the help! It is PostgreSQL + RabbitMQ. The AiiDA version I am using is 2.4.0 with a “special patch” from Sebastiaan from a while ago (this is the discussion: Excepted workchains, due to strange error from kiwipy/plumpy?). I don’t know whether that fix has now been implemented in the latest main branch of aiida-core.
The error seems fairly random, though, and depends on the rest of the daemon workload; even when using caching to skip the earlier steps, the behaviour was quite different each time. So I guess it is probably related to the version of aiida-core I am using, and to the fact that I am running quite heavily on my workstation (?).
Mmm, it might be. A few things have been fixed recently in the current main branch. If you could try to reproduce it in a test environment with main (soon to be released as 2.7), that would be great.
This is very likely due to the bugs in the engine that could result in multiple daemon workers running the same task (exactly as in the thread you linked). This has been fixed and should be released with v2.7 if I remember correctly.
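For anyone wanting to check whether the duplication still shows up after upgrading, here is a rough sketch that scans a process tree like the one posted above and reports labels that occur more than once. It is plain Python with no AiiDA calls; the helper names (count_calls, duplicated_labels) and the regular expression are made up for illustration, and the sample text is a shortened copy of the tree from the first post.

```python
import re
from collections import Counter

# Matches entries such as "get_structures<5065474> Running [0]".
# The pattern is an assumption based on the tree shown in this thread.
ENTRY = re.compile(r"(?P<label>\w+)<(?P<pk>\d+)>\s+(?P<state>\w+)")


def count_calls(tree_text):
    """Count how many times each process label appears in a status tree."""
    counts = Counter()
    for line in tree_text.splitlines():
        match = ENTRY.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts


def duplicated_labels(tree_text):
    """Return only the labels that appear more than once in the tree."""
    return {label: n for label, n in count_calls(tree_text).items() if n > 1}


# Shortened copy of the tree from the first post.
SAMPLE = """\
└── FirstWorkChain<5065414> Waiting [2:if_(...)]
    ├── SecondWorkChain<5065417> Finished [0] [2:run_...]
    │   ├── ThirdWorkChain<5065418> Finished [0] [4:if_(...)]
    │   ├── get_structures<5065474> Running [0]
    │   └── get_structures<5066148> Finished [0]
"""

if __name__ == "__main__":
    print(duplicated_labels(SAMPLE))
```

Note that this counts labels across the whole tree rather than per parent, so a workflow that legitimately calls the same function several times would be flagged as well; it is only meant as a quick check for cases like this one, where a single call is expected.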