So, uh. Somebody (I hope inadvertently) decided to push the limits of our AiiDAlab deployment and submitted a WorkChain that is kind of big.
Specifically, `verdi process list` now counts ~18000 processes (and counting!) in various stages of “running”. I tried increasing the number of daemon workers and tweaking various poll intervals, but the overall system is hopelessly stuck, and `verdi process kill` either hangs or times out, saying that the process is unreachable.
While I have some ideas for how to prevent this situation in the future, the question now is: how do I stop this monster? Any advice would be greatly appreciated. (Thoughts and prayers are welcome as well in these trying times.)