Verdi daemon stuck

Hi,

After several attempts to repair the process, the verdi daemon remains stuck.

So far, I have tried:

  • Restarting the daemon with verdi daemon restart
  • Running the repair process again
  • Increasing the number of daemon workers by adding an extra worker

Unfortunately, none of these steps have resolved the issue, and the process remains stuck.

I’m using aiida-core 2.6.3.

Has anyone encountered a similar issue, or does anyone have suggestions on how to troubleshoot this further?

Thanks!

If you are retrieving a huge file, the retrieval itself might simply be very slow. Can you monitor the network usage to see whether it’s really retrieving the cube file?

Just to follow up on @AndresOrtegaGuerrero 's request: we have multiple users with a similar problem. For some reason, the daemon loses track of the processes and they get abandoned. The process repair command does something, but that does not seem to be enough.

I also checked the output of verdi process report for the “abandoned” processes, and I couldn’t get a consistent picture from it. Sometimes there are no report messages, as in Andres's case:

verdi process report 230331
*** 230331 [kp_35_kb_465_471]: CalcJobState.WITHSCHEDULER, scheduler state: JobState.RUNNING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 0 LOG MESSAGES

Sometimes there are many (different user):

+-> ERROR at 2026-06-27 16:34:29.696037+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2271, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 609, in _read_timeout
 |     raise EOFError()
 | EOFError
 |
 | During handling of the above exception, another exception occurred:
 |
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 195, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 181, in updating
 |     await self._update_job_info()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/transports.py", line 87, in do_open
 |     transport.open()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 498, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/client.py", line 421, in connect
 |     t.start_client(timeout=timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 699, in start_client
 |     raise e
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2094, in run
 |     self._check_banner()
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2275, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
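
As far as I understand, paramiko raises "Error reading SSH protocol banner" when it does not receive the server's identification line in time. One way to check this independently of AiiDA and paramiko is to time how long the server takes to send that first line over a raw socket. This is only a minimal sketch: the helper name read_ssh_banner and its default timeout are made up for illustration and are not part of any library.

```python
import socket
import time


def read_ssh_banner(host, port=22, timeout=15.0):
    """Return (first_line, seconds_elapsed) for an SSH server.

    Hypothetical diagnostic helper, not part of AiiDA or paramiko.
    Raises OSError (e.g. a timeout) if the server is unreachable or silent,
    and EOFError if the connection is closed before a full line arrives.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        data = b""
        # Read byte by byte until the first newline (capped at 256 bytes).
        while not data.endswith(b"\n") and len(data) < 256:
            chunk = sock.recv(1)
            if not chunk:
                raise EOFError("connection closed before a banner was received")
            data += chunk
    elapsed = time.monotonic() - start
    return data.decode("ascii", errors="replace").strip(), elapsed
```

Calling read_ssh_banner with the hostname of the affected cluster a few times from the machine running the daemon should show whether the first line (normally starting with "SSH-") arrives quickly, slowly, or not at all. If it is slow or intermittent, the problem is more likely on the network or SSH server side than in the daemon itself.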

It has nothing to do with huge files. For some users, the jobs get stuck for days.

We really need to first try AiiDA 2.8 and the new asyncssh; in our experience this already fixes a lot of issues.
Or, even better, we’re soon going to release support for ZeroMQ in place of RabbitMQ; ZeroMQ seems much more reliable (it’s already on main). @yakutovicha @AndresOrtegaGuerrero how reproducible is the problem? Would it be worth preparing an AiiDAlab image with AiiDA main (so it’s also ready for 2.9) and testing whether this fixes the issue, at least from the command line if the apps require an update? With @edan-bainglass and Moritz we had a similar issue (combined with a slow network and a slow HPC), and the new code solved it.