How to debug a workflow stuck in "Waiting" state

I went through the plumpy code and I can see that we get to the proc.step_until_terminated() await point, after which the worker seems to be stuck.

I am not sure how to continue exploring here.

btw: @sphuber your support is very much appreciated, as always. :pray:
The fact that the LOGGER does not print anything definitely made this more challenging and perhaps should be investigated separately…

EDIT: Looking at top, the worker doesn't seem to be consuming any CPU.
EDIT2: Currently learning to use the Python Debugger (pdb) and stepping through the step_until_terminated() function. Fun!
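
In case it helps anyone: one way to get a pdb prompt inside the worker (a sketch; the path below is from my conda environment, and any other spot in the code works just as well) is to edit the installed plumpy source directly:

# Edit the installed plumpy source, e.g.
# /opt/conda/lib/python3.9/site-packages/plumpy/processes.py, and add a
# breakpoint at the top of Process.step():
import pdb

pdb.set_trace()  # drops into the debugger once the worker reaches step()

Then stop the daemon with verdi daemon stop and start a single worker in the foreground with verdi daemon worker, so that the prompt appears in your terminal.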

Here's the first strange thing when stepping through the process:

(Pdb) where
  /opt/conda/bin/verdi(10)<module>()
-> sys.exit(verdi())
  /opt/conda/lib/python3.9/site-packages/click/core.py(1157)__call__()
-> return self.main(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/click/core.py(1078)main()
-> rv = self.invoke(ctx)
  /opt/conda/lib/python3.9/site-packages/click/core.py(1688)invoke()
-> return _process_result(sub_ctx.command.invoke(sub_ctx))
  /opt/conda/lib/python3.9/site-packages/click/core.py(1688)invoke()
-> return _process_result(sub_ctx.command.invoke(sub_ctx))
  /opt/conda/lib/python3.9/site-packages/click/core.py(1434)invoke()
-> return ctx.invoke(self.callback, **ctx.params)
  /opt/conda/lib/python3.9/site-packages/click/core.py(783)invoke()
-> return __callback(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/aiida/cmdline/utils/decorators.py(73)wrapper()
-> return wrapped(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/aiida/cmdline/commands/cmd_daemon.py(262)worker()
-> start_daemon_worker()
  /opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/worker.py(63)start_daemon_worker()
-> runner.start()
  /opt/conda/lib/python3.9/site-packages/aiida/engine/runners.py(149)start()
-> self._loop.run_forever()
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(73)run_forever()
-> self._run_once()
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(120)_run_once()
-> handle._run()
  /opt/conda/lib/python3.9/asyncio/events.py(80)_run()
-> self._context.run(self._callback, *self._args)
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(196)step()
-> step_orig(task, exc)
  /opt/conda/lib/python3.9/asyncio/tasks.py(256)__step()
-> result = coro.send(None)
  /opt/conda/lib/python3.9/site-packages/plumpy/futures.py(73)run_task()
-> res = await coro()
  /opt/conda/lib/python3.9/site-packages/plumpy/process_comms.py(536)__call__()
-> return await self._continue(communicator, **task.get(TASK_ARGS, {}))
  /opt/conda/lib/python3.9/site-packages/aiida/manage/external/rmq/launcher.py(99)_continue()
-> result = await super()._continue(communicator, pid, nowait, tag)
  /opt/conda/lib/python3.9/site-packages/plumpy/process_comms.py(617)_continue()
-> await proc.step_until_terminated()
  /opt/conda/lib/python3.9/site-packages/plumpy/processes.py(1261)step_until_terminated()
-> await self.step()
> /opt/conda/lib/python3.9/site-packages/plumpy/processes.py(1211)step()
-> if self.paused and self._paused is not None:
(Pdb) self.paused
True
(Pdb)

It seems that when the Process is loaded from the checkpoint, its paused property is True, even though when I load the ProcessNode via load_node, it is False:

load_node(117419).paused
> False
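
(For completeness, here is the cross-check as a small sketch you can paste into a verdi shell; the pk is from my session, and checkpoint is the ProcessNode property that returns the stored, serialized checkpoint, if I read the code correctly.)

from aiida.orm import load_node

node = load_node(117419)  # pk from my session
print(node.paused)        # False: the node attribute says the process is not paused

# The state the worker resumes from is the serialized checkpoint stored on
# the node; if the two views disagree, the stale paused flag lives in here.
print(node.checkpoint is not None)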

The step() coroutine then suspends at the await self._paused line just below, handing control back to the event loop.
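
For readers unfamiliar with these internals: the pattern is essentially the classic asyncio pause gate. Below is a paraphrased, self-contained sketch of the idea, not the actual plumpy source; the process suspends on a future that is only resolved when play() is called.

import asyncio

class PausableProcess:
    # Paraphrased sketch of the pause gate in plumpy's Process.step();
    # names and structure are simplified, not the real implementation.

    def __init__(self):
        self.paused = False
        self._paused = None  # a future while paused, None otherwise

    def pause(self):
        self.paused = True
        self._paused = asyncio.get_event_loop().create_future()

    def play(self):
        self.paused = False
        if self._paused is not None and not self._paused.done():
            self._paused.set_result(None)  # wake up whoever is awaiting it
        self._paused = None

    async def step(self):
        # The line the worker was parked on: if the loaded state says
        # paused, the coroutine suspends here until play() resolves the
        # future. A paused=True flag restored from a checkpoint therefore
        # parks the process exactly at this point.
        if self.paused and self._paused is not None:
            await self._paused
        return 'stepped'

async def main():
    proc = PausableProcess()
    proc.pause()
    asyncio.get_running_loop().call_later(0.1, proc.play)  # simulate verdi process play
    print(await proc.step())  # suspends on the future, then resumes

asyncio.run(main())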

Further, I added a breakpoint in the on_played function and ran verdi process play in a separate interpreter. I could step over the whole unpausing process, seemingly without issues, but then nothing else seems to happen. :thinking:
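
(Side note, in case it is useful: the breakpoint can also be set from the running pdb session instead of editing the source, since pdb's break command accepts a function and evaluates it in the current frame. Something like this, from the frame where self is the process:)

(Pdb) break self.on_played
(Pdb) continue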

Just dumping state here in the hope that this debugging strategy will perhaps be useful to someone in the future. Marching on…

My debugging was cut short — while playing with pdb I probably interrupted the daemon worker at the wrong time and the task excepted. :anguished:

I still consider this a useful exercise; I learned a bunch about pdb, aiida-core internals, async internals, and how frea**ing annoying it is to debug async programs. :sweat_smile: Thanks @sphuber for the help. :clap:

My plan was to turn on caching and hope for the best, but then I realized that it wouldn't work anyway, since the aiida-core version changed and the respective PR is not merged yet: "Loose the hashing computing without include core version" by unkcpz (Pull Request #6215, aiidateam/aiida-core, GitHub).

I now understand why this happened, but perhaps it would be nice if, for a worker started in the foreground, the logging messages were sent both to the logfile and to stdout. I found the current behaviour really confusing.
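
To illustrate the suggestion, here is a minimal sketch; the logger name 'aiida' is the one aiida-core uses, but the handler and format below are just illustrative.

import logging
import sys

# Mirror the daemon worker's log records to stdout in addition to the
# logfile by attaching an extra stream handler to the 'aiida' logger.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
logging.getLogger('aiida').addHandler(handler)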

Thanks for sharing the details of your debugging. It is indeed not trivial to debug that code. It does seem that the process gets loaded, but then gets stuck somewhere. I have a strong suspicion that this is because the checkpoint is incompatible with the current version. The logic of the workchain internals has been changed (in plumpy), and it stores certain markers in the checkpoint. I suspect that your checkpoint is missing some of the markers that are expected by the code in newer versions. Anyway, there is not much use in speculating further; let's hope this was a one-off.
