How to debug a workflow stuck in "Waiting" state

I went through the plumpy code and I can see that we get to the proc.step_until_terminated() await point, after which the worker seems to be stuck.

I am not sure how to continue exploring here.

btw: @sphuber your support is very much appreciated, as always. :pray:
The fact that the LOGGER does not print anything definitely made this more challenging and perhaps should be investigated separately…

EDIT: Looking at top, the worker doesn't seem to be consuming any CPU.
EDIT2: Currently learning to use the Python Debugger (pdb) and stepping through the step_until_terminated() function. Fun!
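
In case it helps anyone: one way to get a pdb prompt inside the worker (a sketch; the path below is from my conda environment, and any other spot in the code works just as well) is to edit the installed plumpy source directly:

# Edit the installed plumpy source, e.g.
# /opt/conda/lib/python3.9/site-packages/plumpy/processes.py, and add a
# breakpoint at the top of Process.step():
import pdb

pdb.set_trace()  # drops into the debugger once the worker reaches step()

Then stop the daemon with verdi daemon stop and start a single worker in the foreground with verdi daemon worker, so that the prompt appears in your terminal.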

Here's the first strange thing when stepping through the process:

(Pdb) where
  /opt/conda/bin/verdi(10)<module>()
-> sys.exit(verdi())
  /opt/conda/lib/python3.9/site-packages/click/core.py(1157)__call__()
-> return self.main(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/click/core.py(1078)main()
-> rv = self.invoke(ctx)
  /opt/conda/lib/python3.9/site-packages/click/core.py(1688)invoke()
-> return _process_result(sub_ctx.command.invoke(sub_ctx))
  /opt/conda/lib/python3.9/site-packages/click/core.py(1688)invoke()
-> return _process_result(sub_ctx.command.invoke(sub_ctx))
  /opt/conda/lib/python3.9/site-packages/click/core.py(1434)invoke()
-> return ctx.invoke(self.callback, **ctx.params)
  /opt/conda/lib/python3.9/site-packages/click/core.py(783)invoke()
-> return __callback(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/aiida/cmdline/utils/decorators.py(73)wrapper()
-> return wrapped(*args, **kwargs)
  /opt/conda/lib/python3.9/site-packages/aiida/cmdline/commands/cmd_daemon.py(262)worker()
-> start_daemon_worker()
  /opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/worker.py(63)start_daemon_worker()
-> runner.start()
  /opt/conda/lib/python3.9/site-packages/aiida/engine/runners.py(149)start()
-> self._loop.run_forever()
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(73)run_forever()
-> self._run_once()
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(120)_run_once()
-> handle._run()
  /opt/conda/lib/python3.9/asyncio/events.py(80)_run()
-> self._context.run(self._callback, *self._args)
  /opt/conda/lib/python3.9/site-packages/nest_asyncio.py(196)step()
-> step_orig(task, exc)
  /opt/conda/lib/python3.9/asyncio/tasks.py(256)__step()
-> result = coro.send(None)
  /opt/conda/lib/python3.9/site-packages/plumpy/futures.py(73)run_task()
-> res = await coro()
  /opt/conda/lib/python3.9/site-packages/plumpy/process_comms.py(536)__call__()
-> return await self._continue(communicator, **task.get(TASK_ARGS, {}))
  /opt/conda/lib/python3.9/site-packages/aiida/manage/external/rmq/launcher.py(99)_continue()
-> result = await super()._continue(communicator, pid, nowait, tag)
  /opt/conda/lib/python3.9/site-packages/plumpy/process_comms.py(617)_continue()
-> await proc.step_until_terminated()
  /opt/conda/lib/python3.9/site-packages/plumpy/processes.py(1261)step_until_terminated()
-> await self.step()
> /opt/conda/lib/python3.9/site-packages/plumpy/processes.py(1211)step()
-> if self.paused and self._paused is not None:
(Pdb) self.paused
True
(Pdb)

It seems that when the Process is loaded from the checkpoint, its paused property is True, even though when I load the ProcessNode via load_node, it is False:

load_node(117419).paused
> False
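
(For completeness, here is the cross-check as a small sketch you can paste into a verdi shell; the pk is from my session, and checkpoint is the ProcessNode property that returns the stored, serialized checkpoint, if I read the code correctly.)

from aiida.orm import load_node

node = load_node(117419)  # pk from my session
print(node.paused)        # False: the node attribute says the process is not paused

# The state the worker resumes from is the serialized checkpoint stored on
# the node; if the two views disagree, the stale paused flag lives in here.
print(node.checkpoint is not None)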

The step() coroutine then suspends at the await self._paused line just below, handing control back to the event loop.
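
For readers unfamiliar with these internals: the pattern is essentially the classic asyncio pause gate. Below is a paraphrased, self-contained sketch of the idea, not the actual plumpy source; the process suspends on a future that is only resolved when play() is called.

import asyncio

class PausableProcess:
    # Paraphrased sketch of the pause gate in plumpy's Process.step();
    # names and structure are simplified, not the real implementation.

    def __init__(self):
        self.paused = False
        self._paused = None  # a future while paused, None otherwise

    def pause(self):
        self.paused = True
        self._paused = asyncio.get_event_loop().create_future()

    def play(self):
        self.paused = False
        if self._paused is not None and not self._paused.done():
            self._paused.set_result(None)  # wake up whoever is awaiting it
        self._paused = None

    async def step(self):
        # The line the worker was parked on: if the loaded state says
        # paused, the coroutine suspends here until play() resolves the
        # future. A paused=True flag restored from a checkpoint therefore
        # parks the process exactly at this point.
        if self.paused and self._paused is not None:
            await self._paused
        return 'stepped'

async def main():
    proc = PausableProcess()
    proc.pause()
    asyncio.get_running_loop().call_later(0.1, proc.play)  # simulate verdi process play
    print(await proc.step())  # suspends on the future, then resumes

asyncio.run(main())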

Further, I added a breakpoint in the on_played function and ran verdi process play in a separate interpreter. I could step over the whole unpausing process, seemingly without issues, but then nothing else seems to happen. :thinking:
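
(Side note, in case it is useful: the breakpoint can also be set from the running pdb session instead of editing the source, since pdb's break command accepts a function and evaluates it in the current frame. Something like this, from the frame where self is the process:)

(Pdb) break self.on_played
(Pdb) continue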

Just dumping state here in the hope that this debugging strategy will perhaps be useful to someone in the future. Marching on…

My debugging was cut short — while playing with pdb I probably interrupted the daemon worker at the wrong time and the task excepted. :anguished:

I still consider this a useful exercise; I learned a bunch about pdb, aiida-core internals, async internals, and how frea**ing annoying it is to debug async programs. :sweat_smile: Thanks @sphuber for the help. :clap:

My plan was to turn on caching and hope for the best, but then I realized that it wouldn't work anyway, since the aiida-core version changed and the respective PR is not merged yet: "Loose the hashing computing without include core version" by unkcpz (Pull Request #6215, aiidateam/aiida-core, GitHub).

I now understand why this happened, but perhaps it would be nice if, for a worker started in the foreground, the logging messages were sent both to the logfile and to stdout. I found the current behaviour really confusing.
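
To illustrate the suggestion, here is a minimal sketch; the logger name 'aiida' is the one aiida-core uses, but the handler and format below are just illustrative.

import logging
import sys

# Mirror the daemon worker's log records to stdout in addition to the
# logfile by attaching an extra stream handler to the 'aiida' logger.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
logging.getLogger('aiida').addHandler(handler)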

Thanks for sharing the details of your debugging. It is indeed not trivial to debug that code. It does seem that the process gets loaded, but then gets stuck somewhere. I have a strong suspicion that this is because the checkpoint is incompatible with the current version. The logic of the workchain internals has been changed (in plumpy), and it stores certain markers in the checkpoint. I suspect that your checkpoint is missing some of the markers that are expected by the code in newer versions. Anyway, there is not much use in speculating further; let's hope this was a one-off.
