WorkChain exception inside AiidaLab container

Dear Community,

I am trying to run a workchain that has several steps. The workchain seems to work fine with a small system on localhost (I am working in the AiiDAlab container), but when I tried a bigger system on a cluster I got this error:

2023-11-10 11:08:44 [6174 | REPORT]: [23510|DielectricWorkChain|on_except]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 324, in transition_to
    self._enter_next_state(new_state)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 300, in _fire_state_event
    callback(self, hook, state)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 331, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 426, in on_entered
    super().on_entered(from_state)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 714, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/communications.py", line 175, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 258, in broadcast_send
    result = self._loop_scheduler.await_(
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 522, in broadcast_send
    result = await publisher.broadcast_send(body, sender, subject, correlation_id)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 66, in broadcast_send
    return await self.publish(message, routing_key=defaults.BROADCAST_TOPIC, mandatory=False)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/messages.py", line 209, in publish
    result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/exchange.py", line 233, in publish
    return await asyncio.wait_for(
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 508, in basic_publish
    async with self.lock:
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 90, in lock
    raise ChannelInvalidStateError("%r closed" % self)
aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed

I was wondering whether any of you could help me figure out what the cause could be. Sometimes I need to do a verdi daemon restart for the workchain to run, since it can remain in the created status for a while. I am using aiida-core 2.3.1.

Hi @AndresOrtegaGuerrero! Are you submitting a lot of processes within a short time frame? The error seems similar to the one raised in this issue:

Hi @mbercx, I think so. I am running the DielectricWorkChain from aiida-vibroscopy.

Thanks for the report @AndresOrtegaGuerrero. I had a look at the code and this exception can be thrown when the connection is temporarily unavailable. The code is trying to broadcast a state change of the process to all subscribers. Although useful, it is not critical if this broadcast fails. Listening processes have a polling mechanism as backup. So we definitely shouldn’t let this exception topple the entire process. I looked in the code of plumpy and we already catch a ConnectionClosed exception. I opened a PR that simply also catches this other exception. This should hopefully fix the issue. @mbercx could you please review the PR? Catch `ChannelInvalidStateError` in process state change by sphuber · Pull Request #278 · aiidateam/plumpy · GitHub

Once merged in, I will make a release of plumpy (there is also another feature already that I want to release). I will ping here when it is available so you can update and hopefully the problem is gone.
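
As a rough illustration of the idea behind the fix (not the actual plumpy code), the sketch below treats the state-change broadcast as best effort: it catches a closed connection or an invalid channel, logs a warning, and lets the process continue. The two exception classes, `FlakyCommunicator`, and `broadcast_state_change` are hypothetical stand-ins so that the example runs without RabbitMQ.

```python
import logging

logger = logging.getLogger(__name__)


class ConnectionClosed(Exception):
    """Hypothetical stand-in for the connection-closed error that was already caught."""


class ChannelInvalidStateError(Exception):
    """Hypothetical stand-in for aiormq.exceptions.ChannelInvalidStateError."""


class FlakyCommunicator:
    """Hypothetical communicator whose channel has been closed."""

    def broadcast_send(self, body, sender, subject):
        raise ChannelInvalidStateError('<Channel: "4"> closed')


def broadcast_state_change(communicator, pid, from_state, to_state):
    """Try to broadcast a state change; return False instead of raising on failure."""
    subject = f'state_changed.{from_state}.{to_state}'
    try:
        communicator.broadcast_send(body=None, sender=pid, subject=subject)
    except (ConnectionClosed, ChannelInvalidStateError):
        # The broadcast is not critical: listeners can fall back to polling.
        logger.warning(
            'Process<%s>: no connection available to broadcast state change from %s to %s',
            pid, from_state, to_state,
        )
        return False
    return True


delivered = broadcast_state_change(FlakyCommunicator(), 23510, 'running', 'excepted')
```

Here `delivered` ends up `False`, but no exception reaches the caller, so the process itself would keep running.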

Ok, the release is done. You can run pip install plumpy==0.21.9 and then verdi daemon restart --reset. Then try to run your workchain again. Please let us know here how it goes.

Thank you @sphuber, I tried what you suggested.
This is the output from the daemon logs:

    res = await coro()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/process_comms.py", line 536, in __call__
    return await self._continue(communicator, **task.get(TASK_ARGS, {}))
  File "/opt/conda/lib/python3.9/site-packages/aiida/manage/external/rmq/launcher.py", line 87, in _continue
    return future.result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
aiida.engine.exceptions.PastException: aiormq.exceptions.ChannelInvalidStateError: writer is None

The workchain excepted again, and so did some of its child processes.
Some of them show this:

25187  3h ago     PwCalculation    ⏸ Waiting        Pausing after failed transport task: submit_calculation failed 5 times consecutively

What are the timestamps of that exception though? Could they not be exceptions from before the change? Is the calculation with pk 25187 one you launched after having installed the new version and restarted the daemon, or did that one already exist? What is the output of verdi process report 25187?

Hi @sphuber , this is the last part of the output

+-> WARNING at 2023-11-10 16:55:22.169019+00:00
 | maximum attempts 5 of calling do_upload, exceeded
+-> ERROR at 2023-11-10 16:59:34.390073+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
 |     return execmanager.submit_calculation(node, transport)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 379, in submit_calculation
 |     result = scheduler.submit_from_script(workdir, submit_script_filename)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
 |     return self._parse_submit_output(*result)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
 |     raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
 | aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: Unable to open file _aiidasubmit.sh
 | 
+-> ERROR at 2023-11-10 17:00:48.109295+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
 |     return execmanager.submit_calculation(node, transport)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 379, in submit_calculation
 |     result = scheduler.submit_from_script(workdir, submit_script_filename)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
 |     return self._parse_submit_output(*result)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
 |     raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
 | aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: Unable to open file _aiidasubmit.sh
 | 
+-> ERROR at 2023-11-10 17:01:57.093854+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2271, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 609, in _read_timeout
 |     raise EOFError()
 | EOFError
 | 
 | During handling of the above exception, another exception occurred:
 | 
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 145, in do_submit
 |     transport = await cancellable.with_interrupt(request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 94, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 498, in open
 |     proxy_client.connect(proxy['host'], **proxy_connargs)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/client.py", line 421, in connect
 |     t.start_client(timeout=timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 699, in start_client
 |     raise e
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2094, in run
 |     self._check_banner()
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2275, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
+-> ERROR at 2023-11-10 17:03:22.373291+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
 |     return execmanager.submit_calculation(node, transport)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 379, in submit_calculation
 |     result = scheduler.submit_from_script(workdir, submit_script_filename)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
 |     return self._parse_submit_output(*result)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
 |     raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
 | aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: Unable to open file _aiidasubmit.sh
 | 
+-> ERROR at 2023-11-10 17:07:08.424598+00:00
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
 |     return execmanager.submit_calculation(node, transport)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 379, in submit_calculation
 |     result = scheduler.submit_from_script(workdir, submit_script_filename)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
 |     return self._parse_submit_output(*result)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
 |     raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
 | aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: Unable to open file _aiidasubmit.sh
 | 
+-> WARNING at 2023-11-10 17:07:08.434079+00:00
 | maximum attempts 5 of calling do_submit, exceeded

I think there is a problem here that is not related but may have to do with the updating of the code for an existing calculation. For this calculation, apparently the _aiidasubmit.sh script was not created. There is no way to recover from this, and we should simply kill and delete this calculation.

Could you simply please try to launch a new calculation/workchain and see if that works?

Hi @sphuber, I just tried again and the workchain excepted:

2023-11-13 09:59:13 [7059 | WARNING]: Process<25288>: no connection available to broadcast state change from running to excepted
2023-11-13 09:59:13 [7060 |   ERROR]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 888, in on_close
    cleanup()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/communications.py", line 144, in remove_rpc_subscriber
    return self._communicator.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 221, in remove_rpc_subscriber
    return self._loop_scheduler.await_(self._communicator.remove_rpc_subscriber(identifier))
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 487, in remove_rpc_subscriber
    await msg_subscriber.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 140, in remove_rpc_subscriber
    await rpc_queue.cancel(identifier)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/robust_queue.py", line 140, in cancel
    result = await super().cancel(consumer_tag, timeout, nowait)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/queue.py", line 264, in cancel
    return await asyncio.wait_for(
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 395, in basic_cancel
    return await self.rpc(
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None

2023-11-13 09:59:23 [7061 |  REPORT]: [25288|DielectricWorkChain|on_terminated]: cleaned remote folders of calculations: 25301 25432 25653

I will explore the options within the workchains to delay the submission of the SCF calculations, and also try running them serially, to see if the issue is resolved.

Hmm. Did you launch just a single workchain? How many processes does that spawn?

There seems to be something really off with your connection to RabbitMQ. Where is RabbitMQ running? Can you report the output of verdi status?

Hi @sphuber ,

This is my status, no issue with RabbitMQ:

 ✔ version:     AiiDA v2.3.1
 ✔ config:      /home/jovyan/.aiida
 ✔ profile:     default
 ✔ storage:     Storage for 'default' [open] @ postgresql://aiida:***@localhost:5432/aiida_db / DiskObjectStoreRepository: 406d090665c941ef98807cc2109af721 | /home/jovyan/.aiida/repository/default/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.9.13 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 253

By its nature, the WorkChain usually launches several WorkChains:

HarmonicWorkChain<25278> Finished [401] [2:inspect_processes]
    ├── generate_preprocess_data<25279> Finished [0]
    ├── PhononWorkChain<25284> Finished [0] [7:if_(should_run_phonopy)(1:inspect_phonopy)]
    │   ├── generate_preprocess_data<25289> Finished [0]
    │   ├── get_supercell<25296> Finished [0]
    │   ├── create_kpoints_from_distance<25298> Finished [0]
    │   ├── PwBaseWorkChain<25304> Finished [0] [3:results]
    │   │   └── PwCalculation<25307> Finished [0]
    │   ├── get_supercells_with_displacements<25321> Finished [0]
    │   ├── PwBaseWorkChain<25359> Finished [0] [3:results]
    │   │   └── PwCalculation<25435> Finished [0]
    │   ├── PwBaseWorkChain<25361> Finished [0] [3:results]
    │   │   └── PwCalculation<25438> Finished [0]
    │   ├── PwBaseWorkChain<25363> Finished [0] [3:results]
    │   │   └── PwCalculation<25441> Finished [0]
    │   ├── PwBaseWorkChain<25365> Finished [0] [3:results]
    │   │   └── PwCalculation<25444> Finished [0]
    │   ├── PwBaseWorkChain<25367> Finished [0] [3:results]
    │   │   └── PwCalculation<25447> Finished [0]
    │   ├── PwBaseWorkChain<25369> Finished [0] [3:results]
    │   │   └── PwCalculation<25450> Finished [0]
    │   ├── PwBaseWorkChain<25371> Finished [0] [3:results]
    │   │   └── PwCalculation<25453> Finished [0]
    │   ├── PwBaseWorkChain<25373> Finished [0] [3:results]
    │   │   └── PwCalculation<25456> Finished [0]
    │   ├── PwBaseWorkChain<25375> Finished [0] [3:results]
    │   │   └── PwCalculation<25459> Finished [0]
    │   ├── PwBaseWorkChain<25377> Finished [0] [3:results]
    │   │   └── PwCalculation<25462> Finished [0]
    │   ├── PwBaseWorkChain<25379> Finished [0] [3:results]
    │   │   └── PwCalculation<25465> Finished [0]
    │   ├── PwBaseWorkChain<25381> Finished [0] [3:results]
    │   │   └── PwCalculation<25468> Finished [0]
    │   ├── PwBaseWorkChain<25383> Finished [0] [3:results]
    │   │   └── PwCalculation<25471> Finished [0]
    │   ├── PwBaseWorkChain<25385> Finished [0] [3:results]
    │   │   └── PwCalculation<25474> Finished [0]
    │   ├── PwBaseWorkChain<25387> Finished [0] [3:results]
    │   │   └── PwCalculation<25477> Finished [0]
    │   ├── PwBaseWorkChain<25389> Finished [0] [3:results]
    │   │   └── PwCalculation<25480> Finished [0]
    │   ├── PwBaseWorkChain<25391> Finished [0] [3:results]
    │   │   └── PwCalculation<25483> Finished [0]
    │   ├── PwBaseWorkChain<25393> Finished [0] [3:results]
    │   │   └── PwCalculation<25486> Finished [0]
    │   ├── PwBaseWorkChain<25395> Finished [0] [3:results]
    │   │   └── PwCalculation<25489> Finished [0]
    │   ├── PwBaseWorkChain<25397> Finished [0] [3:results]
    │   │   └── PwCalculation<25492> Finished [0]
    │   ├── PwBaseWorkChain<25399> Finished [0] [3:results]
    │   │   └── PwCalculation<25495> Finished [0]
    │   ├── PwBaseWorkChain<25401> Finished [0] [3:results]
    │   │   └── PwCalculation<25498> Finished [0]
    │   ├── PwBaseWorkChain<25403> Finished [0] [3:results]
    │   │   └── PwCalculation<25501> Finished [0]
    │   ├── PwBaseWorkChain<25405> Finished [0] [3:results]
    │   │   └── PwCalculation<25504> Finished [0]
    │   ├── PwBaseWorkChain<25407> Finished [0] [3:results]
    │   │   └── PwCalculation<25507> Finished [0]
    │   ├── PwBaseWorkChain<25409> Finished [0] [3:results]
    │   │   └── PwCalculation<25510> Finished [0]
    │   ├── PwBaseWorkChain<25411> Finished [0] [3:results]
    │   │   └── PwCalculation<25513> Finished [0]
    │   ├── PwBaseWorkChain<25413> Finished [0] [3:results]
    │   │   └── PwCalculation<25516> Finished [0]
    │   ├── PwBaseWorkChain<25415> Finished [0] [3:results]
    │   │   └── PwCalculation<25519> Finished [0]
    │   ├── PwBaseWorkChain<25417> Finished [0] [3:results]
    │   │   └── PwCalculation<25522> Finished [0]
    │   ├── PwBaseWorkChain<25419> Finished [0] [3:results]
    │   │   └── PwCalculation<25525> Finished [0]
    │   ├── PwBaseWorkChain<25421> Finished [0] [3:results]
    │   │   └── PwCalculation<25528> Finished [0]
    │   ├── PwBaseWorkChain<25423> Finished [0] [3:results]
    │   │   └── PwCalculation<25531> Finished [0]
    │   ├── PwBaseWorkChain<25425> Finished [0] [3:results]
    │   │   └── PwCalculation<25534> Finished [0]
    │   ├── PwBaseWorkChain<25427> Finished [0] [3:results]
    │   │   └── PwCalculation<25537> Finished [0]
    │   ├── PwBaseWorkChain<25429> Finished [0] [3:results]
    │   │   └── PwCalculation<25540> Finished [0]
    │   ├── generate_phonopy_data<25740> Finished [0]
    │   └── PhonopyCalculation<25742> Finished [0]
    └── DielectricWorkChain<25288> Excepted [8:while_(should_run_electric_field_scfs)]
        ├── create_kpoints_from_distance<25290> Finished [0]
        ├── PwBaseWorkChain<25295> Finished [0] [3:results]
        │   └── PwCalculation<25301> Finished [0]
        ├── PwBaseWorkChain<25320> Finished [0] [3:results]
        │   └── PwCalculation<25432> Finished [0]
        ├── compute_critical_electric_field<25642> Finished [0]
        ├── get_accuracy_from_critical_field<25644> Finished [0]
        ├── get_electric_field_step<25646> Finished [0]
        ├── PwBaseWorkChain<25650> Finished [0] [3:results]
        │   └── PwCalculation<25653> Finished [0]
        └── PwBaseWorkChain<25752> Created

Hi @AndresOrtegaGuerrero , please use RabbitMQ version < 3.8.15.
Please check this issue, and the doc.

@Xing is right that in most cases it is preferred to use RabbitMQ < 3.8.15, unless the server is configured as mentioned in AiiDA’s docs. Then it is fine to use more modern versions. But regardless of the version, the problem we see here should not be related to the RabbitMQ version.
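
To see on which side of that threshold a given server falls, a small helper can compare version strings numerically. This is only a sketch: `parse_version` and `needs_rabbitmq_configuration` are hypothetical names, not part of AiiDA, and they encode nothing more than the 3.8.15 threshold mentioned above.

```python
def parse_version(version):
    """Turn a dotted version string such as '3.9.13' into a tuple of integers."""
    return tuple(int(part) for part in version.split('.'))


def needs_rabbitmq_configuration(version, threshold='3.8.15'):
    """Return True if the version is at or above the threshold discussed in this thread."""
    return parse_version(version) >= parse_version(threshold)


# Version reported by `verdi status` earlier in this thread.
reported = needs_rabbitmq_configuration('3.9.13')
# An older release, below the threshold.
older = needs_rabbitmq_configuration('3.8.14')
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating '3.10.0' as older than '3.8.15'.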

The last exception you posted is similar in nature to the one I fixed, but just in a different place. I will try to patch that as well and try and find other locations where this could occur, hypothetically. But it is a bit difficult to spot them without actually running into the problem.

@sphuber I tried again, with no luck:

2023-11-14 14:27:25 [7279 | WARNING]: Process<26839>: no connection available to broadcast state change from running to excepted
2023-11-14 14:27:25 [7280 |   ERROR]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 888, in on_close
    cleanup()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/communications.py", line 144, in remove_rpc_subscriber
    return self._communicator.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 221, in remove_rpc_subscriber
    return self._loop_scheduler.await_(self._communicator.remove_rpc_subscriber(identifier))
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 487, in remove_rpc_subscriber
    await msg_subscriber.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 140, in remove_rpc_subscriber
    await rpc_queue.cancel(identifier)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/robust_queue.py", line 140, in cancel
    result = await super().cancel(consumer_tag, timeout, nowait)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/queue.py", line 264, in cancel
    return await asyncio.wait_for(
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 395, in basic_cancel
    return await self.rpc(
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None

Should I downgrade RabbitMQ?

As I mentioned before, I don’t think the version of RabbitMQ is the cause here. The problem is that at the end of the process, the cleanup method is called. This tries to remove itself as an rpc subscriber, which it needs to do over the connection to RabbitMQ. This connection fails, causing the exception. The thing I don’t understand is that the cleanup() call is wrapped in a try-except block (see here https://github.com/aiidateam/plumpy/blob/ff5770f55da9974b693bed5e731c211ee47c39cd/src/plumpy/processes.py#L888 ). It should catch the exception and simply log it, but not let the process fall over as is happening in your case. I am not sure why this is not being caught. If I can figure that out, we can fix the problem, but I need more time to look at it.
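
The pattern described here, in which a failing cleanup is logged rather than allowed to bring the process down, looks roughly like the sketch below. It is a simplified illustration and not the plumpy source; `run_cleanups` and `failing_cleanup` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


def run_cleanups(cleanups):
    """Call every cleanup callback, logging failures instead of propagating them."""
    failures = 0
    for cleanup in cleanups:
        try:
            cleanup()
        except Exception:  # Deliberately broad: a failed cleanup must not topple the caller.
            logger.exception('Exception calling cleanup method %s', cleanup)
            failures += 1
    return failures


def failing_cleanup():
    """Hypothetical cleanup that fails because the connection is gone."""
    raise RuntimeError('writer is None')


completed = []
failed = run_cleanups([failing_cleanup, lambda: completed.append('done')])
```

With this structure the traceback still shows up in the log at ERROR level, but the second cleanup runs and the caller continues.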

There is 1 other thing you can try in the meantime. I have an open branch that updates aio-pika and aiormq (the libraries that are used to connect to RabbitMQ) to newer versions that are supposed to be more stable. It would be great if you could give that branch a go.

Doing so is simple. You just need to do the following:

  • activate your virtual environment (conda or other)
  • git clone https://github.com/sphuber/aiida-core
  • cd aiida-core
  • git checkout fix/bump-engine-dependencies
  • pip install -e .
  • verdi daemon restart --reset

Now you can relaunch a new workchain and see if that helps. It would be of great help to see if this reduces the problem since you seem to have a case that is so reproducible.

@sphuber thanks for the help. I am currently working in the aiidalab-launch container, and since I am working with the QE App (latest version), the aiida-core branch you shared is incompatible. I am going to run locally to see if it works there. One thing I noticed is that I only get the exception in one specific workchain (https://github.com/bastonero/aiida-vibroscopy/blob/main/src/aiida_vibroscopy/workflows/dielectric/base.py).

I will test locally whether the workchain still fails, in case it is an issue within the container, although in the container I can run the workchain with pw.localhost with no problem.

@sphuber, so I did two tests: one on my personal computer, submitting the job from verdi shell, and the other using the aiidalab container. In both I am using aiida-core AiiDA v2.4… On my personal computer the workchain completes, but in the container I have the same issue:

2023-11-17 10:05:11 [10042 |  REPORT]:   [33693|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-11-17 10:06:11 [10043 | WARNING]: Process<33660>: no connection available to broadcast state change from waiting to running
2023-11-17 10:06:11 [10044 | WARNING]: Process<33660>: no connection available to broadcast state change from running to running
2023-11-17 10:06:11 [10045 |  REPORT]: [33660|DielectricWorkChain|on_except]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/process_states.py", line 228, in execute
    result = self.run_fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/workchains/workchain.py", line 314, in _do_step
    finished, stepper_result = self._stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 295, in step
    finished, result = self._child_stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 538, in step
    finished, result = self._child_stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 295, in step
    finished, result = self._child_stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 246, in step
    return True, self._fn(self._workchain)
  File "/home/jovyan/aiida-vibroscopy/src/aiida_vibroscopy/workflows/dielectric/base.py", line 690, in run_electric_field_scfs
    node = self.submit(PwBaseWorkChain, **inputs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 544, in submit
    return self.runner.submit(process, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/runners.py", line 183, in submit
    process_inited = self.instantiate_process(process, **inputs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/runners.py", line 169, in instantiate_process
    return instantiate_process(self, process, **inputs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 64, in instantiate_process
    process = process_class(runner=runner, inputs=inputs)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 195, in __call__
    call_with_super_check(inst.init)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/utils.py", line 31, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 188, in init
    super().init()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/base/utils.py", line 16, in wrapper
    wrapped(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 309, in init
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
  File "/opt/conda/lib/python3.9/site-packages/plumpy/communications.py", line 141, in add_rpc_subscriber
    return self._communicator.add_rpc_subscriber(converted, identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 215, in add_rpc_subscriber
    return self._loop_scheduler.await_(
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 483, in add_rpc_subscriber
    identifier = await msg_subscriber.add_rpc_subscriber(subscriber, identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 124, in add_rpc_subscriber
    rpc_queue = await self._channel.declare_queue(exclusive=True, arguments=self._rmq_queue_arguments)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/robust_channel.py", line 173, in declare_queue
    queue = await super().declare_queue(
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/channel.py", line 325, in declare_queue
    await queue.declare(timeout=timeout)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/queue.py", line 92, in declare
    self.declaration_result = await asyncio.wait_for(
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 703, in queue_declare
    return await self.rpc(
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None

2023-11-17 10:06:11 [10046 | WARNING]: Process<33660>: no connection available to broadcast state change from running to excepted
2023-11-17 10:06:11 [10047 |   ERROR]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/processes.py", line 888, in on_close
    cleanup()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/communications.py", line 144, in remove_rpc_subscriber
    return self._communicator.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 221, in remove_rpc_subscriber
    return self._loop_scheduler.await_(self._communicator.remove_rpc_subscriber(identifier))
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/opt/conda/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 488, in remove_rpc_subscriber
    await msg_subscriber.remove_rpc_subscriber(identifier)
  File "/opt/conda/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 141, in remove_rpc_subscriber
    await rpc_queue.cancel(identifier)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/robust_queue.py", line 140, in cancel
    result = await super().cancel(consumer_tag, timeout, nowait)
  File "/opt/conda/lib/python3.9/site-packages/aio_pika/queue.py", line 264, in cancel
    return await asyncio.wait_for(
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 395, in basic_cancel
    return await self.rpc(
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/opt/conda/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None

2023-11-17 10:06:22 [10048 |  REPORT]: [33660|DielectricWorkChain|on_terminated]: cleaned remote folders of calculations: 33668 33679 33696"

Hi @AndresOrtegaGuerrero, the only difference I can see between running locally and inside the container is the version of RabbitMQ, so maybe you can test switching its version. If I guess correctly, you have 3.9.13, which means you are using the arm64 image. Do you have an amd64 machine to test on? It will use the recommended version of RabbitMQ, which is 3.8.15.
Or you can use RabbitMQ 3.9.13 on your local machine to see whether the issue can be reproduced.
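
To make sure the two environments really differ in the RabbitMQ version, something like the following could help. It is only a rough sketch: `parse_version` is a hypothetical helper (not part of AiiDA or kiwipy), and I am assuming you paste in the version strings yourself, for example from the output of `rabbitmqctl version` in each environment. It compares the versions as integer tuples, because comparing them as plain strings gives the wrong order for something like 3.10.0 vs 3.9.13.

```python
import re


def parse_version(text):
    """Return the first X.Y.Z version number found in text as a tuple of ints.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


# Version strings taken from this thread; replace them with what each
# environment actually reports.
container = parse_version("3.9.13")
reference = parse_version("3.8.15")

if container == reference:
    print("Same RabbitMQ version in both environments")
elif container > reference:
    print("The container runs a newer RabbitMQ than the reference version")
else:
    print("The container runs an older RabbitMQ than the reference version")
```

If both environments report the same version, the broker version is probably not what makes the difference and the cause has to be somewhere else in the container setup.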