Unexpected Daemon Crash

Dear AiiDA developers,

I am experiencing an issue with running AiiDA v2.7.3 on my macOS computer.

I initially installed AiiDA with PostgreSQL v14.21 (via Homebrew) and RabbitMQ v4.2.4 (via the Generic Unix Binary Build*).

When I start the server, everything appears to work correctly, and the daemon successfully submits jobs, sometimes running for several minutes. However, after some time, the daemons, which are surprisingly slow, crash. The Circus log file contains the following error:

2026-02-26 16:01:35 circus[18649] [INFO] Starting master on pid 18649
2026-02-26 16:01:35 circus[18649] [INFO] Arbiter now waiting for commands
2026-02-26 16:01:35 circus[18649] [INFO] aiida-aiida started
2026-02-26 16:01:35 circus[18649] [INFO] circusd-stats started
2026-02-26 16:01:35 circus[18651] [INFO] Starting the stats streamer
2026-02-26 16:04:33 circus[18649] [INFO] Arbiter exiting
2026-02-26 16:04:33 circus[18651] [INFO] Stats streamer stopped
2026-02-26 16:04:33 tornado.general[18651] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104766ea0> attached to closed socket: Socket operation on non-socket
2026-02-26 16:04:33 circus[18651] [INFO] Stats streamer stopped
2026-02-26 16:04:33 circus[18651] [INFO] Stats streamer stopped
2026-02-26 16:04:33 circus[18649] [INFO] circusd-stats stopped
2026-02-26 16:04:34 circus[18649] [INFO] aiida-aiida stopped
2026-02-26 16:04:35 circus[19070] [INFO] Starting master on pid 19070
2026-02-26 16:04:35 circus[19070] [INFO] Arbiter now waiting for commands
2026-02-26 16:04:35 circus[19070] [INFO] aiida-aiida started
2026-02-26 16:04:35 circus[19070] [INFO] circusd-stats started
2026-02-26 16:04:35 circus[19072] [INFO] Starting the stats streamer
2026-02-26 16:05:00 circus[19070] [INFO] Arbiter exiting
2026-02-26 16:05:00 circus[19072] [INFO] Stats streamer stopped
2026-02-26 16:05:00 tornado.general[19072] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x111b33770> attached to closed socket: Socket operation on non-socket
2026-02-26 16:05:00 circus[19072] [INFO] Stats streamer stopped
2026-02-26 16:05:00 circus[19072] [INFO] Stats streamer stopped
2026-02-26 16:05:00 circus[19070] [INFO] circusd-stats stopped
2026-02-26 16:05:01 tornado.application[19070] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x114500b00>>
Traceback (most recent call last):
  File "/Users/gjoalland/Library/Python/3.12/lib/python/site-packages/tornado/ioloop.py", line 945, in _run
    val = self.callback()
          ^^^^^^^^^^^^^^^
  File "/Users/gjoalland/Library/Python/3.12/lib/python/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running watcher_decr command
2026-02-27 14:14:00 tornado.application[20783] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x10b45d430>>
Traceback (most recent call last):
  File "/Users/gjoalland/Library/Python/3.12/lib/python/site-packages/tornado/ioloop.py", line 945, in _run
    val = self.callback()
          ^^^^^^^^^^^^^^^
  File "/Users/gjoalland/Library/Python/3.12/lib/python/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2026-03-02 14:40:42 circus[20783] [INFO] aiida-aiida stopped

Could someone please help me find a solution to this error?

Thank you in advance.

*I also attempted to install RabbitMQ directly via Homebrew, but encountered Erlang compatibility issues (as mentioned here). This led me to switch to the Generic Unix Binary Build.

Hi,

I’m also on macOS so I can relate to the RabbitMQ struggles! :grinning_face_with_smiling_eyes:

Looking at your Circus logs, the error:
ConflictError: arbiter is already running watcher_decr command

This usually happens when RabbitMQ drops the connection
unexpectedly and Circus receives conflicting stop/restart
commands at the same time. It's not really an AiiDA bug
itself, but more of a RabbitMQ stability issue on macOS.

A few things that might help:

  1. First check if RabbitMQ is actually stable:
    rabbitmqctl status

    Check the RabbitMQ logs around the same timestamps
    as your Circus crashes — you’ll likely see connection
    drops there.

  2. Try adding a heartbeat to your rabbitmq.conf:
    heartbeat = 60

    This helped stabilize my setup on macOS.

  3. Also worth trying:
    verdi config set daemon.timeout 60
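
To make step 1 concrete, here is a small sketch for correlating the two logs around a crash time. Note that the log paths below are assumptions (typical locations for an AiiDA profile named `aiida` and a Homebrew RabbitMQ install on Apple Silicon); adjust them to your machine.

```python
"""Sketch: print Circus and RabbitMQ log entries from the same minute,
to check whether a daemon crash coincides with a broker connection drop.
The log paths are assumptions -- adjust them to your setup."""

from pathlib import Path


def entries_around(log_path, minute_prefix):
    """Return log lines whose timestamp starts with the given minute,
    e.g. '2026-02-26 16:04'. Missing files yield an empty list."""
    path = Path(log_path).expanduser()
    if not path.exists():
        return []
    with path.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in handle
                if line.startswith(minute_prefix)]


if __name__ == "__main__":
    crash_minute = "2026-02-26 16:04"  # taken from the Circus log above
    logs = [
        ("circus", "~/.aiida/daemon/log/circus-aiida.log"),       # assumed path
        ("rabbitmq", "/opt/homebrew/var/log/rabbitmq/rabbit.log"),  # assumed path
    ]
    for name, log in logs:
        print(f"--- {name} entries around {crash_minute} ---")
        for entry in entries_around(log, crash_minute):
            print(entry)
```

If the RabbitMQ log shows `client unexpectedly closed TCP connection` warnings in the same minute the Circus arbiter exits, that would support the broker-stability theory.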

The Generic Unix Binary Build for RabbitMQ can sometimes
be a bit flaky on macOS compared to a native install.
I know you mentioned Erlang compatibility issues with
Homebrew — did you try the specific Erlang version
mentioned in the RabbitMQ compatibility matrix?
That fixed the Homebrew install for me personally.

Hope this helps! Let me know if you find the root cause;
I'm curious to know what ends up being the fix on your end too!

Thank you for your feedback.

Yes, I installed the specific version of Erlang compatible with the version of RabbitMQ I installed. I did it using Homebrew, in fact: brew install erlang@27

Interesting! I tried installing RabbitMQ after having installed erlang@27. But when I type brew install rabbitmq, it automatically installs the latest release of Erlang (version 28) as well, ignoring erlang@27. How did you manage to force Homebrew to use erlang@27 instead?

According to rabbitmqctl status, the server should be running smoothly, but indeed in the logs I see some connection errors from time to time:
2026-02-26 15:36:53.904281+01:00 [warning] <0.1473.0> client unexpectedly closed TCP connection

I have been testing your suggestions during the past two days, and unfortunately the daemon still crashes after some time with the same connection error.

Thanks @gjoalland13 for opening the topic, and @aryansri05 for your comments! As a fellow macOS user - with a Homebrew install - I’m somewhat surprised and saddened by your poor experience. I’ve been using RabbitMQ on my machine without any issues.

I’m admittedly no expert on the AiiDA engine, so debugging issues with the daemon is always tricky. The logs can also be deceiving and give a lot of warnings/errors that are unrelated to the actual issue you are facing. While investigating I also saw a lot of warnings related to the connection in the RabbitMQ logs:

2026-03-06 10:59:56.652014+01:00 [warning] <0.919.0> client unexpectedly closed TCP connection

But that apparently was related to verdi status not properly closing the broker connection (I opened a PR to fix this, see 🐛 `verdi status`: close broker connection after check by mbercx · Pull Request #7269 · aiidateam/aiida-core · GitHub).

Similarly, I’m unsure if the error you report in the circus logs is the actual cause of your troubles, and if this is related to RabbitMQ at all. circus takes care of managing the daemon workers. As I understand it, the error you are seeing is simply a consequence of your verdi command conflicting with the manage_watchers command that circus runs every second or so. I can reproduce it pretty consistently with:

for i in (seq 1 10); verdi daemon start; sleep 1;  verdi daemon stop; end

(fish example, adapt for your fav shell). Tracking my circus logs, I get plenty of:

2026-03-06 11:47:18 tornado.application[45922] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x125e84430>>
Traceback (most recent call last):
  File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/tornado/ioloop.py", line 937, in _run
    val = self.callback()
  File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command

What I think happens is that while arbiter_stop is running, triggered by verdi daemon stop, the manage_watchers command runs, which then conflicts with the arbiter_stop. In your other error, you most likely ran verdi daemon decr, which triggers watcher_decr, which then raises the corresponding ConflictError when manage_watchers is run.
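
To illustrate the mechanism, here is a simplified sketch of the "one command at a time" pattern that produces this error. This is not circus's actual implementation, just a minimal model of the behavior: a second command arriving while another holds the arbiter raises instead of waiting.

```python
"""Simplified sketch (not circus's real code) of why manage_watchers
conflicts with a long-running arbiter command like arbiter_stop."""


class ConflictError(Exception):
    pass


class Arbiter:
    def __init__(self):
        # Name of the command currently holding the arbiter, if any.
        self._current_command = None

    def _run_command(self, name, func):
        if self._current_command is not None:
            # Mirrors circus's message: "arbiter is already running %s command"
            raise ConflictError(
                f"arbiter is already running {self._current_command} command"
            )
        self._current_command = name
        try:
            return func()
        finally:
            self._current_command = None

    def arbiter_stop(self):
        # While stopping, simulate the periodic manage_watchers callback
        # firing mid-command -- this is what triggers the conflict.
        return self._run_command("arbiter_stop", self.manage_watchers)

    def manage_watchers(self):
        return self._run_command("manage_watchers", lambda: None)


if __name__ == "__main__":
    arbiter = Arbiter()
    try:
        arbiter.arbiter_stop()
    except ConflictError as exc:
        print(exc)
```

Running this prints `arbiter is already running arbiter_stop command`, matching the traceback above. In this model the error is harmless noise from overlapping commands, not evidence of a broker failure.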

Also see the following (rather old, but I think still correct) comment:

So at this stage I haven’t really seen any evidence that RabbitMQ is to blame for your daemon crashing. Maybe we should have a debugging session on Teams to see if we can’t figure it out together. :slight_smile: How reliably can you reproduce the issue, and confidently report the time the crash happened?


Thanks @mbercx for your answer!

How reliably can you reproduce the issue, and confidently report the time the crash happened?

I can pinpoint the exact moment the issue occurs, as I’ve run several tests and encountered it often. However, I’m not sure how to reliably reproduce it on demand.

Regarding the rest of your message, I'm starting to wonder if the daemon crashes due to traffic congestion (e.g., overloading). I've used up to 80% of the available daemon worker slots, so in theory it probably isn't this, but it might at least explain why the daemons are slow. This would also align with the fact that you can reproduce the error by spamming the daemon workers with instructions.

I’d really appreciate your help with this! Scheduling a debugging session might be the most efficient way to sort it out indeed!

Great, I’ll DM you so we can get to the bottom of this. :slight_smile: