KeyError: "duplicate label in namespace" when running workchain

I am trying to solve the issue described in Excepted workchains, due to strange error from kiwipy/plumpy? - #20 by Anooja_Jayaraj using the solution from Excepted workchains, due to strange error from kiwipy/plumpy? - #21 by sphuber. I get the following error:

verdi process report 10292
2024-02-29 09:55:00 [2553 | REPORT]: [10292|CoulombDiamondsWorkChain|on_except]: Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/plumpy/process_states.py", line 227, in execute
    result = self.run_fn(*self.args, **self.kwargs)
  File "/home/aiida/aiida-core/src/aiida/engine/processes/workchains/workchain.py", line 313, in _do_step
    finished, stepper_result = self._stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 295, in step
    finished, result = self._child_stepper.step()
  File "/opt/conda/lib/python3.9/site-packages/plumpy/workchains.py", line 246, in step
    return True, self._fn(self._workchain)
  File "/home/aiida/plugins/aiida-quantum-transport/src/aiida_quantum_transport/workchains/coulomb_diamonds.py", line 322, in run_dmft_converge_mu
    "remote_results_folder": self.ctx.hybridization.outputs.remote_results_folder,
  File "/home/aiida/aiida-core/src/aiida/orm/utils/managers.py", line 132, in __getattr__
    return self._get_node_by_link_label(label=name)
  File "/home/aiida/aiida-core/src/aiida/orm/utils/managers.py", line 85, in _get_node_by_link_label
    attribute_dict = self._construct_attribute_dict(self._incoming)
  File "/home/aiida/aiida-core/src/aiida/orm/utils/managers.py", line 57, in _construct_attribute_dict
    return AttributeDict(links.nested())
  File "/home/aiida/aiida-core/src/aiida/orm/utils/links.py", line 367, in nested
    raise KeyError(f"duplicate label '{port_name}' in namespace '{'.'.join(port_namespaces)}'")
KeyError: "duplicate label 'hybridization_file' in namespace ''"

If I look at the outgoing links of this node, I see that there are multiple links carrying the same link label (pointing to distinct nodes), as shown below:

In [3]: hybr = node.called_descendants[-1]

In [4]: hybr.base.links.get_outgoing().all()
Out[4]: 
[LinkTriple(node=<FolderData: uuid: e068bdd3-e5d9-4d5a-8d5d-3763c3dd6216 (pk: 11596)>, link_type=<LinkType.CREATE: 'create'>, link_label='retrieved'),
 LinkTriple(node=<RemoteData: uuid: 1af5e643-2dce-4033-88e0-e9a969ce6f41 (pk: 11598)>, link_type=<LinkType.CREATE: 'create'>, link_label='remote_results_folder'),
 LinkTriple(node=<SinglefileData: uuid: 794979f2-1811-472a-997a-0f535c1deffc (pk: 11599)>, link_type=<LinkType.CREATE: 'create'>, link_label='hybridization_file'),
 LinkTriple(node=<SinglefileData: uuid: bffe2769-3f42-453d-a308-3ac1e093910a (pk: 11601)>, link_type=<LinkType.CREATE: 'create'>, link_label='energies_file'),
 LinkTriple(node=<SinglefileData: uuid: 864c0dad-9ce5-4fc8-b869-c4fd3a89b131 (pk: 11603)>, link_type=<LinkType.CREATE: 'create'>, link_label='hamiltonian_file'),
 LinkTriple(node=<SinglefileData: uuid: 7739ee7d-d1b8-4e43-aa6d-6763aa47c8d5 (pk: 11605)>, link_type=<LinkType.CREATE: 'create'>, link_label='eigenvalues_file'),
 LinkTriple(node=<SinglefileData: uuid: c2c2cfbd-2a2d-4248-af68-9a4e04bcf5e1 (pk: 11607)>, link_type=<LinkType.CREATE: 'create'>, link_label='matsubara_hybridization_file'),
 LinkTriple(node=<SinglefileData: uuid: 19f4a2ce-1a7d-4fe3-bf7d-2d99ab78225c (pk: 11609)>, link_type=<LinkType.CREATE: 'create'>, link_label='matsubara_energies_file'),
 LinkTriple(node=<SinglefileData: uuid: 8b29c576-6856-41c8-8529-2b0e236b9fdc (pk: 11611)>, link_type=<LinkType.CREATE: 'create'>, link_label='occupancies_file'),
 LinkTriple(node=<SinglefileData: uuid: a76585f6-e69b-4181-90fd-9773075448e2 (pk: 11600)>, link_type=<LinkType.CREATE: 'create'>, link_label='hybridization_file'),
 LinkTriple(node=<SinglefileData: uuid: fe333441-a266-418c-be90-dc4f8532c182 (pk: 11602)>, link_type=<LinkType.CREATE: 'create'>, link_label='energies_file'),
 LinkTriple(node=<SinglefileData: uuid: a37907b5-d697-4743-8c72-e43f17fbbeee (pk: 11604)>, link_type=<LinkType.CREATE: 'create'>, link_label='hamiltonian_file'),
 LinkTriple(node=<SinglefileData: uuid: 53454fcf-754b-4f2a-98a5-05bd4c4124f6 (pk: 11606)>, link_type=<LinkType.CREATE: 'create'>, link_label='eigenvalues_file'),
 LinkTriple(node=<SinglefileData: uuid: cc63d07d-82ba-4efa-9994-68dd011a8648 (pk: 11608)>, link_type=<LinkType.CREATE: 'create'>, link_label='matsubara_hybridization_file'),
 LinkTriple(node=<SinglefileData: uuid: 619668da-79b0-409b-b022-626cd392fec5 (pk: 11610)>, link_type=<LinkType.CREATE: 'create'>, link_label='matsubara_energies_file'),
 LinkTriple(node=<SinglefileData: uuid: d789c408-5564-4d9d-927f-fbdf845f562d (pk: 11612)>, link_type=<LinkType.CREATE: 'create'>, link_label='occupancies_file'),
 LinkTriple(node=<RemoteData: uuid: 7553e91e-374e-4ebb-927c-5d232c77fa32 (pk: 11219)>, link_type=<LinkType.CREATE: 'create'>, link_label='remote_folder')]
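For reference, here is a minimal sketch (run inside verdi shell) that surfaces the duplicated labels; the pk 11595 is hypothetical, substitute the pk of the offending calculation node:

from collections import Counter

from aiida.orm import load_node

# Load the calculation whose outputs triggered the KeyError (pk is hypothetical).
node = load_node(11595)

# Count how often each outgoing link label occurs; anything above 1 is a duplicate.
labels = [triple.link_label for triple in node.base.links.get_outgoing().all()]
duplicates = {label: count for label, count in Counter(labels).items() if count > 1}
print(duplicates)  # e.g. {'hybridization_file': 2, 'energies_file': 2, ...}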

I am not sure whether this issue is linked to my initial issue or to the fix for that issue. Rather, as far as I understand from other threads, this issue might be due to running my workchains with several daemon workers, leading to multiple workers trying to access the same calculations. I am going to assume I won’t get this issue if I run with just one daemon worker (I haven’t tested it). Is there a solution to this issue? Will it be solved by downgrading my RabbitMQ to ~=3.7? (I currently use version 3.10.18.)

That behavior would indeed be consistent with an unsupported version of RabbitMQ and multiple daemon workers. Instead of downgrading RabbitMQ, you can also make your RabbitMQ installation compatible by configuring it. See this wiki page for instructions: RabbitMQ version to use · aiidateam/aiida-core Wiki · GitHub
At least for Linux-based OSes that works pretty well and is probably easier than downgrading the version.

Hey @sphuber, I was working with Anooja on this issue today. Let me note a few things:

  1. A similar issue was reported by Giovanni a couple of years back
  2. Anooja is running on an aiida-core-with-services container
  3. I tried the instructions in the RabbitMQ configuration article you referenced. However, the service rabbitmq-server restart step fails, as rabbitmq-server is not recognized as a service.

Hmm, that is interesting. If you are working with the container, that should already contain the fix that I mentioned. The rabbitmq-server is indeed not registered as a system service, which is not really possible in containers, so it is not surprising that that command doesn’t work. Does sudo rabbitmqctl environment work? We would have to check that the consumer_timeout setting is properly disabled.

I am not familiar with the container, though, so we probably also should get @jusong.yu in here who has developed it.

What would be helpful is to try and run with just a single daemon worker. If the problem is related to RabbitMQ then having just a single daemon worker should not cause these problems. That could help narrow it down a bit.
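For instance, something along these lines should restart the daemon with exactly one worker (verdi daemon start accepts the number of workers):

verdi daemon stop
verdi daemon start 1
verdi daemon status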

I talked to Edan in person and don’t have much of an idea why the container would cause the issue. He also mentioned that the issue did not occur with a single daemon worker (@edan-bainglass correct me if I am wrong).

Since this issue happens consistently (although it fails on random calcjobs), maybe @Anooja_Jayaraj can provide the plugin and the detailed steps for how you run it. What I propose to try is using the aiida-core-base image with a different version of the RabbitMQ container in docker-compose.

I hope and think it should not be a problem with running inside the container, since we have a similar setup in AiiDAlab and run things in production there.

At least for Linux-based OSes that works pretty well and is probably easier than downgrading the version.

This is exactly what we have in the container.

Right, I forgot to mention that indeed sudo rabbitmqctl environment | grep consumer_timeout did work and returned 3600000 (1000 hrs), so even if it is not undefined as your fix recommends, it is surely sufficient. And yet race conditions occur.

And yes, as @jusong.yu mentioned, Anooja did not experience these issues when working with a single daemon worker.

I’ll work with Anooja to test further (docker-compose suggestion, off the container, etc.)

The unit of the returned value is milliseconds, not seconds: if it returns 3600000, that is actually 1 hour, which is not enough. But in Docker: Set RMQ consumer_timeout to undefined by unkcpz · Pull Request #6189 · aiidateam/aiida-core · GitHub we did change it to undefined.
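For reference, disabling the timeout altogether is done through RabbitMQ’s advanced.config; this snippet follows the RabbitMQ documentation, and the exact file location inside the image may differ:

%% /etc/rabbitmq/advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].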

Can you check which image you are using? (Run docker image list and check the image ID.)

Also, I am a bit confused about how you ran a command in the container with sudo. We didn’t set a password for the aiida user, so in principle you cannot sudo. But sure, if you are in a shell you can run rabbitmqctl without sudo and get the output.

Sorry, not sudo rabbitmqctl: I first hopped onto the container as root and then ran rabbitmqctl.
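That is, something like the following (the container name is a placeholder):

docker exec --user root -it <container-name> /bin/bash
rabbitmqctl environment | grep consumer_timeout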

Okay, if the units are milliseconds, then indeed that may be it, at least partially. I’ll see if switching to the latest image solves the issue. Note that if it does, several documentation sources should be updated, including the wiki guide @sphuber referenced up top.

@jusong.yu and I were discussing creating a channel here for AiiDA troubleshooting - a collection of common “bugs” and their solutions.

That said, it is unclear how (if the above indeed fixes the issue) the consumer_timeout fixes race conditions. I’ll take a closer look at the kiwipy docs, as well as the AEP to replace RabbitMQ.

Do you mean the wiki? Because the Linux instructions suggest disabling the timeout completely instead of setting a long one, and the macOS instructions correctly mention that the value is in milliseconds.

When a daemon worker gets a task from RabbitMQ to run a process, RabbitMQ locks that task to that worker, guaranteeing the process is not being run by any other worker. However, RabbitMQ has this consumer timeout: if the task is not acknowledged within that time (meaning the worker has not reported it as fully completed), the broker will reschedule it and send it to any other worker that is listening. With multiple workers, a second worker may then get the task while the original one is still working on it. Now you have two workers running the same task, and you get these exceptions.
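To make that failure mode concrete, here is a generic pika sketch; this is not AiiDA’s actual kiwipy machinery, and the queue name and sleep duration are made up:

import time

import pika

# Connect as a consumer with manual acknowledgement (pika's default).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="process_tasks")

def on_message(ch, method, properties, body):
    time.sleep(7200)  # stand-in for a process step that outlives consumer_timeout
    # With consumer_timeout = 3600000 (1 h), the broker has closed the channel
    # long before this point and redelivered the task to another worker, so two
    # workers end up running the same process.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="process_tasks", on_message_callback=on_message)
channel.start_consuming()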

I mean that the wiki instructions do not apply to containers.

Well there you go. That is precisely the issue. Thanks for the clear explanation @sphuber.

Thanks @sphuber and @jusong.yu. The latest aiida-core-with-services image indeed resolved the above issue.
