WorkChain exception inside AiiDAlab container

Hi @jusong.yu, I will share the details with you.
My personal machine has:

 ✔ version:     AiiDA v2.4.1
 ✔ config:      /Users/ogaa/.aiida
 ✔ profile:     AndresOrtegaGuerrero
 ✔ storage:     Storage for 'AndresOrtegaGuerrero' [open] @ postgresql://aiida_qs_ogaa_19acf89387654dacae1d6dbe285b9eb7:***@localhost:5432/AndresOrtegaGuerrero_ogaa_19acf89387654dacae1d6dbe285b9eb7 / DiskObjectStoreRepository: 8be926fae05f4e6d95054f4bf44bcb3f | /Users/ogaa/.aiida/repository/AndresOrtegaGuerrero/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.12.4 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 6428

And in the container:

 ✔ version:     AiiDA v2.4.1
 ✔ config:      /home/jovyan/.aiida
 ✔ profile:     default
 ✔ storage:     Storage for 'default' [open] @ postgresql://aiida:***@localhost:5432/aiida_db / DiskObjectStoreRepository: 406d090665c941ef98807cc2109af721 | /home/jovyan/.aiida/repository/default/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.9.13 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 253

Ah, then it should not be a problem with RabbitMQ.
I notice @sphuber released a new plumpy with the fix he posted: Catch `ChannelInvalidStateError` in process state change by sphuber · Pull Request #278 · aiidateam/plumpy · GitHub. Can you show the plumpy version in your container? You need to update it with pip install -U, I guess?
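One way to report the installed plumpy version from inside the container is a short Python check. This is only a sketch: it relies on the standard-library `importlib.metadata` module, and the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names as published on PyPI.
    for name in ("plumpy", "aiida-core"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it with the same Python interpreter that the AiiDA daemon uses, otherwise it may report the version from a different environment.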

I may have been a bit quick by dismissing the possibility of this problem being related to the version of RabbitMQ. The reason we say that RabbitMQ above 3.8.15 is not supported is because the default timeout is very short (order of 10 minutes) which can cause tasks to be lost. This is not the problem we are seeing here, where it is the connection being shut down.

However, I all of a sudden do remember looking at this in the past and actually finding a problem with a particular version of RabbitMQ related to connection stability. Since we have had quite some problems with it, I thought it had to be on our end, but I now vaguely remember finding that there was a bug in RabbitMQ itself. It doesn’t strike me as too unlikely that it was in fact 3.9.13.

I have been trying to find the discussion again, but it could have been anywhere: somewhere on GitHub or on the old mailing list, so I have not been able to find it. For now, it would really be best to upgrade RabbitMQ. If you want to keep 3.9, try to install a later patch version. The latest should be 3.9.29: Release RabbitMQ 3.9.29 · rabbitmq/rabbitmq-server · GitHub

I notice @sphuber released a new plumpy with the fix he posted: Catch ChannelInvalidStateError in process state change by sphuber · Pull Request #278 · aiidateam/plumpy · GitHub.

I think they did install this update. However, the exception now crops up elsewhere in the code. Same problem, but at a different point in the process’ lifetime.

I built RabbitMQ 3.9.13 in a Docker container and ran the IRamanSpectraWorkChain of aiida-vibroscopy and it completed without problems.

(aiida-py311) sph@invader:~/code/aiida/env/dev/aiida-core$ verdi process status 154986
IRamanSpectraWorkChain<154986> Finished [0] [2:if_(should_run_average)]
    └── HarmonicWorkChain<154988> Finished [0] [4:if_(should_run_phonopy)]
        ├── generate_preprocess_data<154989> Finished [0]
        ├── PhononWorkChain<154994> Finished [0] [7:if_(should_run_phonopy)]
        │   ├── generate_preprocess_data<154999> Finished [0]
        │   ├── get_supercell<155012> Finished [0]
        │   ├── create_kpoints_from_distance<155014> Finished [0]
        │   ├── PwBaseWorkChain<155020> Finished [0] [3:results]
        │   │   └── PwCalculation<155023> Finished [0]
        │   ├── get_supercells_with_displacements<155040> Finished [0]
        │   ├── PwBaseWorkChain<155044> Finished [0] [3:results]
        │   │   └── PwCalculation<155050> Finished [0]
        │   ├── PwBaseWorkChain<155046> Finished [0] [3:results]
        │   │   └── PwCalculation<155053> Finished [0]
        │   └── generate_phonopy_data<155087> Finished [0]
        ├── DielectricWorkChain<154998> Finished [0] [11:results]
        │   ├── create_kpoints_from_distance<155000> Finished [0]
        │   ├── create_directional_kpoints<155004> Finished [0]
        │   ├── create_directional_kpoints<155007> Finished [0]
        │   ├── PwBaseWorkChain<155011> Finished [0] [3:results]
        │   │   └── PwCalculation<155017> Finished [0]
        │   ├── PwBaseWorkChain<155036> Finished [0] [3:results]
        │   │   └── PwCalculation<155039> Finished [0]
        │   ├── compute_critical_electric_field<155060> Finished [0]
        │   ├── get_accuracy_from_critical_field<155062> Finished [0]
        │   ├── get_electric_field_step<155064> Finished [0]
        │   ├── PwBaseWorkChain<155068> Finished [0] [3:results]
        │   │   └── PwCalculation<155072> Finished [0]
        │   ├── PwBaseWorkChain<155069> Finished [0] [3:results]
        │   │   └── PwCalculation<155075> Finished [0]
        │   ├── PwBaseWorkChain<155099> Finished [0] [3:results]
        │   │   └── PwCalculation<155105> Finished [0]
        │   ├── PwBaseWorkChain<155102> Finished [0] [3:results]
        │   │   └── PwCalculation<155108> Finished [0]
        │   ├── PwBaseWorkChain<155121> Finished [0] [3:results]
        │   │   └── PwCalculation<155127> Finished [0]
        │   ├── PwBaseWorkChain<155124> Finished [0] [3:results]
        │   │   └── PwCalculation<155130> Finished [0]
        │   ├── subtract_residual_forces<155143> Finished [0]
        │   └── NumericalDerivativesWorkChain<155148> Finished [0] [None]
        │       ├── generate_preprocess_data<155149> Finished [0]
        │       ├── compute_nac_parameters<155151> Finished [0]
        │       ├── compute_susceptibility_derivatives<155155> Finished [0]
        │       ├── join_tensors<155160> Finished [0]
        │       ├── join_tensors<155162> Finished [0]
        │       └── join_tensors<155164> Finished [0]
        ├── elaborate_tensors<155166> Finished [0]
        ├── generate_vibrational_data_from_phonopy<155168> Finished [0]
        ├── elaborate_tensors<155170> Finished [0]
        ├── generate_vibrational_data_from_phonopy<155172> Finished [0]
        ├── elaborate_tensors<155174> Finished [0]
        └── generate_vibrational_data_from_phonopy<155176> Finished [0]

I do notice though that in your example, the PhononWorkChain launches many more subprocesses. I believe they are launched in parallel, correct? Since this error may be dependent on many processes being run in parallel, my example may be too simplistic to trigger it.

@sphuber, are the resources you use for pw.x on the localhost or external? I ask because when I do a small test on the localhost, the workchain doesn’t present any issues.

It is run on localhost indeed.

because when I do a small test on the localhost, the workchain doesn’t present any issues

So you are saying it only fails when running in the container and pw.x is run on a computer outside of the container (over SSH?), is that right?
Could you run the exact same workchain in the container but run the calculations on the localhost? You would have to install QE in the container, but if it is running Ubuntu, you can even sudo apt install it, I believe.

I tried a relatively “bigger” system on the localhost and I got the same error.

Could you share the inputs please, the structure at least? Then I will try and run it as well and see if I can reproduce it with RabbitMQ 3.9.13

data_image0
_chemical_formula_structural       MoS2MoS2MoS2MoS2
_chemical_formula_sum              "Mo4 S8"
_cell_length_a       6.38448
_cell_length_b       6.38448
_cell_length_c       13.3783
_cell_angle_alpha    90
_cell_angle_beta     90
_cell_angle_gamma    120

_space_group_name_H-M_alt    "P 1"
_space_group_IT_number       1

loop_
  _space_group_symop_operation_xyz
  'x, y, z'

loop_
  _atom_site_type_symbol
  _atom_site_label
  _atom_site_symmetry_multiplicity
  _atom_site_fract_x
  _atom_site_fract_y
  _atom_site_fract_z
  _atom_site_occupancy
  Mo  Mo1       1.0  0.16667  0.33333  0.75000  1.0000
  S   S1        1.0  0.33333  0.16667  0.63308  1.0000
  S   S2        1.0  0.33333  0.16667  0.86692  1.0000
  Mo  Mo2       1.0  0.16667  0.83333  0.75000  1.0000
  S   S3        1.0  0.33333  0.66667  0.63308  1.0000
  S   S4        1.0  0.33333  0.66667  0.86692  1.0000
  Mo  Mo3       1.0  0.66667  0.33333  0.75000  1.0000
  S   S5        1.0  0.83333  0.16667  0.63308  1.0000
  S   S6        1.0  0.83333  0.16667  0.86692  1.0000
  Mo  Mo4       1.0  0.66667  0.83333  0.75000  1.0000
  S   S7        1.0  0.83333  0.66667  0.63308  1.0000
  S   S8        1.0  0.83333  0.66667  0.86692  1.0000

Hi Sebastiaan,
We are trying to connect to a remote HPC from a conda-based AiiDA image in a Singularity container, which runs on a compute node of the HPC (VNC).

Thanks
Arun prasad.

This one is a smaller CIF; I got the same issue with the DielectricWorkChain.

# generated using pymatgen
data_WS2
_symmetry_space_group_name_H-M   P6_3/mmc
_cell_length_a   3.18422289
_cell_length_b   3.18422289
_cell_length_c   12.97828236
_cell_angle_alpha   90.00000000
_cell_angle_beta   90.00000000
_cell_angle_gamma   120.00000000
_symmetry_Int_Tables_number   194
_chemical_formula_structural   WS2
_chemical_formula_sum   'W2 S4'
_cell_volume   113.96061147
_cell_formula_units_Z   2
loop_
 _symmetry_equiv_pos_site_id
 _symmetry_equiv_pos_as_xyz
  1  'x, y, z'
  2  '-x, -y, -z'
  3  'x-y, x, z+1/2'
  4  '-x+y, -x, -z+1/2'
  5  '-y, x-y, z'
  6  'y, -x+y, -z'
  7  '-x, -y, z+1/2'
  8  'x, y, -z+1/2'
  9  '-x+y, -x, z'
  10  'x-y, x, -z'
  11  'y, -x+y, z+1/2'
  12  '-y, x-y, -z+1/2'
  13  '-y, -x, -z+1/2'
  14  'y, x, z+1/2'
  15  '-x, -x+y, -z'
  16  'x, x-y, z'
  17  '-x+y, y, -z+1/2'
  18  'x-y, -y, z+1/2'
  19  'y, x, -z'
  20  '-y, -x, z'
  21  'x, x-y, -z+1/2'
  22  '-x, -x+y, z+1/2'
  23  'x-y, -y, -z'
  24  '-x+y, y, z'
loop_
 _atom_type_symbol
 _atom_type_oxidation_number
  W4+  4.0
  S2-  -2.0
loop_
 _atom_site_type_symbol
 _atom_site_label
 _atom_site_symmetry_multiplicity
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
  W4+  W0  2  0.33333333  0.66666667  0.25000000  1
  S2-  S1  4  0.33333333  0.66666667  0.62980494  1

Thanks for the CIF @AndresOrtegaGuerrero . I have been running some tests against different versions of RabbitMQ and I am now pretty sure that the exception you are seeing is in fact due to the RabbitMQ version.

As is explained in AiiDA’s documentation, versions of RabbitMQ newer than 3.8.15 add a default timeout that essentially kills any task that runs longer than 30 minutes. It is possible to use more modern versions of RabbitMQ, but it requires the consumer_timeout parameter to be configured correctly for the server.
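To make the version rule concrete, here is a minimal sketch that flags which of the RabbitMQ versions mentioned in this thread would need consumer_timeout configured. It takes the "newer than 3.8.15" wording literally (whether 3.8.15 itself is affected should be checked against the AiiDA documentation), and the function names are hypothetical.

```python
from typing import Tuple

# Threshold taken literally from the thread ("newer than 3.8.15"); an assumption.
THRESHOLD = (3, 8, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '3.9.13' or 'v3.9.13' into (3, 9, 13)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def needs_consumer_timeout(rabbitmq_version: str) -> bool:
    """True if this version is newer than the threshold and so has the default timeout."""
    return parse_version(rabbitmq_version) > THRESHOLD


if __name__ == "__main__":
    for candidate in ("3.8.14", "3.9.13", "3.10.25", "3.12.4"):
        print(candidate, needs_consumer_timeout(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, "3.10.25" sorts before "3.8.15", which would give the wrong answer.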

I ran your workchain against a number of RabbitMQ versions (3.9.13, 3.9.25 and 3.10.25) and the workchain failed for all of them with exactly the same exception. However, when I configured the consumer_timeout, the workchains all ran fine. The fact that it worked on your local machine with 3.12.4 is probably because you ran a smaller test system that finished within 30 minutes, so the problem didn’t manifest itself.

I am a bit surprised that you say that the container you are using is the aiidalab-launch container. That should be maintained I believe by @jusong.yu who should be aware of the RabbitMQ problem. Jason, should that container be used by users? Is it advertised anywhere? If so, we should either make sure it properly configures the RabbitMQ server or we should not advertise the image.

As for Andres: a solution for now is to no longer use the RabbitMQ of the container. You can create another container with an instance that is correctly configured as follows.

  1. Create the file advanced.config with the following content:
%% advanced.config
[ 
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
  2. Create a new container with RabbitMQ, mounting this config file. Run the following command from the directory containing the config file:
# -p binds port 5672 of the container to port 5671 of localhost
# -v mounts the config file into the container
# rabbitmq:3.9.25 is the image tag; you can pick any version here
docker run -d --hostname rabbitmq --name rabbitmq \
    -p 5671:5672 \
    -v "$(pwd)/advanced.config:/etc/rabbitmq/advanced.config" \
    rabbitmq:3.9.25

I map the port to 5671 just in case you already have another RabbitMQ server running on 5672 on the host machine. If that is not the case, you can remove it. If you do keep the mapping, make sure to update your profile in AiiDA to point to the correct port in the config.json.
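Before editing the profile, it may help to confirm which of the two ports actually accepts connections. The sketch below only tests whether a TCP connection can be opened; it does not perform an AMQP handshake, so it cannot tell RabbitMQ apart from any other service listening on that port. The function name `broker_port_open` is hypothetical.

```python
import socket


def broker_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 5671 is the remapped port from the docker command, 5672 the RabbitMQ default.
    for port in (5671, 5672):
        state = "open" if broker_port_open("127.0.0.1", port) else "closed"
        print(f"127.0.0.1:{port} is {state}")
```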

Now you should have a working setup and your workchains should no longer except after 30 minutes.

Hi @sphuber, thank you for the help and time. On my personal machine with RabbitMQ v3.12.4 I ran the WorkChain with a bigger system (big cell, 18 atoms in the unit cell), and the resources used for pw.x were from daint. This one finished with no problem. I am working with the AiiDAlab team to integrate this workchain in the QE App (using the container), so I will discuss it with @jusong.yu.

Hi @AndresOrtegaGuerrero can you provide the image and the tag you are using (also the digest number of the image, because I guess you are using arm64)? It’ll be easy to reproduce with the container.

Yes, I am aware of the version issue and did add consumer_timeout in https://github.com/aiidalab/aiidalab-docker-stack/blob/dfa65151017362fefeb56d97fed3c1b8f25537c5/stack/base-with-services/before-notebook.d/30_start-rabbitmq-arm64.sh#L19-L23

I am now very sure @AndresOrtegaGuerrero is using the arm64 image on his MacBook. This image is not used very often in production. Maybe the consumer_timeout is not set up correctly?

If so, we should either make sure it properly configures the RabbitMQ server or we should not advertise the image.

Andres is using the AiiDAlab image, and the amd64 variant uses RMQ==3.8.14 from conda, so that should all be fine (as in all of our AiiDAlab production deployments). I am more worried about the aiida-core-with-services image, where only rmq==3.9.13 is installed for both arm64 and amd64. I’ll check it now. But I haven’t advertised it yet and only use it myself; users can only find it officially through the documentation.

Update: I checked the aiida-core-with-services:v2.4.1 image. I started the image, opened a bash shell in the container as root, and confirmed that the consumer_timeout setting is in effect.

(base) root@31963b87819f:~# rabbitmqctl environment | grep consumer
      {consumer_timeout,3600000},
      {dead_letter_worker_consumer_prefetch,32},
      {default_consumer_prefetch,{false,0}},

Oh no, the unit of time is milliseconds. Although the change is active, the value is too short: 3600000 ms is only one hour. I thought the unit was seconds when I wrote it.
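The unit mix-up is easy to catch by converting the reported value. Here is a minimal sketch that pulls consumer_timeout out of text shaped like the `rabbitmqctl environment` output above and prints it in minutes. The regular expression assumes exactly that output format, and both function names are hypothetical.

```python
import re
from typing import Optional

# Matches "{consumer_timeout,3600000}" or "{consumer_timeout, undefined}".
_PATTERN = re.compile(r"\{consumer_timeout,\s*(\d+|undefined)\}")


def consumer_timeout_ms(environment_output: str) -> Optional[int]:
    """Return the timeout in milliseconds, or None if it is set to `undefined`.

    Raises ValueError if the setting does not appear in the text at all.
    """
    match = _PATTERN.search(environment_output)
    if match is None:
        raise ValueError("consumer_timeout not found in the given output")
    value = match.group(1)
    return None if value == "undefined" else int(value)


def describe(timeout_ms: Optional[int]) -> str:
    """Render the timeout in a human-readable form."""
    if timeout_ms is None:
        return "consumer timeout disabled (undefined)"
    return f"{timeout_ms} ms = {timeout_ms / 60_000:g} minutes"


if __name__ == "__main__":
    sample = """
      {consumer_timeout,3600000},
      {dead_letter_worker_consumer_prefetch,32},
      {default_consumer_prefetch,{false,0}},
    """
    print(describe(consumer_timeout_ms(sample)))  # 3600000 ms = 60 minutes
```

A workchain that runs for longer than those 60 minutes would still hit the timeout, which matches the conclusion above that the configured value is too short.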
