Problem running on HPC: everything failed after logging back in

Hi all,

I was able to install AiiDA through conda on an HPC cluster, and I successfully submitted my calculation. But then I had to restart my computer, and when I logged back in everything had changed.

For example, before the restart:

(aiidaENV) rkarkee@ch-fe2:/lustre/scratch5/rkarkee> verdi status
 ✔ version:     AiiDA v2.5.1
 ✔ config:      /users/rkarkee/.aiida
 ✔ profile:     quicksetup
 ✔ storage:     Storage for 'quicksetup' [open] @ postgresql://aiida_qs_rkarkee_20554bcc4bead70a3479c4ef8d5f1f4e:***@localhost:5434/quicksetup_rkarkee_20554bcc4bead70a3479c4ef8d5f1f4e / DiskObjectStoreRepository: 362d855165524c4f94089555080ab634 | /users/rkarkee/.aiida/repository/quicksetup/container
 ✔ rabbitmq:    Connected to RabbitMQ v3.8.14 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✔ daemon:      Daemon is running with PID 218123


(aiidaENV) rkarkee@ch-fe2:/lustre/scratch5/rkarkee> verdi process status 24
IRamanSpectraWorkChain<24> Waiting [None]
    └── HarmonicWorkChain<26> Waiting [1:if_(should_run_parallel)]
        ├── generate_preprocess_data<27> Finished [0]
        ├── PhononWorkChain<32> Waiting [2:run_base_supercell]
        │   ├── generate_preprocess_data<33> Finished [0]
        │   ├── get_supercell<39> Finished [0]
        │   └── PwBaseWorkChain<42> Waiting [2:while_(should_run_process)(1:run_process)]
        │       └── PwCalculation<48> Waiting
        └── DielectricWorkChain<38> Waiting [2:run_base_scf]
            └── PwBaseWorkChain<45> Waiting [2:while_(should_run_process)(1:run_process)]
                └── PwCalculation<51> Waiting

And my jobs were also in the queue successfully:
  10444536  standard aiida-48  rkarkee PD       0:00      1 (Priority)
  10444537  standard aiida-51  rkarkee PD       0:00      1 (Priority)

After the restart:

(aiidaENV) rkarkee@ch-fe1:~> verdi status
 ✔ version:     AiiDA v2.5.1
 ✔ config:      /users/rkarkee/.aiida
 ✔ profile:     quicksetup
 ✘ storage:     Unable to connect to profile's storage.
Error: UnreachableStorage: Could not connect to database: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5434 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

(Background on this error at: https://sqlalche.me/e/20/e3q8)
 ✔ rabbitmq:    Connected to RabbitMQ v3.8.14 as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ✘ daemon:      The daemon could not be reached, seemingly because of a stale PID file. Either stop or start the daemon to remove it and restore the daemon to a functional state.



(aiidaENV) rkarkee@ch-fe1:~> verdi process status 24
Critical: Could not connect to database: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5434 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

(Background on this error at: https://sqlalche.me/e/20/e3q8)

Can you please suggest how I can fix this?

Apparently Postgres is not running, as suggested by the error message. If you installed with conda, did you also install Postgres in there as a service? If so, each time you restart the machine, you have to manually start that service. See the documentation: Installation into Conda environment — AiiDA 2.5.1.post0 documentation
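
As a rough illustration of "manually start that service", here is a minimal sketch of a restart helper. It assumes Postgres was initialised by hand inside the conda environment; `PGDATA_DIR` and `PGLOG_FILE` are placeholder paths that are not taken from this thread, and `restart_aiida_storage` and `run` are hypothetical helper names. Port 5434 is the one shown in the `verdi status` output above.

```shell
#!/bin/sh
# Sketch only: restart a user-level PostgreSQL server and the AiiDA daemon.
# PGDATA_DIR and PGLOG_FILE are placeholders (assumptions); adjust to your setup.
PGDATA_DIR="${PGDATA_DIR:-$HOME/aiida_db}"
PGLOG_FILE="${PGLOG_FILE:-$HOME/aiida_db.log}"
PG_PORT="${PG_PORT:-5434}"   # port from the `verdi status` output

# Print the command instead of running it when DRY_RUN=1.
# (The printed line does not preserve the original quoting.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_aiida_storage() {
    # Start the PostgreSQL server on the port the profile expects.
    run pg_ctl -D "$PGDATA_DIR" -l "$PGLOG_FILE" -o "-p $PG_PORT" start || return 1
    # verdi status suggests starting (or stopping) the daemon to clear a stale PID file.
    run verdi daemon start || return 1
    # Confirm that storage and daemon are reachable again.
    run verdi status
}
```

Calling `DRY_RUN=1; restart_aiida_storage` only prints the commands, which is a cheap way to check the paths before running it for real. Note also that the prompts in the original post show two different login nodes (ch-fe2 before, ch-fe1 after). A server that listens on localhost on one node cannot be reached as localhost from another, so it may be worth running this on the same node each time.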

Hi @sphuber

What do you mean when you say “install Postgres as a service”?

I restarted the services with the following commands:

pg_ctl -D /lustre/scratch5/$USER/aiidaDB -l /lustre/scratch5/$USER/aiidaLOG -o "-F -p 5434" start

rabbitmq-server -detached

verdi daemon start

But now there is another problem, this time with RabbitMQ:

(aiidaENV) rkarkee@ch-fe1:~> verdi status
 ✔ version:     AiiDA v2.5.1
 ✔ config:      /users/rkarkee/.aiida
 ✔ profile:     quicksetup
 ✔ storage:     Storage for 'quicksetup' [open] @ postgresql://aiida_qs_rkarkee_20554bcc4bead70a3479c4ef8d5f1f4e:***@localhost:5434/quicksetup_rkarkee_20554bcc4bead70a3479c4ef8d5f1f4e / DiskObjectStoreRepository: 362d855165524c4f94089555080ab634 | /users/rkarkee/.aiida/repository/quicksetup/container
 ✘ rabbitmq:    Unable to connect to rabbitmq with URL: amqp://guest:guest@127.0.0.1:5672?heartbeat=600
Error: ConnectionError: [Errno 111] Connect call failed ('127.0.0.1', 5672)
 ✔ daemon:      Daemon is running with PID 77107

Hi @rkarkee, it seems the rabbitmq-server -detached command failed to start RabbitMQ. Can you try running rabbitmq-server in the foreground and see what the error message is?

(You may want to check Unable to connect to rabbitmq - #5 by jgarridoa to see if it is a similar issue, or if the information there helps to solve your problem.)

Hi @jusong.yu

I think my error is similar to that post. But I do not have sudo access.

(aiidaENV) rkarkee@ch-fe1:~/q-e-qe-7.3/QERaman/bin> rabbitmq-server
Configuring logger redirection
16:08:14.428 [error]

16:08:14.436 [error] BOOT FAILED
BOOT FAILED
16:08:14.437 [error] ===========
===========
16:08:14.437 [error] ERROR: node with name "rabbit" is already running on host "ch-fe1"
ERROR: node with name "rabbit" is already running on host "ch-fe1"
16:08:14.437 [error]

16:08:15.438 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {duplicate_node_name,"rabbit","ch-fe1"} in context start_error
16:08:15.438 [error] CRASH REPORT Process <0.153.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{duplicate_node_name,"rabbit","ch-fe1"}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{duplicate_node_name,\"rabbit\",\"ch-fe1\"}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{duplicate_node_name,"rabbit","ch-fe1"}}},{rabbit_prelaunch_ap

Crash dump is being written to: erl_crash.dump...done

If your RabbitMQ was also installed with conda, you can try rabbitmqctl stop and then run rabbitmq-server to start it again.
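
The stop-and-start suggestion above can be sketched as a small helper that also waits until the node answers. This is only a sketch: `restart_rabbitmq`, `run`, `MAX_TRIES` and `WAIT_SECONDS` are hypothetical names, and it assumes that `rabbitmqctl` and `rabbitmq-server` come from the same conda environment.

```shell
#!/bin/sh
# Sketch only: stop a user-level RabbitMQ node, start it again, and wait for it.
MAX_TRIES="${MAX_TRIES:-10}"        # hypothetical: number of status polls
WAIT_SECONDS="${WAIT_SECONDS:-3}"   # hypothetical: pause between polls

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_rabbitmq() {
    # Stopping may fail if no node is reachable; carry on in that case.
    run rabbitmqctl stop || true
    run rabbitmq-server -detached || return 1
    # Poll until `rabbitmqctl status` succeeds or we give up.
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if run rabbitmqctl status >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$WAIT_SECONDS"
    done
    return 1
}
```

A non-zero return value means the node never answered within the polling window. In that case running rabbitmq-server in the foreground, as asked above, shows the actual boot error.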

Hi @jusong.yu
I did that, but it says it is unable to perform the operation.

(aiidaENV) rkarkee@ch-fe1:~> rabbitmqctl stop
Stopping and halting node rabbit@ch-fe1 ...
Error: unable to perform an operation on node 'rabbit@ch-fe1'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@ch-fe1
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@ch-fe1']

rabbit@ch-fe1:
  * connected to epmd (port 4369) on ch-fe1
  * epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
  * can't establish TCP connection to the target node, reason: econnrefused (connection refused)
  * suggestion: check if host 'ch-fe1' resolves, is reachable and ports 25672, 4369 are not blocked by firewall

Current node details:
 * node name: 'rabbitmqcli-541-rabbit@ch-fe1'
 * effective user's home directory: /users/rkarkee
 * Erlang cookie hash: +CxQENXwMARpmfV6sLK6rQ==
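
The diagnostics above say that epmd lists a node named rabbit on port 25672 while the connection to that port is refused. One possible follow-up, offered as an assumption and not as something confirmed in this thread, is to look at what epmd has registered and whether an Erlang VM process owned by your own user is still around. The sketch below relies on `epmd -names` printing lines of the form `name <node> at port <port>`; `registered_port` and `list_own_beam_processes` are hypothetical helper names.

```shell
#!/bin/sh
# Print the port that epmd has registered for a given Erlang node name.
# Reads `epmd -names` output on stdin; prints nothing if the node is not listed.
registered_port() {
    awk -v node="$1" '$1 == "name" && $2 == node { print $5 }'
}

# List Erlang VM (beam) processes owned by the current user, if any.
# The [b] in the pattern keeps grep from matching its own command line.
list_own_beam_processes() {
    ps -u "$(id -un)" -o pid=,args= | grep '[b]eam'
}
```

Typical usage would be `epmd -names | registered_port rabbit` followed by `list_own_beam_processes`. If epmd lists the node but no beam process belongs to you, the node may have been started by someone else or on another login node, which would be a question for the cluster administrators.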

Hi @rkarkee, since you don’t have sudo on the machine, I guess you are running on a cluster or at a supercomputing center? If so, can you ask the cluster administrators to manage the services for you? From the error message “can’t establish TCP connection to the target node, reason: econnrefused (connection refused)” and from the fact that you were able to set up RabbitMQ before, my guess is that the cluster admins stopped the service and blocked it.

For your reference, PostgreSQL and RabbitMQ are services that need to keep running. Supercomputing centers usually don’t allow users to run this kind of service for a long time.