Error reading SSH protocol banner but verdi computer test passes successfully

I am trying to run a code on a remote computer (tigu) from an aiida-core-with-services container. The calculation fails to upload to the remote machine due to the following error:

 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2271, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/packet.py", line 622, in _read_timeout
 |     raise socket.timeout()
 | socket.timeout
 | 
 | During handling of the above exception, another exception occurred:
 | 
 | Traceback (most recent call last):
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 202, in exponential_backoff_retry
 |     result = await coro()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 85, in do_upload
 |     transport = await cancellable.with_interrupt(request)
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/utils.py", line 112, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/engine/transports.py", line 86, in do_open
 |     transport.open()
 |   File "/opt/conda/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 497, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/client.py", line 421, in connect
 |     t.start_client(timeout=timeout)
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 699, in start_client
 |     raise e
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2094, in run
 |     self._check_banner()
 |   File "/opt/conda/lib/python3.9/site-packages/paramiko/transport.py", line 2275, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
+-> WARNING at 2024-02-20 09:00:33.293786+00:00
 | maximum attempts 5 of calling do_upload, exceeded
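For context, the `exponential_backoff_retry` call in the traceback retries the upload coroutine with exponentially growing delays before giving up, which is why the log ends with "maximum attempts 5 ... exceeded". Here is a minimal stdlib-only sketch of that pattern (illustrative only; the names and defaults are not AiiDA's actual signature):

```python
import asyncio

async def exponential_backoff_retry(coro_factory, initial_interval=0.01, max_attempts=5):
    """Retry an async callable, doubling the wait after each failure.

    Sketch of the pattern used in aiida.engine.utils, not the real code.
    """
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                # All attempts exhausted: surface the failure, as in the log above.
                raise RuntimeError(f'maximum attempts {max_attempts} exceeded')
            await asyncio.sleep(interval)
            interval *= 2

# Example: a flaky operation that succeeds on the third try.
calls = {'n': 0}

async def flaky_upload():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('Error reading SSH protocol banner')
    return 'uploaded'

result = asyncio.run(exponential_backoff_retry(flaky_upload))
print(result, calls['n'])  # uploaded 3
```

So a transient SSH failure is retried a few times, but a persistently stale connection (as discussed below) exhausts all attempts and pauses the process.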

I am able to SSH into the remote machine from the container without any issues. Testing the computer also succeeds:

verdi computer test tigu
Report: Testing computer<tigu> for user<aiida@localhost>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... [OK]: 1 jobs found in the queue
* Determining remote user name... [OK]: jayn
* Creating and deleting temporary file... [OK]
* Checking for possible delay from using login shell... [OK]
Success: all 6 tests succeeded

The calculations were able to upload to the remote machine and run successfully until yesterday. Any ideas?

Try restarting the daemon with verdi daemon restart --reset. I think the key agent may have expired, so it can no longer authenticate. After restarting the daemon, run verdi process play --all to resume all paused processes.

Thanks! That did it. What exactly does the --reset flag do with regards to the key?

@sphuber, in view of these issues with the current SSH plugin, shall we try to complete the work on the Fabric one? If you need any assistance with testing or anything else, let me know.

I am not sure the --reset flag is even necessary; plain verdi daemon restart would probably have been enough. When you restart the daemon, the workers, which are system processes, are restarted as well, and their state is refreshed. I suppose the connection to the key agent is refreshed along with it.

The current state is that @mbercx wanted to continue testing it. A number of small issues have been found when testing connections to the various machines he uses. Once testing is done, nothing is blocking it as far as I can tell.

Note that it is not immediately obvious to me that this particular issue would have been solved by switching to fabric though. Since this is a problem with paramiko and fabric still uses paramiko under the hood, it seems likely the same problem would arise.

I summarised in the issue that there is a discrepancy between doing a verdi computer test and running a calculation. I think this discrepancy is at the root of the problem.

@Anooja_Jayaraj mentioned that the verdi computer test went fine. To my understanding, when running this command, AiiDA reads the connection information from the .ssh/config file. At the same time, the calculation raised a connection problem, which suggests it did not use precisely the same configuration.

So, I guess it is not about the choice of a library (fabric or paramiko) but more about how it is used. If the transport plugin always used a single configuration source (ideally ~/.ssh/config), there should be no such discrepancies.
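To make the single-source idea concrete: the transport could resolve all per-host options from an OpenSSH-style config instead of keeping its own copy of the settings. Below is a minimal stdlib-only sketch of reading such options (paramiko ships a full parser in paramiko.config.SSHConfig that handles Match blocks, wildcards, quoting, etc.; the hostname here is a made-up example, only the user jayn comes from the thread):

```python
def parse_ssh_config(text):
    """Parse a simplified OpenSSH config into {host: {option: value}}.

    Illustration only; real configs need paramiko's SSHConfig or equivalent.
    """
    hosts, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comments
        key, _, value = line.partition(' ')
        key = key.lower()
        if key == 'host':
            current = hosts.setdefault(value.strip(), {})
        elif current is not None:
            current[key] = value.strip()
    return hosts

config = parse_ssh_config("""
# Remote cluster used by the tigu computer (hostname is hypothetical)
Host tigu
    HostName tigu.example.org
    User jayn
    ServerAliveInterval 60
""")
print(config['tigu']['serveraliveinterval'])  # 60
```

If both verdi computer test and the daemon workers resolved their options through one such lookup, they could not silently diverge.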

How does this explain that it was working before, stopped working after a certain amount of time, and that just restarting the daemon fixes the problem? We have seen this problem before, where the authentication of a daemon worker goes stale over time. I don’t see how your analysis accounts for that.

@Anooja_Jayaraj by the way, is the machine that AiiDA is running on one that you connect to over SSH? I just remembered that I had a similar problem myself long ago. AiiDA was running on a machine that I SSH’ed into from my own desktop. Whenever I disconnected the shell in which I ran verdi daemon start, the connection started failing. This is why I believe the issue is due to the agent. The solution was to either a) keep the shell that started the daemon open (not very practical), or b) use something like screen to keep the shell running in the background even after you disconnect. This fixed the problem of having to restart the daemon after I disconnected.

AiiDA is running on an aiida-core-with-services container on my laptop.

Ok, that may have a similar effect. If you open an interactive shell in the container, start the daemon there, and then close the shell, the daemon may lose authentication access to the remote cluster. This should be easy to test: open a shell, start the daemon, check that processes are running as intended, then close the shell. After a while, open a new shell and check whether calcjobs are failing to connect. I am not sure what the equivalent of the screen solution would be for Docker containers. @jusong.yu do you have experience with this perhaps?

I didn’t encounter the issue.

BTW, if the daemon is started in the shell opened by docker run or docker start, that is the container's main shell session, and logging out from there will stop the container entirely. So I don’t think this is really related to the container either.