Why am I getting "Transport task update was cancelled"?

I’m running Gaussian calculations on a remote cluster (with aiida-gaussian), and I’ve been seeing processes get paused with the message “Transport task update was cancelled”.

I can resume them with verdi process play my_pk, but I’m wondering why this might be occurring and whether there is anything I can change in the configuration to prevent it.

Further, is there a way to do the process play from Python? Then I could, say, have a loop that checks for paused processes and plays any it finds.

Thanks

Hi @kmlefran, to understand the cause behind the cancelled tasks, I would need more information. Could you please share the output of verdi process report? That should contain more details.

As for playing paused processes from Python, see the docs on process control API: Usage — AiiDA 2.5.0.post0 documentation
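A minimal sketch of such a check, assuming AiiDA 2.x, where `aiida.engine.processes.control.play_processes` is available and paused process nodes carry a `paused` attribute that can be queried. The helper names (`find_paused_calcjobs`, `play_nodes`, `resume_paused`) are made up for illustration and are not part of AiiDA:

```python
from typing import Callable, Iterable, List


def find_paused_calcjobs() -> List:
    """Return active CalcJobNodes that are currently paused.

    Assumes an AiiDA profile has already been loaded.
    """
    from aiida import orm  # lazy import: only needed when actually querying

    builder = orm.QueryBuilder()
    builder.append(
        orm.CalcJobNode,
        filters={
            "attributes.paused": True,
            # Skip terminated processes; they cannot be played.
            "attributes.process_state": {"in": ["created", "waiting", "running"]},
        },
    )
    return builder.all(flat=True)


def play_nodes(nodes: Iterable) -> None:
    """Ask the daemon to resume the given process nodes."""
    from aiida.engine.processes import control  # lazy import, see above

    control.play_processes(list(nodes))


def resume_paused(
    find: Callable[[], Iterable] = find_paused_calcjobs,
    play: Callable[[Iterable], None] = play_nodes,
) -> int:
    """Play every paused process found and return how many were played."""
    nodes = list(find())
    if nodes:
        play(nodes)
    return len(nodes)
```

In a script you would call `aiida.load_profile()` once and then call `resume_paused()` at whatever interval suits you, for example every few minutes with `time.sleep` in between. The `find` and `play` arguments are only there so the loop logic can be tried out without a running daemon.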

Here is one that is currently running, but was paused at one point. Since it’s so long, I’ve had to remove some of the start.

*** 331: CalcJobState.WITHSCHEDULER, scheduler state: JobState.RUNNING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 14 LOG MESSAGES:
+-> ERROR at 2024-01-14 17:02:26.235864-05:00
packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
| transport = await request
| ^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 287, in __await__
| yield self # This tells Task to wait for completion.
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 339, in __wakeup
| future.result()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/transports.py”, line 89, in do_open
| transport.open()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/transports/plugins/ssh.py”, line 516, in open
| self._client.connect(self._machine, **connection_arguments)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 450, in connect
| self._auth(
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 781, in _auth
| raise saved_exception
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 774, in _auth
| self._transport.auth_interactive_dumb(username)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/transport.py”, line 1711, in auth_interactive_dumb
| return self.auth_interactive(username, handler, submethods)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/transport.py”, line 1688, in auth_interactive
| return self.auth_handler.wait_for_response(my_event)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/auth_handler.py”, line 249, in wait_for_response
| raise AuthenticationException(“Authentication timeout.”)
| paramiko.ssh_exception.AuthenticationException: Authentication timeout.
+-> ERROR at 2024-01-15 04:03:40.447453-05:00
| Traceback (most recent call last):
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 187, in exponential_backoff_retry
| result = await coro()
| ^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/tasks.py”, line 193, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 94, in with_interrupt
| result = await next(wait_iter)
| ^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 605, in _wait_for_one
| return f.result() # May raise f.exception().
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 132, in _update_job_info
| self._jobs_cache = await self._get_jobs_from_scheduler()
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 98, in _get_jobs_from_scheduler
| transport = await request
| ^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 287, in __await__
| yield self # This tells Task to wait for completion.
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 339, in __wakeup
| future.result()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/transports.py”, line 89, in do_open
| transport.open()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/transports/plugins/ssh.py”, line 516, in open
| self._client.connect(self._machine, **connection_arguments)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 358, in connect
| retry_on_signal(lambda: sock.connect(addr))
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/util.py”, line 279, in retry_on_signal
| return function()
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 358, in <lambda>
| retry_on_signal(lambda: sock.connect(addr))
| ^^^^^^^^^^^^^^^^^^
| TimeoutError: [Errno 60] Operation timed out
+-> ERROR at 2024-01-15 04:07:13.777695-05:00
| Traceback (most recent call last):
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 187, in exponential_backoff_retry
| result = await coro()
| ^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/tasks.py”, line 193, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 94, in with_interrupt
| result = await next(wait_iter)
| ^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 605, in _wait_for_one
| return f.result() # May raise f.exception().
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 132, in _update_job_info
| self._jobs_cache = await self._get_jobs_from_scheduler()
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 98, in _get_jobs_from_scheduler
| transport = await request
| ^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 287, in __await__
| yield self # This tells Task to wait for completion.
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 339, in __wakeup
| future.result()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/transports.py”, line 89, in do_open
| transport.open()
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/transports/plugins/ssh.py”, line 516, in open
| self._client.connect(self._machine, **connection_arguments)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 358, in connect
| retry_on_signal(lambda: sock.connect(addr))
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/util.py”, line 279, in retry_on_signal
| return function()
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/paramiko/client.py”, line 358, in <lambda>
| retry_on_signal(lambda: sock.connect(addr))
| ^^^^^^^^^^^^^^^^^^
| TimeoutError: [Errno 60] Operation timed out
+-> WARNING at 2024-01-15 04:07:13.787168-05:00
| maximum attempts 5 of calling do_update, exceeded
+-> ERROR at 2024-01-16 02:31:40.637042-05:00
| Traceback (most recent call last):
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 187, in exponential_backoff_retry
| result = await coro()
| ^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/tasks.py”, line 193, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/utils.py”, line 94, in with_interrupt
| result = await next(wait_iter)
| ^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/tasks.py”, line 605, in _wait_for_one
| return f.result() # May raise f.exception().
| ^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/asyncio/futures.py”, line 203, in result
| raise self._exception.with_traceback(self._exception_tb)
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 132, in _update_job_info
| self._jobs_cache = await self._get_jobs_from_scheduler()
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/engine/processes/calcjobs/manager.py”, line 109, in _get_jobs_from_scheduler
| scheduler_response = scheduler.get_jobs(**kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/schedulers/scheduler.py”, line 361, in get_jobs
| joblist = self._parse_joblist_output(retval, stdout, stderr)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File “/Users/chemlab/anaconda3/envs/aiida3/lib/python3.11/site-packages/aiida/schedulers/plugins/slurm.py”, line 476, in _parse_joblist_output
| raise SchedulerError(
| aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
| stdout=‘’
| stderr=‘slurm_load_jobs error: Unable to contact slurm controller (connect failure)’

The last lines of the report contain the hint to the problem:

| raise SchedulerError(
| aiida.schedulers.scheduler.SchedulerError: squeue returned exit code 1 (_parse_joblist_output function)
| stdout=‘’
| stderr=‘slurm_load_jobs error: Unable to contact slurm controller (connect failure)’

The engine tries to update the status of the job, so it contacts SLURM (the scheduler you configured for your computer). This fails, and after a few retries the engine pauses the job. It pauses the job rather than failing it precisely because the problem is probably transient: it can be resolved, or will resolve itself, and has nothing to do with the job itself.

The question now is why your SLURM scheduler fails to respond. Is the scheduler on shared compute infrastructure? Are there many users, or users running a lot of jobs? Perhaps the scheduler is under heavy load and cannot always respond. This would not be the first time we have seen this with SLURM. I am afraid that there is not much that AiiDA can do about this.
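To illustrate the retry-then-pause behaviour (the report shows `exponential_backoff_retry` in the traceback and "maximum attempts 5 of calling do_update, exceeded"), here is a rough sketch of retrying with exponential backoff. It is not AiiDA's actual implementation: the function name `retry_with_backoff`, the 20-second starting interval and the doubling factor are assumptions made for the example. As far as I know, AiiDA exposes the configuration options `transport.task_maximum_attempts` and `transport.task_retry_initial_interval`, which may let you stretch the retry window; check `verdi config list` for what your version offers.

```python
import asyncio


async def retry_with_backoff(task, initial_interval=20.0, max_attempts=5,
                             sleep=asyncio.sleep):
    """Await `task()` until it succeeds or `max_attempts` is reached.

    The wait between attempts doubles each time. After the final failed
    attempt the last exception is re-raised, which is the point where an
    engine could decide to pause the process instead of failing it.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: the caller decides what to do next
            await sleep(interval)
            interval *= 2  # back off: wait twice as long before the next try
```

With these assumed values, five attempts span 20 + 40 + 80 + 160 seconds, so about five minutes of scheduler downtime would be enough to exhaust the retries. Raising the attempt count or the initial interval widens that window, at the cost of slower status updates while the scheduler is unreachable.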


Yeah it’s used by lots of people. Was just thinking maybe there was something that could be done on the AiiDA end configuring the computer. I only had 10 jobs running on it. Ah well, I’ll just add in some code to check for pauses.

Thanks!
