Odd socket timeout error

I’m getting a paramiko TimeoutError from the following block:

        while True:
            chunk_exists = False

            if stdout.channel.recv_ready():  # True means that the next .read call will at least receive 1 byte
                chunk_exists = True
                try:
                    piece = stdout.read(internal_bufsize)
                    stdout_bytes.append(piece)
                except socket.timeout:
                    # There was a timeout: I continue as there should still be data
                    pass

            if stderr.channel.recv_stderr_ready():  # True means that the next .read call will at least receive 1 byte
                chunk_exists = True
                try:
                    piece = stderr.read(internal_bufsize)
                    stderr_bytes.append(piece)
                except socket.timeout:
                    # There was a timeout: I continue as there should still be data
                    pass

            # If chunk_exists, there is data (either already read and put in the std*_bytes lists, or
            # still in the buffer because of a timeout). I need to loop.
            # Otherwise, there is no data in the buffers, and I enter this block.
            if not chunk_exists:
                # Both channels have no data in the buffer
                if channel.exit_status_ready():
                    # The remote execution is over

                    # I think that in some corner cases there might still be some data,
                    # in case the data arrived between the previous calls and this check.
                    # So we do a final read. Since the execution is over, I think all data is in the buffers,
                    # so we can just read the whole buffer without loops
                    stdout_bytes.append(stdout.read())
                    stderr_bytes.append(stderr.read())
                    # And we go out of the `while True` loop
                    break
                # The exit status is not ready:
                # I just put a small sleep to avoid infinite fast loops when data
                # is not available on a slow connection, and loop
                time.sleep(0.01)

        # I get the return code (blocking)
        # However, if I am here, the exit status is ready so this should be returning very quickly
        retval = channel.recv_exit_status()

…specifically from stdout_bytes.append(stdout.read()).

I’m a bit confused. My report does not have any output from stdout, so I suppose this error was raised early in the calculation. For this line to trigger, both not chunk_exists and channel.exit_status_ready() must be true - i.e. there is no data and “the remote execution is over”? But the job is still running (and producing data) on the remote.

@giovannipizzi this was last touched by you. Any ideas? Your comments above the error-triggering line suggest there might still be some data in the buffer? Maybe this is such a corner case? If so, why was it decided not to guard this read against socket.timeout?
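
To make my question concrete, below is roughly the kind of guard I would have expected around that final read. It is just a sketch of mine (the helper name and logic are made up, not the actual AiiDA code), reusing the stdout/stderr objects and the std*_bytes lists from the snippet above:

    import socket

    def drain_after_exit(stream, chunks, bufsize=4096):
        """Sketch: drain what is left on a paramiko ChannelFile once
        exit_status_ready() is True, tolerating a socket.timeout instead of crashing.

        `stream` would be the stdout/stderr object from the snippet above,
        `chunks` the corresponding stdout_bytes/stderr_bytes list.
        Returns True on EOF, False if a timeout was hit and the caller should
        keep looping instead of breaking out of the `while True`.
        """
        try:
            while True:
                piece = stream.read(bufsize)
                if not piece:
                    return True  # EOF: the channel is closed and fully drained
                chunks.append(piece)
        except socket.timeout:
            return False  # data may still be in flight; retry on the next iteration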

I think I need a bit more info.
Do you have a full traceback? (Maybe you can still find it in the daemon logs, inside your .aiida folder, in the subfolder of the correct profile.)

Also, is this error reproducible, or did it just happen randomly once? A TimeoutError could just be a connection error. If this is the case, did the corresponding calculation except, or did it just use the exponential backoff mechanism to retry?
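
Just to clarify what I mean by the exponential backoff: conceptually the daemon wraps these transport calls in something like the following (a very simplified sketch of the idea, not the actual AiiDA implementation):

    import time

    def run_with_backoff(fn, max_attempts=5, initial_interval=10.0):
        """Simplified sketch of an exponential-backoff retry: call `fn`, and on a
        transient failure (e.g. a connection or timeout error) wait, double the
        interval, and try again, up to `max_attempts` times before giving up.
        """
        interval = initial_interval
        for attempt in range(max_attempts):
            try:
                return fn()
            except OSError:  # socket.timeout is a subclass of OSError
                if attempt == max_attempts - 1:
                    raise  # all attempts exhausted: let the error propagate
                time.sleep(interval)
                interval *= 2

So a single transient timeout should normally just lead to a retry, not directly to an excepted calculation.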

Note that this call is used not only to parse the job output, but also every X seconds (X~30?) to check the output of the scheduler (the output of squeue, to be clear, if you are using SLURM) to decide if the job has finished.
I guess this is the case for you, since the job is still running. So hopefully AiiDA didn’t lose the job; it just missed one of the periodic checks of the scheduler status, and should correctly pick up the status later when the job actually finishes.

(One other option might be that the squeue command is just super slow in replying?)

It would be good to know where the error appeared, and whether it caused the calculation to go into an excepted state or the error was just logged, to understand if there is anything to improve in the error-catching mechanism (and if this can be reproduced).