bug-guix
bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-


From: Maxim Cournoyer
Subject: bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
Date: Tue, 25 May 2021 23:18:17 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

>>  (define* (read-repl-response port #:optional inferior)
>>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>>  Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>>    (match (read port)
>>      (('values objects ...)
>>       (apply values (map sexp->object objects)))
>> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
>> +    ;; the underlying connection for PORT was lost with Guile-SSH.
>> +    (? eof-object?
>> +       (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
>  ((? eof-object?)
>   (raise …))

Good catch, fixed.
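
For reference, here is a minimal, self-contained sketch of the corrected
clause shape, using only (ice-9 match) and a string port.  The procedure
name 'classify-response' and the symbols it returns are made up for
illustration; the real code raises '&inferior-connection-lost' instead.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: classify what 'read' returns from PORT.
(define (classify-response port)
  (match (read port)
    (('values objects ...)
     (list 'values objects))
    ;; The predicate pattern must be wrapped in its own parentheses;
    ;; the clause body follows it.
    ((? eof-object?)
     'connection-lost)
    (other
     (list 'unexpected other))))

;; A well-formed response, then an empty port that yields EOF right away.
(classify-response (open-input-string "(values 1 2)"))
(classify-response (open-input-string ""))
```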

>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>            (length machines) machine-file)
>> -    (let* ((names    (map build-machine-name machines))
>> -           (sockets  (map build-machine-daemon-socket machines))
>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> -           (nodes    (map remote-inferior sessions)))
>> -      (for-each assert-node-has-guix nodes names)
>> -      (for-each assert-node-repl nodes names)
>> -      (for-each assert-node-can-import sessions nodes names sockets)
>> -      (for-each assert-node-can-export sessions nodes names sockets)
>> -      (for-each close-inferior nodes)
>> -      (for-each disconnect! sessions))))
>> +    (par-for-each check-machine-availability machines)))
>
> Why not!  IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related: retrying all the checks for *every* offload
build machine would be too expensive.  It already takes 32 s for my 4
offload machines; retrying that up to 3 times would mean waiting about a
minute and a half, which I don't find reasonable (imagine on berlin!).
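
To illustrate the per-machine approach, here is a rough sketch of
'par-for-each' from (ice-9 threads) running one check per machine.
The 'check-machine' procedure, the machine names, and the sleep that
stands in for network latency are all hypothetical; this is not the
code from the patch.

```scheme
(use-modules (ice-9 threads))

(define results '())
(define results-lock (make-mutex))

;; Hypothetical stand-in for a per-machine availability check.
(define (check-machine name)
  (usleep 100000)                       ;pretend to talk to the machine
  (with-mutex results-lock
    (set! results (cons name results))))

;; The checks may run concurrently, so one slow or retried machine does
;; not force the checks for all the other machines to be repeated.
(par-for-each check-machine '("m1" "m2" "m3" "m4"))
```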

>> +(define (check-machine-availability machine)
>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>> +  ;; connection was lost.  Retry up to 3 times.
>> +  (let loop ((retries 3))
>> +    (guard (c ((inferior-connection-lost? c)
>> +               (let ((retries-left (1- retries)))
>> +                 (if (> retries-left 0)
>> +                     (begin
>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine ~s lost; retrying~%")
>> +                               (build-machine-name machine))
>> +                       (loop (retries-left)))
>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> +                            (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then realized that even if this is
papering over a problem, it's a good one to paper over, as it
can legitimately happen in practice due to the network's
inherently shaky nature.  It seems better to be ready for it.  Also, my
hopes of being able to troubleshoot such a difficult-to-reproduce
networking issue are rather low.
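
As a standalone illustration of the retry idea, here is a sketch built
on SRFI-34 and SRFI-35 alone.  The '&connection-lost' condition type,
'call-with-retries', and 'flaky-check' are hypothetical stand-ins, not
the names used in the patch; note that the remaining count is passed to
'loop' as a plain variable.

```scheme
(use-modules (srfi srfi-34)             ;'guard' and 'raise'
             (srfi srfi-35))            ;condition types

;; Hypothetical stand-in for '&inferior-connection-lost'.
(define-condition-type &connection-lost &condition
  connection-lost?)

(define (call-with-retries thunk max-attempts)
  "Call THUNK; if it raises '&connection-lost', try again, for at most
MAX-ATTEMPTS attempts in total.  Return 'gave-up if they all fail."
  (let loop ((attempts-left max-attempts))
    (guard (c ((connection-lost? c)
               (if (> attempts-left 1)
                   (loop (1- attempts-left))
                   'gave-up)))
      (thunk))))

;; Simulate a connection that is lost twice before it works.
(define failures 2)
(define (flaky-check)
  (if (> failures 0)
      (begin
        (set! failures (1- failures))
        (raise (condition (&connection-lost))))
      'ok))

(call-with-retries flaky-check 3)
```

Any condition other than '&connection-lost' is re-raised by 'guard', so
unrelated errors are not swallowed by the retry loop.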

> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue?  If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger it; it seems to happen mostly on the
first try after a long time without connecting to the machine; on the
2nd and later tries, everything is smooth.  Waiting a few minutes is not
enough to re-trigger the problem.

I've managed to see the problem a few lucky times with:

--8<---------------cut here---------------start------------->8---
while true; do guix offload test /etc/guix/machines.scm overdrive1; done
--8<---------------cut here---------------end--------------->8---

I don't have a password set for my user on overdrive1, so I can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.

Attachment: 0001-offload-Handle-a-possible-EOF-response-from-read-rep.patch
Description: Text Data

Thanks!

Maxim
