

From: Ludovic Courtès
Subject: bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response.
Date: Tue, 25 May 2021 22:27:02 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Hi!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)
> +
>  (define* (read-repl-response port #:optional inferior)
>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>  Raise '&inferior-exception' when an exception is read from PORT."
> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>    (match (read port)
>      (('values objects ...)
>       (apply values (map sexp->object objects)))
> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
> +    ;; the underlying connection for PORT was lost with Guile-SSH.
> +    (? eof-object?
> +       (raise (condition (&inferior-connection-lost))))

The match clause syntax is incorrect: the ‘(? eof-object?)’ pattern needs
its own pair of parentheses.  It should be:

 ((? eof-object?)
  (raise …))
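
For illustration, here is a minimal sketch of a predicate pattern in
(ice-9 match).  The ‘classify-response’ procedure is hypothetical and not
part of the patch; it returns symbols instead of raising conditions so that
it stays self-contained.  It only shows that ‘(? eof-object?)’ is the whole
pattern, with the body following it inside the clause:

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper, for illustration only: classify what 'read'
;; returned from a port.
(define (classify-response sexp)
  (match sexp
    (('values objects ...)
     'values)
    ;; The predicate pattern (? PREDICATE) is wrapped in its own
    ;; parentheses; the clause body comes after it.
    ((? eof-object?)
     'connection-lost)
    (_
     'other)))

;; Reading from an empty port yields the EOF object.
(classify-response (read (open-input-string "")))
(classify-response '(values 1 2))
```

Without the inner parentheses, the clause would presumably be read as a
pattern made of the bare symbol ‘?’ followed by body expressions, which is
not what is intended.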

> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>            (length machines) machine-file)
> -    (let* ((names    (map build-machine-name machines))
> -           (sockets  (map build-machine-daemon-socket machines))
> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
> -           (nodes    (map remote-inferior sessions)))
> -      (for-each assert-node-has-guix nodes names)
> -      (for-each assert-node-repl nodes names)
> -      (for-each assert-node-can-import sessions nodes names sockets)
> -      (for-each assert-node-can-export sessions nodes names sockets)
> -      (for-each close-inferior nodes)
> -      (for-each disconnect! sessions))))
> +    (par-for-each check-machine-availability machines)))

Why not!  IMO this should go in a separate patch, though, since it’s not
related.

> +(define (check-machine-availability machine)
> +  "Check whether MACHINE is available.  Exit with an error upon failure."
> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
> +  ;; connection was lost.  Retry up to 3 times.
> +  (let loop ((retries 3))
> +    (guard (c ((inferior-connection-lost? c)
> +               (let ((retries-left (1- retries)))
> +                 (if (> retries-left 0)
> +                     (begin
> +                       (format (current-error-port)
> +                               (G_ "connection to machine ~s lost; retrying~%")
> +                               (build-machine-name machine))
> +                       (loop (retries-left)))
> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
> +                            (build-machine-name machine))))))

I’m afraid we’re papering over problems here.

Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
berlin enough to reproduce the issue?  If so, we could monitor/strace
sshd on overdrive1 to get a better understanding of what’s going on.

WDYT?

Thanks,
Ludo’.




