bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#57922: Shepherd doesn't seem to correctly handle waitpid itself


From: Josselin Poiret
Subject: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself
Date: Tue, 20 Sep 2022 09:31:57 +0200

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> Hi,
>
> I've tried to determine why a workaround in the jami-service-type is
> required in the 'stop' slot to avoid failures in 'herd restart jami',
> and haven't quite found the culprit, but it appears to me that:
>
> 1. waipid is only called in one place in Shepherd, which is in the
> handle-SIGCHLD procedure in (shepherd service), which does not
> specifically wait for an exact PID but rather does:
>
> (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
> in the case a system-error exception is thrown with an ECHILD or EINTR
> error number.
>
> This doesn't strike me as a strong guarantee that waitpid occurs when
> stop is called, because:
>
> 1. It requires to be installed in the signal handlers for each
> processes, with something like:
>
> --8<---------------cut here---------------start------------->8---
>   (unless %sigchld-handler-installed?
>     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>     (set! %sigchld-handler-installed? #t))
> --8<---------------cut here---------------end--------------->8---
>
> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
> for make-forkexec-constructor/container, AFAICT;

The signal handler is only installed once in PID 1 (in fact, you haven't
forked yet here), since it's the one that receives the SIGCHLD.

What I don't understand that well is that this signal handler could be
installed only once when shepherd starts, right?  That way, it wouldn't
need to depend on specific start actions being chosen.

> 2. it has the WNOHANG flag, which means the stop simply does a kill the
> the signal handling weakly (because of WNOHANG) waits on it, which means
> the start may begin before the process was actually completely
> terminated.
>
> Here's a small reproducer to apply on our code base:
>
> --8<---------------cut here---------------start------------->8---
> modified   gnu/services/telephony.scm
> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>  
>                      ;; Finally, return the PID of the daemon process.
>                      daemon-pid))
> -               (stop
> -                #~(lambda (pid . args)
> -                    (kill pid SIGKILL)
> -                    ;; Wait for the process to exit; this prevents 
> overlapping
> -                    ;; processes when issuing 'herd restart'.
> -                    (waitpid pid)
> -                    #f))))))))
> +               (stop #~(make-kill-destructor))))))))
>  
>  (define jami-service-type
>    (service-type
> --8<---------------cut here---------------end--------------->8---

The real problem here is not really the WNOHANG flag (you could remove
that and still get issues) but rather that the waitpid is run inside a
signal handler, which in Guile means that it's run through asyncs.  You
have no guarantees wrt. when asyncs run, so they could run after or in
the middle of the next action.  I also think make-kill-destructor should
waitpid the processes it's killing, as you're implying, and leave the
signal handler only for unexpected service crashes.

Best,
-- 
Josselin Poiret





reply via email to

[Prev in Thread] Current Thread [Next in Thread]