guix-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[shepherd] 01/05: service: 'make-kill-destructor' waits for termination,


From: Ludovic Courtès
Subject: [shepherd] 01/05: service: 'make-kill-destructor' waits for termination, eventually SIGKILLs.
Date: Wed, 16 Nov 2022 18:00:58 -0500 (EST)

civodul pushed a commit to branch master
in repository shepherd.

commit d97592f58603ff51cb280ae57d413c8731e601b3
Author: Ludovic Courtès <ludo@gnu.org>
AuthorDate: Wed Nov 16 11:16:19 2022 +0100

    service: 'make-kill-destructor' waits for termination, eventually SIGKILLs.
    
    Fixes <https://issues.guix.gnu.org/58485>.
    Reported by Lars-Dominik Braun <lars@6xq.net>.
    
    Previously, the procedure returned by 'make-forkexec-constructor' was
    asynchronous: it did not wait for the process to actually terminate.
    This could lead to issues typically when running 'herd restart SVC',
    because the initial process might still be around by the time we start a
    new one.
    
    * modules/shepherd/service.scm (make-kill-destructor): Add #:grace-period.
    Use 'terminate-process' instead of 'kill'.
    (process-monitor): Add clause for 'await' request.  Adjust now that
    WAITERS can contain several waiters for a given PID.
    (default-process-termination-grace-period): New parameter.
    (get-message*, terminate-process): New procedures.
    (stop): When computing STOPPED-DEPENDENTS, arrange to call 'stop'
    outside of 'fold-services'.
    * tests/forking-service.sh: Add 'test4' service and exercise the grace
    period mechanism.
    * modules/shepherd.scm (handle-SIGINT): Wrap 'catch' expression in
    'spawn-fiber'.
    (signal-handler): Use 'handle-SIGINT' for SIGTERM and SIGHUP as well.
    * doc/shepherd.texi (Service De- and Constructors): Update
    'make-kill-destructor' documentation.  Document
    'default-process-termination-grace-period'.
---
 doc/shepherd.texi            |  26 ++++++-----
 modules/shepherd.scm         |  19 ++++----
 modules/shepherd/service.scm | 100 +++++++++++++++++++++++++++++++++++--------
 tests/forking-service.sh     |  32 +++++++++++---
 4 files changed, 135 insertions(+), 42 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 96daac8..d25a8f2 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -992,15 +992,12 @@ Guile's @code{setrlimit} procedure is applied on the 
entries in
 @end lisp
 @end deffn
 
-@deffn {procedure} make-kill-destructor [@var{signal}]
-Return a procedure that sends @var{signal} to the process group of the
-PID given as argument, where @var{signal} defaults to @code{SIGTERM}.
-
-This @emph{does} work together with respawning services,
-because in that case the @code{stop} method of the @code{<service>}
-class sets the @code{running} slot to @code{#f} before actually
-calling the destructor; if it would not do that, killing the process
-in the destructor would immediately respawn the service.
+@deffn {procedure} make-kill-destructor [@var{signal}] @
+  [#:grace-period (default-process-termination-grace-period)]
+Return a procedure that sends @var{signal} to the process group of the PID
+given as argument, where @var{signal} defaults to @code{SIGTERM}.  If the
+process is still running after @var{grace-period} seconds, send it
+@code{SIGKILL}.  The procedure returns once the process has terminated.
 @end deffn
 
 The @code{make-forkexec-constructor} procedure builds upon the following
@@ -1060,10 +1057,19 @@ returns when the program starts (@pxref{Runtime 
Environment,
 
 @defvr {Scheme Variable} default-pid-file-timeout
 This parameter (@pxref{Parameters,,, guile, GNU Guile Reference Manual})
-specified the default PID file timeout in seconds, when
+specifies the default PID file timeout in seconds, when
 @code{#:pid-file} is used (see above).  It defaults to 5 seconds.
 @end defvr
 
+@defvr {Scheme Variable} default-process-termination-grace-period
+This parameter (@pxref{Parameters,,, guile, GNU Guile Reference Manual})
+specifies the ``grace period'' (in seconds) after which a process that
+has been sent @code{SIGTERM} or some other signal to gracefully exit is
+sent @code{SIGKILL} for immediate termination.  It defaults to 5
+seconds.
+@end defvr
+
+
 @cindex on-demand, starting services
 @cindex inetd-style services
 One may also define services meant to be started @emph{on demand}.  In
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index dfa18f6..3e2e9ec 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -125,10 +125,15 @@ already ~a threads running, disabling 'signalfd' support")
 
 (define (handle-SIGINT)
   "Handle SIGINT by stopping the Shepherd, which means rebooting if we're PID 
1."
-  (catch 'quit
-    (lambda ()
-      (stop root-service))
-    quit-exception-handler))
+  ;; Since 'stop' is synchronous and may block until SIGCHLD has been received
+  ;; for the process it's waiting for, call it in a separate fiber so that
+  ;; signals are still being processed in the meantime.
+  (spawn-fiber
+   (lambda ()
+     (catch 'quit
+       (lambda ()
+         (stop root-service))
+       quit-exception-handler))))
 
 (define (signal-handler signal)
   "Return the signal handler for SIGNAL."
@@ -137,11 +142,7 @@ already ~a threads running, disabling 'signalfd' support")
         ((= signal SIGINT)
          (lambda _ (handle-SIGINT)))
         ((memv signal (list SIGTERM SIGHUP))
-         (lambda _
-           (catch 'quit
-             (lambda ()
-               (stop root-service))
-             quit-exception-handler)))
+         (lambda _ (handle-SIGINT)))
         (else
          (const #f))))
 
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 7779b8a..77f1078 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -27,7 +27,9 @@
   #:use-module ((fibers)
                 #:hide (sleep))
   #:use-module (fibers channels)
+  #:use-module (fibers operations)
   #:use-module (fibers scheduler)
+  #:use-module (fibers timers)
   #:use-module (oop goops)
   #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-9)
@@ -89,6 +91,7 @@
             default-environment-variables
             make-forkexec-constructor
             make-kill-destructor
+            default-process-termination-grace-period
             exec-command
             fork+exec-command
             default-pid-file-timeout
@@ -467,13 +470,16 @@ is not already running, and will return SERVICE's 
canonical name in a list."
             (local-output (l10n "Service ~a pending to be stopped.")
                           (canonical-name service))
             (list (canonical-name service)))
-          (let ((name (canonical-name service))
-                (stopped-dependents (fold-services (lambda (other acc)
-                                                     (if (and (running? other)
-                                                              (required-by? 
service other))
-                                                         (append (stop other) 
acc)
-                                                         acc))
-                                                   '())))
+          (let* ((name (canonical-name service))
+                 (dependents (fold-services (lambda (other lst)
+                                              (if (and (running? other)
+                                                       (required-by? service 
other))
+                                                  (cons other lst)
+                                                  lst))
+                                            '()))
+                 ;; Note: 'fold-services' introduces a continuation barrier,
+                 ;; which is why we're not using it when calling 'stop'.
+                 (stopped-dependents (append-map stop dependents)))
             ;; Stop the service itself.
             (catch #t
               (lambda ()
@@ -1242,9 +1248,13 @@ start."
              pid))
           pid))))
 
-(define* (make-kill-destructor #:optional (signal SIGTERM))
-  "Return a procedure that sends SIGNAL to the process group of the PID given
-as argument, where SIGNAL defaults to `SIGTERM'."
+(define* (make-kill-destructor #:optional (signal SIGTERM)
+                               #:key (grace-period
+                                      
(default-process-termination-grace-period)))
+  "Return a procedure that sends @var{signal} to the process group of the PID
+given as argument, where @var{signal} defaults to @code{SIGTERM}.  If the
+process is still running after @var{grace-period} seconds, send it
+@code{SIGKILL}.  The procedure returns once the process has terminated."
   (lambda (pid . args)
     ;; Kill the whole process group PID belongs to.  Don't assume that PID is
     ;; a process group ID: that's not the case when using #:pid-file, where
@@ -1253,8 +1263,10 @@ as argument, where SIGNAL defaults to `SIGTERM'."
     ;; will still be zero (the Shepherd PGID). In that case, use the PID.
     (let ((pgid (getpgid pid)))
       (if (= (getpgid 0) pgid)
-          (kill pid signal) ;don't kill ourself
-          (kill (- pgid) signal)))
+          (terminate-process pid signal           ;don't kill ourself
+                             #:grace-period grace-period)
+          (terminate-process (- pgid) signal
+                             #:grace-period grace-period)))
     #f))
 
 ;; Produce a constructor that executes a command.
@@ -1811,21 +1823,34 @@ otherwise by updating its state."
           (handle-service-termination service status)))
 
        ;; Notify any waiters.
-       (match (vhash-assv pid waiters)
-         (#f #f)
-         ((_ . waiter)
-          (put-message waiter status)))
+       (vhash-foldv* (lambda (waiter _)
+                       (put-message waiter status)
+                       #t)
+                     #t pid waiters)
 
        ;; XXX: The call below is linear in the size of WAITERS, but WAITERS is
        ;; usually empty or small.
-       (loop (vhash-delv pid waiters)))
+       (loop (vhash-fold (lambda (key value result)
+                           (if (= key pid)
+                               result
+                               (vhash-consv key value result)))
+                         vlist-null
+                         waiters)))
 
       (('spawn command reply)
        ;; Spawn COMMAND; send its exit status to REPLY when it terminates.
        ;; This operation is atomic: the WAITERS table is updated before
        ;; termination of PID can possibly be handled.
        (let ((pid (fork+exec-command command)))
-         (loop (vhash-consv pid reply waiters)))))))
+         (loop (vhash-consv pid reply waiters))))
+
+      (('await pid reply)
+       ;; Await the termination of PID and send its status on REPLY.
+       (if (catch-system-error (kill pid 0))
+           (loop (vhash-consv pid reply waiters))
+           (begin                                 ;PID is gone
+             (put-message reply 0)
+             (loop waiters)))))))
 
 (define (spawn-process-monitor)
   "Spawn a process monitoring fiber and return a channel to communicate with
@@ -1863,6 +1888,45 @@ context.  The process monitoring fiber is responsible 
for handling
         (get-message reply))
       (apply system* program arguments)))
 
+(define default-process-termination-grace-period
+  ;; Default process termination "grace period" before we send SIGKILL.
+  (make-parameter 5))
+
+(define* (get-message* channel timeout #:optional default)
+  "Receive a message from @var{channel} and return it, or, if the message 
hasn't
+arrived before @var{timeout} seconds, return @var{default}."
+  (call-with-values
+      (lambda ()
+        (perform-operation
+         (choice-operation (get-operation channel)
+                           (sleep-operation timeout))))
+    (match-lambda*
+      (()                               ;'sleep' operation returns zero values
+       default)
+      ((message)                            ;'get' operation returns one value
+       message))))
+
+(define* (terminate-process pid signal
+                            #:key (grace-period
+                                   (default-process-termination-grace-period)))
+  "Send @var{signal} to @var{pid}, which can be negative to denote a process
+group; wait for @var{pid} to terminate and return its exit status.  If
+@var{pid} is still running @var{grace-period} seconds after @var{signal} has
+been sent, send it @code{SIGKILL}."
+  (let ((reply (make-channel)))
+    (put-message (current-process-monitor) `(await ,(abs pid) ,reply))
+    (kill pid signal)
+
+    (match (get-message* reply grace-period #f)
+      (#f
+       (local-output
+        (l10n "Grace period of ~a seconds is over; sending ~a SIGKILL.")
+        grace-period pid)
+       (catch-system-error (kill pid SIGKILL))
+       (get-message reply))
+      (status
+       status))))
+
 (define (handle-service-termination service status)
   "Handle the termination of the process associated with @var{service}, whose
 PID is in its @code{running} slot; @var{status} is the process's exit status
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 08a67ad..ddf9f3e 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -1,5 +1,5 @@
 # GNU Shepherd --- Test detecting a forked process' termination
-# Copyright © 2016, 2020 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2016, 2020, 2022 Ludovic Courtès <ludo@gnu.org>
 # Copyright © 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 #
 # This file is part of the GNU Shepherd.
@@ -84,6 +84,17 @@ cat > "$conf"<<EOF
    #:start (make-forkexec-constructor %command3)
    #:stop  (make-kill-destructor)
    #:respawn? #t))
+
+(define %command4
+  '("$SHELL" "-c" "trap 'echo ignoring SIGTERM' SIGTERM; while true ; do : ; 
done"))
+
+(register-services
+ (make <service>
+   ;; A service that ignores SIGTERM.
+   #:provides '(test4)
+   #:requires '(test3)
+   #:start (make-forkexec-constructor %command4)
+   #:stop  (make-kill-destructor SIGTERM #:grace-period 3)))
 EOF
 cat $conf
 
@@ -134,11 +145,22 @@ test -f "$service_nofiles"
 nofiles_value="`cat $service_nofiles`"
 test 1567 -eq $nofiles_value
 
-
-
-# Try to trigger eventual race conditions, when killing a process between fork
-# and execv calls.
+# Try to trigger potential race conditions, when killing a process, between
+# the fork and execv calls.
 for i in `seq 1 50`
 do
     $herd restart test3
 done
+
+# Make sure 'herd stop' eventually terminates processes that ignore SIGTERM.
+$herd start test4
+$herd status test3 | grep started
+child_pid="$($herd status test4 | grep Running | sed '-es/.*Running value is 
\([0-9]\+\)\./\1/g')"
+kill -0 "$child_pid"
+$herd stop test3               # this will also stop 'test4'
+! kill -0 "$child_pid"
+grep ignoring "$log"
+grep SIGKILL "$log"
+$herd status test3 | grep stopped
+
+$herd status



reply via email to

[Prev in Thread] Current Thread [Next in Thread]