bug-bash

Re: Async processes started in functions not reliably started


From: Steffen Nurpmeso
Subject: Re: Async processes started in functions not reliably started
Date: Tue, 06 Aug 2019 16:25:27 +0200
User-agent: s-nail v14.9.14-9-g0a0ff75e

Hello Mr. Elz.

Ah, that old story, you are very much welcome.

Robert Elz wrote in <26245.1565045376@jinx.noi.kre.to>:
 |    Date:        Mon, 05 Aug 2019 14:05:43 +0200
 |    From:        Steffen Nurpmeso <steffen@sdaoden.eu>
 |    Message-ID:  <20190805120543.Bf9-U%steffen@sdaoden.eu>
 |
 || Would be nice to have some shell support for signalling the parent
 || that the child is now functional,
 |
 |The shell cannot really know - your example was not functional until
 |after it set up the traps.

That was the problem.

 |But the shell code knows, something like the following might work
 |(untested, not even given off to bash to check syntax, and uses of $?
 |would need to be sanitised (value saved) with just what is here it is
 |OK, but real code to replace the comments would probably need to use it
 |again)
 |
 |In the parent:
 |
 | OK=false
 | T=$(trap -p USR2)            # only needed if USR2 might be trapped already
 | trap 'OK=true' USR2
 |
 | run_the_child &
 | if ! $OK && wait $!

That unary does not work on all shells.  I think SunOS 5.9 is
a problem, i use "if x; then :; else" to overcome that.
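
A minimal sketch of that workaround, assuming a shell without the "!"
reserved word.  The helper name negate is made up for illustration and
is not part of the code under discussion.

```shell
# negate: hypothetical helper for shells that lack the "!" reserved word.
# It succeeds exactly when the given command fails.
negate() {
   if "$@"; then
      return 1
   else
      return 0
   fi
}

OK=false
# Same test as "if ! $OK", spelled with an empty then-branch.
if ${OK}; then
   :
else
   echo "OK is not set yet"
fi
```

The empty then-branch keeps the condition inline; the helper is only
worth it when the negation is needed in several places.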

 | then
 |  echo "Child failed to initialise properly!" >&2
 |  # and whatever else you want to do
 | elif $OK
 | then
 |  : # here the child is running, and ready
 | else
 |  echo "Failure: $? from child" >&2
 |
 |  # either the child did exit N (N != 0) in which
 |  # case $? will tell us why it failed, or some
 |  # stray signal was delivered (and caught) by the
 |  # current shell ... deal with those possibilities
 | fi
 | case "$T" in # if T= was needed above
 | '') trap - USR2;;    # bash would have said nothing if trap was default
 |  *) eval "$T"         ;; # for other shells which do, or if USR2 was trapped.
 | esac
 | # continue with parent code, now knowing that child has init'd itself
 |
 |In the child:
 |
 | trap 'whatever' SIG_I_NEED
 | # any other init that is needed
 |
 | kill -s USR2 $$      # or if the parent pid is not $$, use whatever is.

Detection of parent pid in subshells is actually really a problem
that i always run into.
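
One reason this is awkward: inside a subshell, $$ still expands to the
PID of the invoking shell, so a nested subshell cannot learn its own PID
from it.  A small sketch that shows the effect; the variable names are
made up for illustration.  (bash additionally offers BASHPID, but that
is not portable.)

```shell
main_pid=$$
# The command substitution runs in a subshell, yet $$ is unchanged there.
seen_in_subshell=$( echo "$$" )
echo "main shell: ${main_pid}, subshell sees: ${seen_in_subshell}"
```

So "kill -s USR2 $$" in a child always targets the top-level shell, which
is only right if that shell is the one that installed the trap.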

 | # do whatever the child is supposed to do
 |
 |The wait is to pause the parent - an exit 0 from it should not happen,
 |and indicates that the child did exit 0 which it is not supposed to do
 |at this point.  The ! $OK test before the wait is in case the child
 |started very quickly, and the signal already arrived.   There is still
 |a race condition here (having the child sleep for a brief interval as
 |part of its init would help reduce the probability of problems from that).
 |Pity the shell has no way to allow scripts to block signals (ie: sigblock).
 |
 |If the wait is interrupted by a signal, (or if the USR2 signal happened
 |earlier and we skip the wait) and it was USR2 (from the child) then OK
 |will become true, and the child is ready to continue.   If the wait
 |exits for some other reason, then perhaps some other signal was delivered
 |(and caught, and did not exit the shell) - if that's possible the wait should
 |be in a loop (ie: while :; do if wait ...) and this case should cause the
 |loop to iterate, whereas all the other possibilities end in break, or the
 |child did exit N indicating that some failure happened before it init'd
 |itself.
 |
 |No temp files, named pipes, or other similar stateful mechanisms needed.
 |What's more, aside from the "trap -p" which is probably not going to be
 |needed (the script writer knows no other USR2 trap is already set) all of
 |this is POSIX code (even the trap -p will be in the next version).

It is not portable until then.  (My daily mksh does not have it.)
A very fine idea, using this signal mess for something good!
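
Put together, the handshake with the wait loop might look roughly like
the following.  This is only a sketch under assumptions: child_task, its
PID argument, READY and the one second pause are made up for
illustration, the "trap -p" save and restore is left out, and "kill -0"
is used as a heuristic to tell an interrupted wait from a failed child.
The race described in the quoted text is narrowed by the pause, not
closed.

```shell
# child_task PARENT_PID: hypothetical stand-in for the real child.
child_task() {
   trap 'exit 0' TERM            # the initialisation the parent waits for
   sleep 1                       # brief pause: narrows the race, does not close it
   kill -s USR2 "$1"             # report "ready" to the parent
   while :; do sleep 1; done     # placeholder for the real work
}

OK=false READY=
trap 'OK=true' USR2

child_task $$ &
child=$!

while [ -z "${READY}" ]; do
   if ${OK}; then
      READY=yes                  # USR2 has arrived
   elif wait ${child}; then
      READY=no                   # child exited 0 before reporting ready
   elif ${OK}; then
      READY=yes                  # wait was interrupted by USR2
   elif kill -0 ${child} 2>/dev/null; then
      :                          # some other trapped signal: wait again
   else
      READY=no                   # child exited with a failure status
   fi
done

trap - USR2
echo "child ready: ${READY}"
kill -TERM ${child} 2>/dev/null
wait ${child}
```

This has to run in the top-level shell, because the child signals the PID
it was handed, and $$ does not change in subshells.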

 |kre
 |
 |ps: the function in the example is badly named, to "reap" is to harvest
 |or collect, what the function given that name is actually doing is
 |killing other processes (the original parent collects them, not that
 |child) - a better name would be assassin than reaper (it isn't even the
 |"Grim Reaper").

Grim Reaper is, actually not, nice.  Well, hm, badly influenced by
the Bauhaus song The Sanity Assassin, in the 80s, likely.  Also,
in (my) German, Die Assassinen is no good.  You know, it tends to
become religious here, and _i_ played that reap the muslims (via
machine gun) Commodore 64 game back in the 80s, where the high
score was formed by cutting off muslim heads (in a sequence, for
each letter, which took looooong), actually.  If anyone knows the
name of that game (i did not buy it, "it just came along", just
like Elite, Pirates or Defender of the Crown, from school friends,
i really would like to know!).  But not very often, and only for
a very short time, it was very boring.  Killing an Arab, like The
Cure sang even earlier, but more peacefully, i think.

No, dear Mr. Elz, assassin it cannot be.  The code in question was
also about undertaking tests which loop endlessly, for example,
without disturbing the test run as such (so an outer parachute is
not it).  But the idea of using a signal-based callback is really
cute, thanks for that idea!  Nonetheless, the code now has been
verified to work under SunOS 5.9 - .11, *BSD, and Linux, and that
is good enough for now.

It still has a problem that will increase over time when
i remember Matthew Dillon's post on DragonFly BSD users@[1], where
he claims 450000 execs per second for a statically linked binary,
and about 45000 execs per second for a dynamic one, with DragonFly
5.6 on a threadripper.

  [1] https://marc.info/?l=dragonfly-users&m=155846667020624&w=2

  jobreaper_start() {
     (
        sleeper= int= hot=
        trap '
           [ -n "${sleeper}" ] && kill -KILL ${sleeper}
           int=1 hot=1
        ' USR1
        trap '
           [ -n "${sleeper}" ] && kill -KILL ${sleeper}
           int=1 hot=
        ' USR2
        trap '
           [ -n "${sleeper}" ] && kill -KILL ${sleeper}
           echo "Stopping job reaper"
           exit 0
        ' TERM
        trap '' EXIT
  
        # traps are setup, notify parent that we are up and running
        echo > t.jobreaper
  
        while [ 1 ]; do
           int=
           sleep ${JOBWAIT} &
           sleeper=${!}
           wait
           sleeper=
           if [ -z "${int}" ] && [ -n "${hot}" ]; then
              i=0 l=
              while [ ${i} -lt ${MAXJOBS} ]; do
                 i=`add ${i} 1`
  
                 if [ -s t.${i}.pid ] && read p n < t.${i}.pid; then
                    # Of course a race condition, but cannot be helped!
                    if [ -s t.${i}.pid ]; then

The test subshells will be spawned like

   (
      if ${mkdir} t.${JOBS}.d; then
         cd t.${JOBS}.d
         eval t_${1} ${JOBS} ${1}
      fi
      ${rm} -f ../t.${JOBS}.pid
   ) > t.${JOBS}.io 2>&1 </dev/null &
   JOBLIST="${JOBLIST} ${!}"
   printf "${!} ${1}\n" > t.${JOBS}.pid

So we race on that t.*.pid file

                       kill -KILL ${p}

Here ^

                       ${rm} -f t.${i}.result
                       l="${l} ${i}/${n}"
                    fi
                 fi
              done
              [ -n "${l}" ] &&
                 printf '%s!! Reaped job(s)%s after %s seconds%s\n' \
                    "${COLOR_ERR_ON}" "${l}" ${JOBWAIT} "${COLOR_ERR_OFF}"
           fi
        done
     ) </dev/null & #>/dev/null 2>&1 &
     JOBREAPER=${!}
  
     while [ 1 ]; do
        [ -f t.jobreaper ] && break
        printf '.. waiting for job reaper to come up\n'
        sleep 1
     done
     ${rm} t.jobreaper
  }
  
  jobreaper_stop() {
     [ -n "${JOBREAPER}" ] && kill -TERM ${JOBREAPER}
     JOBREAPER=
  }


My guess would have been that if i kill(1) a job specification,
then the sh(1)ell would refuse to use its builtin kill(1) to kill
a PID that was saved away from ${!}, and the process in question
has already terminated.  Of course using job specifications is
impossible here, let alone portably (though that i have not even
tried, because set -m results in a lot of unwanted noise).
Maybe, if i think some more, maybe i will find a better solution
for that.  For now i kept it like above.
Mr. Elz,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


