bug-bash

Re: Async processes started in functions not reliably started


From: Steffen Nurpmeso
Subject: Re: Async processes started in functions not reliably started
Date: Sun, 11 Aug 2019 00:50:44 +0200
User-agent: s-nail v14.9.14-9-g0a0ff75e

Hello and a nice Saturday evening, Mr. Elz, and everyone.

While it is not a bash bug, and therefore quite off topic, i come
back to this once more.  Maybe it is of interest for someone.

And maybe someone can shed some light on this.  This would be
nice.

Steffen Nurpmeso wrote in <20190807193402.d1ZQM%steffen@sdaoden.eu>:
 |Steffen Nurpmeso wrote in <20190806142527.9HS0i%steffen@sdaoden.eu>:
 ||Robert Elz wrote in <26245.1565045376@jinx.noi.kre.to>:
 |||    Date:        Mon, 05 Aug 2019 14:05:43 +0200
 |||    From:        Steffen Nurpmeso <steffen@sdaoden.eu>
 |||    Message-ID:  <20190805120543.Bf9-U%steffen@sdaoden.eu>
 | ..
 |||The shell cannot really know - your example was not functional until
 |||after it set up the traps.
 | ..
 |||No temp files, named pipes, or other similar stateful mechanisms needed.
 |
 |Sorry for all that noise once again, but i have then rewritten it
 |using mkfifo etc. with credits for some of you (which collects
 |things i have seen flying by since Saturday night):
 |
 |    They also came up with the solution: do not wait(1) on child
 |    processes until we know about their state, so that anytime before we
 |    actually do wait(1) we can safely kill(1) them (Jilles Tjoelker).
 |    Thus, let's create a FIFO (Chet Ramey) to get a synchronized
 |    device, strip the wild test undertaker to a core that only writes
 |    "timeout" to that FIFO, and also improve its startup-is-completed to
 |    simply send a signal to the parent process (Robert Elz).  So
 |    either the tests finish nicely, in which case they write their job
 |    number to the fifo, or we see "timeout" and kill all remains.
 ...
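
The kill-before-wait ordering quoted above could be sketched roughly as
follows.  This is only an illustration: the `sleep 30` job and the
variable names are placeholders and are not taken from mx-test.sh.  It
shows the ordering (record $!, signal, only then wait) and nothing more;
whether a particular shell really defers reaping until wait is not
something the sketch demonstrates.

```shell
#!/bin/sh
# Placeholder job standing in for one asynchronous test.
sleep 30 &
pid=$!

# Signal first, while the job has not been waited for yet ...
kill -TERM "${pid}"

# ... and collect its status only afterwards.
wait "${pid}"
status=$?

# POSIX reports termination by a signal as a status greater than 128.
echo "job ${pid} finished with status ${status}"
```

The point of the ordering is that the status is collected last, so the
decision to kill never depends on a job that was already waited for.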

The problem is that it does not work out portably.  Maybe i am
getting something wrong, but i see failures on multi processor
OpenBSD 6.5/i386 and FreeBSD 11.3-RC2/i386 (in a Linux KVM/Qemu).
On these i see

  mx-test.sh[8467]: can't open t.fifo: Interrupted system call

quite frequently, even if there are no traps installed at all, and
data written to the FIFO is occasionally lost.  It is written in

   (
      trap '' HUP INT TERM EXIT
      if ${mkdir} t.${JOBS}.d; then
         ( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
      fi
      [ -e t.fifo ] && echo ${JOBS} >> t.fifo
   ) > t.${JOBS}.io 2>&1 </dev/null &

and i can put it in an if.fi and see that echo has happened, with
a successful $?.  But in the parent loop

      while [ 1 ]; do
         read js < t.fifo
         # I saw quite frequent "Interrupted system call" errors on FreeBSD!
         # Also OpenBSD
         [ ${?} -ne 0 ] && continue

it will never be read!  I.e., whereas the test is an actual
success and exits fine we end up with

  ... [1=digmsg] [2=on_main_loop_tick] [3=compose_hooks] [4=mass_recipients] .. waiting
  ...mx-test.sh: cannot open t.fifo: Interrupted system call
  !! Timeout: reaped job(s) 2/[on_main_loop_tick]

but also like this:

  ... [1=q_t_etc_opts] [2=message_injections] [3=attachments] [4=rfc2231] .. waiting
  !! Timeout: reaped job(s) 1/[q_t_etc_opts]

This never happens on Linux (x86-64).  So then i have to make
the tests repeatedly write to the FIFO, and kill(1) them when the
parent really gets to read it (and kill(1) them hard if we read
the "timeout"), as in:

     (
        trap '' HUP INT TERM EXIT
        if ${mkdir} t.${JOBS}.d; then
           ( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
        fi
        trap 'exit 0' USR1
        while [ -e t.fifo ]; do
  echo >&2 JOB $JOBS WRITES FIFO
           echo ${JOBS} >> t.fifo
           sleep 1
        done
     ) > t.${JOBS}.io </dev/null & # 2>&1 </dev/null &

as well as

      while [ 1 ]; do
         read js < t.fifo
  echo >&2 FROM FIFO I READ $js
         [ ${?} -ne 0 ] && continue
         JOBDESC=`${awk} -v L="${JOBDESC}" '
            BEGIN{
               while(1){
                  sub("^[ ]+", "", L)
                  sub("[ ]+$", "", L)
                  if(length(L) == 0)
                     break

                  x = L
                  sub("[ ]+.+$", "", x)
                  y = z = x
                  sub("^[0-9]+=[0-9]+/", "", z)
                  sub("/.+$", "", y)
                  x = y
                  sub("=.+", "", x)
                  sub(".+=", "", y)
                  print x " " y " " z

                  sub("^[^ ]+", "", L)
               }
            }
         ' | {
            l= kl=
            while read j p n; do
               if [ ${js} = timeout ]; then
                  kl="${kl} ${j}/[${n}]"
  echo >&2 KILL ING $j=$p/$n
                  kill -KILL ${p} >/dev/null 2>&1
                  ${rm} -f t.${j}.result
               elif [ ${js} = ${j} ]; then
  echo >&2 USR1 ING $j=$p/$n
                  kill -USR1 ${p} >/dev/null 2>&1
               else
                  l="${l} ${j}=${p}/${n}"
               fi
            done

            if [ ${js} = timeout ] && [ -n "${kl}" ]; then
               printf >&2 '%s!! Timeout: reaped job(s)%s%s\n' \
                  "${COLOR_ERR_ON}" "${kl}" "${COLOR_ERR_OFF}"
            fi
            echo ${l}
         }`
         [ ${js} = timeout ] && break
         # If all jobs finished regularly: done
         [ -z "${JOBDESC}" ] && break
      done

But, even then, see this:

  ... [1=X_Y_opt_input_go_stack] [2=X_errexit] [3=Y_errexit] [4=S_freeze] .. waiting
  JOB 3 WRITES FIFO
  FROM FIFO I READ 3
  USR1 ING 3=8203/Y_errexit
  JOB 4 WRITES FIFO
  JOB 2 WRITES FIFO
  FROM FIFO I READ 4
  USR1 ING 4=8210/S_freeze
  JOB 1 WRITES FIFO
  FROM FIFO I READ 1
  USR1 ING 1=8189/X_Y_opt_input_go_stack
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  FROM FIFO I READ timeout
  KILL ING 2=8195/X_errexit

So then i do

   (
      trap '' HUP INT TERM EXIT
      if ${mkdir} t.${JOBS}.d; then
         ( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
      fi
      if [ -n "${JOBREAPER}" ]; then
         trap 'exit 0' USR1
         while [ 1 ]; do
  echo >&2 JOB $JOBS WRITES FIFO
            echo ${JOBS} >> t.fifo
            sleep 3
         done
      fi
   ) > t.${JOBS}.io </dev/null & # 2>&1 </dev/null &

And with that, finally, i get

  ... [1=alias] [2=charsetalias] [3=shortcut] [4=expandaddr] .. waiting
  JOB 2 WRITES FIFO
  JOB 3 WRITES FIFO
  FROM FIFO I READ 3
The 2 is not there!!
  USR1 ING 3=20540/shortcut
  JOB 1 WRITES FIFO
  FROM FIFO I READ 1
  USR1 ING 1=20526/alias
  JOB 4 WRITES FIFO
  FROM FIFO I READ 4
  USR1 ING 4=20549/expandaddr
  JOB 2 WRITES FIFO
  FROM FIFO I READ 2
  USR1 ING 2=20532/charsetalias

But, after a dozen tests, and with reducing the sleep to 1 (and
reducing the debug echoes):

  ... [1=ifelse] [2=localopts] [3=local] [4=environ] .. waiting
  JOB 3 WRITES FIFO
  JOB 2 WRITES FIFO
  JOB 4 WRITES FIFO
  JOB 1 WRITES FIFO
  /usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
  /usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
  /usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
  !! Timeout: reaped job(s) 3/[local]

It does not loop!  So i have extended the sleep to 3 again, and
placed the echo in a subshell.  Other than that i offer a "testnj"
make target.  I am entirely out of ideas.
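
One variation that might be worth trying is to open the FIFO once in
the parent and keep that descriptor for the whole run, instead of
reopening t.fifo for every read.  The reasoning is a guess that nothing
in this message confirms: a blocking open of a FIFO can fail with
"Interrupted system call" when a signal such as SIGCHLD arrives, and
data still buffered in a FIFO is discarded once the last descriptor on
it is closed.  The sketch below is self-contained and uses hypothetical
names (a mktemp directory, descriptor 3, four dummy jobs).  It also
assumes that opening a FIFO read-write does not block, which POSIX
leaves undefined but Linux and the BSDs are generally known to accept.

```shell
#!/bin/sh
# Hypothetical scratch area; mx-test.sh uses t.fifo in its work directory.
dir=$(mktemp -d) || exit 1
fifo=${dir}/t.fifo
mkfifo "${fifo}" || exit 1

# Assumption: a read-write open of a FIFO returns at once (undefined in
# POSIX).  The descriptor stays open, so buffered data is never dropped
# and the parent never has to reopen the FIFO between reads.
exec 3<>"${fifo}"

jobs=4
j=1
while [ ${j} -le ${jobs} ]; do
   # Dummy job: report the job number just as the real tests do.
   ( echo ${j} >> "${fifo}" ) &
   j=$((j + 1))
done

seen=
n=0
while [ ${n} -lt ${jobs} ]; do
   # Read from the long-lived descriptor; retry if the read fails.
   read js <&3 || continue
   seen="${seen} ${js}"
   n=$((n + 1))
done

wait
exec 3<&-
rm -rf "${dir}"

# Arrival order is not deterministic, so sort before reporting.
sorted=$(echo $(printf '%s\n' ${seen} | sort -n))
echo "jobs seen: ${sorted}"
```

A "timeout" token could travel over the same descriptor; that part is
left out here to keep the sketch short.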

A nice Sunday i wish.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


