Re: Async processes started in functions not reliably started
From: Steffen Nurpmeso
Subject: Re: Async processes started in functions not reliably started
Date: Sun, 11 Aug 2019 00:50:44 +0200
User-agent: s-nail v14.9.14-9-g0a0ff75e
Hello and a nice Saturday evening, Mr. Elz, and everyone.
While it is not a bash bug, and therefore quite off topic, i come
back to this once more. Maybe it is of interest for someone.
And maybe someone can shed some light on this. This would be
nice.
Steffen Nurpmeso wrote in <20190807193402.d1ZQM%steffen@sdaoden.eu>:
|Steffen Nurpmeso wrote in <20190806142527.9HS0i%steffen@sdaoden.eu>:
||Robert Elz wrote in <26245.1565045376@jinx.noi.kre.to>:
||| Date: Mon, 05 Aug 2019 14:05:43 +0200
||| From: Steffen Nurpmeso <steffen@sdaoden.eu>
||| Message-ID: <20190805120543.Bf9-U%steffen@sdaoden.eu>
| ..
|||The shell cannot really know - your example was not functional until
|||after it set up the traps.
| ..
|||No temp files, named pipes, or other similar stateful mechanisms needed.
|
|Sorry for all that noise once again, but i have then rewritten it
|using mkfifo etc. with credits for some of you (which collects
|things i have seen flying by since Saturday night):
|
| They also came up with the solution: do not wait(1) on child
| processes until we know about their state, so that anytime before we
| actually do wait(1) we can safely kill(1) them (Jilles Tjoelker).
| Thus, let's create a FIFO (Chet Ramey) to get a synchronized
| device, strip the wild test undertaker to a core that only writes
| "timeout" to that FIFO, and also improve its startup-is-completed to
| simply send a signal to the parent process (Robert Elz). So
| either the tests finish nicely, in which case they write their job
| number to the fifo, or we see "timeout" and kill all remains.
...
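As a rough illustration of the scheme quoted above, here is a minimal
sketch; it is not the code from mx-test.sh. The names run_job and tmpd,
the sleeps and the 5 second timeout are made up for the example. Unlike
the loops shown further down, it keeps the FIFO open read-write on
descriptor 3 so that the example does not depend on open/close timing.
That is an assumption: POSIX leaves a read-write open of a FIFO
undefined, although common systems allow it.

```shell
#!/bin/sh
# Sketch only: run_job, tmpd and all timings are hypothetical.
tmpd=`mktemp -d` || exit 1   # mktemp -d is widespread, but not in POSIX
fifo=${tmpd}/t.fifo
mkfifo "${fifo}" || exit 1

# Assumption: opening a FIFO read-write works and does not block.
exec 3<>"${fifo}"

run_job() { # $1 = job number; the "test" is just a short sleep
   ( sleep 1 3>&-; echo "${1}" >&3 ) &
}

run_job 1
run_job 2

# Watchdog: reports "timeout" if the jobs take too long.
( sleep 5 >/dev/null 2>&1 3>&-; echo timeout >&3 ) &
watchdog=$!

pending=2 seen=
while [ ${pending} -gt 0 ]; do
   read js <&3 || continue
   [ "${js}" = timeout ] && break
   seen="${seen} ${js}"
   pending=$((pending - 1))
done

kill ${watchdog} 2>/dev/null
exec 3>&-
rm -rf "${tmpd}"
echo "finished:${seen}"
```

Either both job numbers arrive (in whatever order the jobs finish), or
the word "timeout" arrives first and the loop stops early.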
The problem is that it does not work out portably. Maybe i am
getting something wrong, but i see failures on multi processor
OpenBSD 6.5/i386 and FreeBSD 11.3-RC2/i386 (in a Linux KVM/Qemu).
On these i see
mx-test.sh[8467]: can't open t.fifo: Interrupted system call
quite frequently, even if there are no traps installed at all, and
data written to the FIFO is occasionally lost. It is written in
(
trap '' HUP INT TERM EXIT
if ${mkdir} t.${JOBS}.d; then
( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
fi
[ -e t.fifo ] && echo ${JOBS} >> t.fifo
) > t.${JOBS}.io 2>&1 </dev/null &
and i can put that echo in an if ... fi and see that it has happened,
with a successful $?. But in the parent loop
while [ 1 ]; do
read js < t.fifo
# I saw quite frequent "Interrupted system call" errors on FreeBSD! Also OpenBSD
[ ${?} -ne 0 ] && continue
it will never be read! I.e., whereas the test is an actual
success and exits fine we end up with
... [1=digmsg] [2=on_main_loop_tick] [3=compose_hooks] [4=mass_recipients] .. waiting
...mx-test.sh: cannot open t.fifo: Interrupted system call
!! Timeout: reaped job(s) 2/[on_main_loop_tick]
but also like this:
... [1=q_t_etc_opts] [2=message_injections] [3=attachments] [4=rfc2231] .. waiting
!! Timeout: reaped job(s) 1/[q_t_etc_opts]
This never happens on Linux (x86-64). So then i have to make
the tests repeatedly write to the FIFO, and kill(1) them when the
parent really gets to read it (and kill(1) them hard if we read
the "timeout"), as in:
(
trap '' HUP INT TERM EXIT
if ${mkdir} t.${JOBS}.d; then
( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
fi
trap 'exit 0' USR1
while [ -e t.fifo ]; do
echo >&2 JOB $JOBS WRITES FIFO
echo ${JOBS} >> t.fifo
sleep 1
done
) > t.${JOBS}.io </dev/null & # 2>&1 </dev/null &
as well as
while [ 1 ]; do
read js < t.fifo
[ ${?} -ne 0 ] && continue
echo >&2 FROM FIFO I READ $js
JOBDESC=`${awk} -v L="${JOBDESC}" '
BEGIN{
while(1){
sub("^[ ]+", "", L)
sub("[ ]+$", "", L)
if(length(L) == 0)
break
x = L
sub("[ ]+.+$", "", x)
y = z = x
sub("^[0-9]+=[0-9]+/", "", z)
sub("/.+$", "", y)
x = y
sub("=.+", "", x)
sub(".+=", "", y)
print x " " y " " z
sub("^[^ ]+", "", L)
}
}
' | {
l= kl=
while read j p n; do
if [ ${js} = timeout ]; then
kl="${kl} ${j}/[${n}]"
echo >&2 KILL ING $j=$p/$n
kill -KILL ${p} >/dev/null 2>&1
${rm} -f t.${j}.result
elif [ ${js} = ${j} ]; then
echo >&2 USR1 ING $j=$p/$n
kill -USR1 ${p} >/dev/null 2>&1
else
l="${l} ${j}=${p}/${n}"
fi
done
if [ ${js} = timeout ] && [ -n "${kl}" ]; then
printf >&2 '%s!! Timeout: reaped job(s)%s%s\n' \
"${COLOR_ERR_ON}" "${kl}" "${COLOR_ERR_OFF}"
fi
echo ${l}
}`
[ ${js} = timeout ] && break
# If all jobs finished regularly: done
[ -z "${JOBDESC}" ] && break
done
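To make the awk part of the loop above easier to follow, the sketch
below runs the same BEGIN block on a made-up JOBDESC value. The sample
string and the variable name "parsed" are assumptions for illustration;
the job=pid/name layout is taken from the debug output in this message.
Each entry is split into one "job pid name" line, which is the form that
the "while read j p n" loop consumes.

```shell
#!/bin/sh
# Hypothetical sample value; the real JOBDESC is built elsewhere in mx-test.sh.
JOBDESC=' 1=8189/X_Y_opt_input_go_stack 2=8195/X_errexit '

parsed=`awk -v L="${JOBDESC}" '
BEGIN{
   while(1){
      # Trim surrounding blanks; stop once nothing is left.
      sub("^[ ]+", "", L)
      sub("[ ]+$", "", L)
      if(length(L) == 0)
         break
      # x: first entry only, e.g. "1=8189/X_Y_opt_input_go_stack".
      x = L
      sub("[ ]+.+$", "", x)
      y = z = x
      sub("^[0-9]+=[0-9]+/", "", z)   # z: test name
      sub("/.+$", "", y)              # y: "job=pid"
      x = y
      sub("=.+", "", x)               # x: job number
      sub(".+=", "", y)               # y: pid
      print x " " y " " z
      # Drop the entry that was just handled.
      sub("^[^ ]+", "", L)
   }
}'`

printf '%s\n' "${parsed}"
# Prints:
#   1 8189 X_Y_opt_input_go_stack
#   2 8195 X_errexit
```

Since the program consists of a BEGIN block only, awk does not read its
standard input.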
But, even then, see this:
... [1=X_Y_opt_input_go_stack] [2=X_errexit] [3=Y_errexit] [4=S_freeze] .. waiting
JOB 3 WRITES FIFO
FROM FIFO I READ 3
USR1 ING 3=8203/Y_errexit
JOB 4 WRITES FIFO
JOB 2 WRITES FIFO
FROM FIFO I READ 4
USR1 ING 4=8210/S_freeze
JOB 1 WRITES FIFO
FROM FIFO I READ 1
USR1 ING 1=8189/X_Y_opt_input_go_stack
...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
FROM FIFO I READ timeout
KILL ING 2=8195/X_errexit
So then i do
(
trap '' HUP INT TERM EXIT
if ${mkdir} t.${JOBS}.d; then
( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
fi
if [ -n "${JOBREAPER}" ]; then
trap 'exit 0' USR1
while [ 1 ]; do
echo >&2 JOB $JOBS WRITES FIFO
echo ${JOBS} >> t.fifo
sleep 3
done
fi
) > t.${JOBS}.io </dev/null & # 2>&1 </dev/null &
And with that, finally, i get
... [1=alias] [2=charsetalias] [3=shortcut] [4=expandaddr] .. waiting
JOB 2 WRITES FIFO
JOB 3 WRITES FIFO
FROM FIFO I READ 3
The 2 is not there!!
USR1 ING 3=20540/shortcut
JOB 1 WRITES FIFO
FROM FIFO I READ 1
USR1 ING 1=20526/alias
JOB 4 WRITES FIFO
FROM FIFO I READ 4
USR1 ING 4=20549/expandaddr
JOB 2 WRITES FIFO
FROM FIFO I READ 2
USR1 ING 2=20532/charsetalias
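The write-until-acknowledged handshake between one job and the parent
can be tried in isolation with a sketch like the following. It is a
simplified stand-in, not the test script itself: the job number 7 and
the names tmpd, pid and status are made up, and the FIFO is held open
read-write on descriptor 3, which POSIX leaves undefined (an
assumption that holds on common systems).

```shell
#!/bin/sh
# Sketch only: the job number and all names here are hypothetical.
tmpd=`mktemp -d` || exit 1   # mktemp -d is widespread, but not in POSIX
fifo=${tmpd}/t.fifo
mkfifo "${fifo}" || exit 1
exec 3<>"${fifo}"            # assumption: read-write FIFO open is supported

job=7
(
   trap 'exit 0' USR1
   # Keep announcing completion until the parent acknowledges with USR1.
   while :; do
      echo ${job} >&3
      sleep 1
   done
) &
pid=$!

read js <&3          # the first announcement of the job
kill -USR1 ${pid}    # acknowledge, so the job stops repeating itself
wait ${pid}
status=$?

exec 3>&-
rm -rf "${tmpd}"
echo "read=${js} exit=${status}"
```

The trap is installed before the first write, so the parent cannot send
USR1 before the job is prepared to handle it.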
But, after a dozen tests, and with reducing the sleep to 1 (and
reducing the debug echoes):
... [1=ifelse] [2=localopts] [3=local] [4=environ] .. waiting
JOB 3 WRITES FIFO
JOB 2 WRITES FIFO
JOB 4 WRITES FIFO
JOB 1 WRITES FIFO
/usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
/usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
/usr/home/steffen/src/nail.git/mx-test.sh[8471]: can't open t.fifo: Interrupted system call
!! Timeout: reaped job(s) 3/[local]
It does not loop! So i have extended the sleep to 3 again, and
placed the echo in a subshell. Other than that i offer a "testnj"
make target. I am entirely out of ideas.
A nice Sunday i wish.
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)