[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing

From: Jacob Bachmeyer
Subject: Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang
Date: Wed, 20 May 2020 21:14:46 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20090807 MultiZilla/ SeaMonkey/1.1.17 Mnenhy/

Maciej W. Rozycki wrote:

So the solution to the problem is twofold. First pending force-kills are not killed after `wait' if there are more than one PID in the list passed to `close_wait_program'. This follows the observation that if there was only one PID on the list, then the process must have been created directly by `spawn' rather than by assigning a spawn ID to a pipeline and the return from `wait' would mean the process associated with the PID must have already been cleaned up after, so it is only when there are more there is a possibility any may have been left behind live.

Why are some cases assigning a spawn ID to a pipeline and others directly using spawn?

Second if a pipeline has been used, then the channel associated with the pipeline is set to the nonblocking mode in case any of the processes that may have left live is stuck in the noninterruptible sleep (aka D) state. Such a process would necessarily ignore even SIGKILL so long as it remains in that state and would cause wait(2) called by `close' to hang possibly indefinitely, and we want the testsuite to proceed rather than hang even in bad circumstances.

This amounts to leaking PIDs but should be tolerable since the condition causing it is abnormal and external to the testsuite.

Finally it appears to be safe to leave pending force-kills to complete their job after `wait' has been called in `close_wait_program', because based on the observation made here the command does not actually call wait(2) if issued on a spawn ID associated with a pipeline created by `open' rather than a process created by `spawn'. Instead the PIDs from a pipeline are supposed to be cleaned up after by calling wait(2) from the `close' command call made on the pipeline channel. If on the other hand the channel is set to the nonblocking mode before `close', then even that command does not call wait(2) on the associated PIDs.

Therefore the PIDs on the list passed are not subject to PID reuse and the force-kills won't accidentally kill an unrelated process, as a PID cannot be allocated by the kernel for a new process until any previous process's status has been consumed from its PID by wait(2). And then PIDs of any children that have actually terminated one way or another are wait(2)ed for by Tcl automatically, so no mess is left behind.

The above two paragraphs need to be in the Git commit message if this patch is merged; they describe assumptions critical to correct operation that I do not believe are the documented behavior of Tcl.

* lib/remote.exp (close_wait_program): Only kill the pending force-kills if the PID list has a single entry. (local_exec): If we didn't see an EOF, then set the channel about to be closed to the nonblocking mode.

Signed-off-by: Maciej W. Rozycki <address@hidden>

I have observed the hang in actual testing, where I ran full GCC testing (including all language front-ends and all target libraries) remotely for the `riscv64-linux-gnu' target with QEMU in the system emulation mode. I left it running over the UK long weekend of May 8th-10th and upon return found the testsuite stuck for several days. The underlying reason was an unindentified issue with QEMU or the Linux installation within causing the emulator's virtual network interface to go down. However the testsuite ought not to have got stuck indefinitely on the host system. The process state on the host was as shown above, with `ssh' still live and both `cat' and `sh' in the zombie state.

I have since reproduced the issue with a shell script substituting `ssh' with a command that reliably hangs, which let me track down the cause and diagnose it as above. This fix then let GCC testing run to completion, despite intermittent issues with QEMU remaining and causing occasional time-outs.

 Therefore, please apply.  FAOD this has been formatted for `git am' use.

 lib/remote.exp |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: dejagnu/lib/remote.exp
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
# Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
        # We reaped the process, so cancel the pending force-kills, as
        # otherwise if the PID is reused for some other unrelated
        # process, we'd kill the wrong process.
+       #
+       # Do this if the PID list only has a single entry however, as
+       # otherwise `wait' will have returned once any single process
+       # of the pipeline has exited regardless of whether any other
+       # ones have remained.
        # Use `catch' in case the force-kills have completed, so as not
        # to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+       if { ! $got_eof } {
+           # If timed-out, don't wait for all the processes associated
+           # with the pipeline to terminate as a stuck one would cause
+           # us to hang.
+           set r2 [catch "fconfigure $id -blocking false" res]
+       }
        set r2 [catch "close $id" res]
     } else {
        verbose "waitres is $wres" 2

The big question I have is whether this patch breaks the GDB testsuite? The "cancel force-kill" mechanism was added to fix a problem with testing GDB.

-- Jacob

reply via email to

[Prev in Thread] Current Thread [Next in Thread]