Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

From: Jacob Bachmeyer
Subject: Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang
Date: Thu, 21 May 2020 20:36:47 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20090807 MultiZilla/ SeaMonkey/1.1.17 Mnenhy/

Maciej W. Rozycki wrote:
On Wed, 20 May 2020, Jacob Bachmeyer wrote:
So the solution to the problem is twofold. First, pending force-kills are not cancelled after `wait' if there is more than one PID in the list passed to `close_wait_program'. This follows from the observation that if there is only one PID on the list, the process must have been created directly by `spawn' rather than by assigning a spawn ID to a pipeline, and the return from `wait' means the process associated with that PID has already been cleaned up after; it is only when there are more PIDs that some may have been left behind alive.
Why are some cases assigning a spawn ID to a pipeline and others directly using spawn?

Have you seen the descriptions around `local_exec'? Or are you after further explanations?

If I understand correctly, the problems occur when local_exec starts a pipeline using open instead of directly running the process using spawn. Could this be fixed at the source by using spawn to run {/bin/sh -c "exec \"$commandline\" $inp $outp 2>&1"} (including the braces this time) instead of opening a pipeline? The shell would set up the redirections and then replace itself with the requested process.
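A minimal shell sketch of that suggestion (illustrative only, not the DejaGnu code): because `exec` replaces the shell in place rather than forking, the PID observed before and after the `exec` is the same, so `spawn` would end up tracking the real target process instead of a wrapper shell.

```shell
# Demonstration: `exec' replaces the shell without forking, so the PID
# seen by the caller is the PID of the final program, not of a wrapper.
out=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
first=$(printf '%s\n' "$out" | head -n 1)
second=$(printf '%s\n' "$out" | tail -n 1)
if [ "$first" = "$second" ]; then
    echo "exec kept the same PID"
fi
```

The shell performs the redirections and then vanishes, leaving only the requested process under the original PID.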

Second, if a pipeline has been used, then the channel associated with the pipeline is set to non-blocking mode in case any of the processes that may have been left alive is stuck in uninterruptible sleep (aka the D state). Such a process is immune even to SIGKILL so long as it remains in that state and would cause the wait(2) called by `close' to hang, possibly indefinitely, and we want the testsuite to proceed rather than hang even in bad circumstances.
This amounts to leaking PIDs but should be tolerable since the condition causing it is abnormal and external to the testsuite.

Having a process permanently stuck in the D state is exceedingly rare and often signifies a hardware failure (oh well) or a software bug somewhere (but we want to be able to run software verification on buggy systems, sometimes even on imperfect hardware we want to bring up; otherwise how would we chase those bugs?), and since the process has already been sent SIGKILL, it will go away right away if it ever leaves that state.

I have had exactly that happen a long time ago with a bug in the ext3 filesystem driver: the process that triggered the bug died at the oops (IIRC) but any process that touched the broken mountpoint afterwards would get stuck in D state. (Really Bad when that mountpoint is /home!) The only recovery was to use the SysRq reboot trigger because attempting an orderly shutdown would hang.

Otherwise there is really nothing we can do about the process, as it is the kernel that holds it. Therefore I wouldn't consider it PID leakage.
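The behaviour being described can be sketched in shell (illustrative only; a SIGTERM-ignoring child stands in for a D-state process, since real uninterruptible sleep cannot be created from userspace): send the kill, wait a bounded amount of time instead of blocking indefinitely, and then move on, leaking the PID.

```shell
# Stand-in for a stuck process: a child that ignores SIGTERM.
# (Ignored signal dispositions survive exec, so the sleep inherits it.)
sh -c 'trap "" TERM; sleep 30' &
pid=$!
sleep 1                         # let the trap be installed
kill -TERM "$pid" 2>/dev/null   # the "force-kill" the child survives

# Bounded wait: poll instead of blocking indefinitely in wait(2).
tries=0
while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 3 ]; do
    sleep 1
    tries=$((tries + 1))
done

if kill -0 "$pid" 2>/dev/null; then
    echo "child still stuck; giving up and leaking its PID"
    kill -KILL "$pid" 2>/dev/null   # demo cleanup only
fi
```

This mirrors the patch's intent: the testsuite gives up on the stuck member of the pipeline rather than hanging in `close'.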

It is a leak caused by something we do not control and for which we are not responsible, but could there be some way to detect repeated occurrences of this condition and abort the testsuite before effectively forkbombing the system?
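One possible shape for such a guard (a sketch only, not part of the patch; the function name and threshold are invented): count each leaked stuck PID and refuse to continue past a limit, so a pathological target cannot accumulate processes without bound.

```shell
# Hypothetical guard (names and threshold invented): track how many
# stuck pipelines have been leaked and abort once a limit is crossed.
MAX_LEAKED=5
leaked=0

note_leaked_pid() {
    leaked=$((leaked + 1))
    if [ "$leaked" -ge "$MAX_LEAKED" ]; then
        echo "error: $leaked stuck processes leaked; aborting" >&2
        return 1
    fi
    return 0
}

# usage: call once per stuck pipeline detected
for i in 1 2 3 4 5; do
    note_leaked_pid || break
done
echo "leaked=$leaked"
```

A counter like this would have to live somewhere persistent across test invocations, which is the harder part of the question.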

Index: dejagnu/lib/remote.exp
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
     # Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
 	# We reaped the process, so cancel the pending force-kills, as
 	# otherwise if the PID is reused for some other unrelated
 	# process, we'd kill the wrong process.
+	#
+	# Do this if the PID list only has a single entry however, as
+	# otherwise `wait' will have returned once any single process
+	# of the pipeline has exited regardless of whether any other
+	# ones have remained.
 	# Use `catch' in case the force-kills have completed, so as not
 	# to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+	if { ! $got_eof } {
+	    # If timed-out, don't wait for all the processes associated
+	    # with the pipeline to terminate as a stuck one would cause
+	    # us to hang.
+	    set r2 [catch "fconfigure $id -blocking false" res]
+	}
 	set r2 [catch "close $id" res]
     } else {
 	verbose "waitres is $wres" 2
The big question I have is whether this patch breaks the GDB testsuite. The "cancel force-kill" mechanism was added to fix a problem with testing GDB.

I'm not set up to run remote GDB testing with QEMU at the moment; I'll see if I can either wire it up or use a different system with real remote hardware. I think it is not going to matter, as commit e7d7a3e0b0cd mentions GDB itself being affected rather than any test cases run (via `gdbserver') on the remote system, and I'd expect GDB to be spawned directly rather than run as a pipeline, in which case we'll have a single PID only, the invocation of `wait' will have wait(2)ed on it, and the force-kills will be cancelled as per the `[llength $pid] == 1' condition.

In that case, at least check that the patch does not break the native GDB testsuite.

-- Jacob
