bug-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bash may hang waiting for an already terminated job


From: Fabio Brugnara
Subject: bash may hang waiting for an already terminated job
Date: Mon, 7 Jan 2002 11:09:35 +0100

Configuration Information [Automatically generated, do not change]:
Machine: i686
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='i686' 
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='i686-pc-linux-gnu' 
-DCONF_VENDOR='pc' -DSHELL -DHAVE_CONFIG_H   -I.  -I. -I./include -I./lib -g
uname output: Linux mascara 2.2.17-14 #1 Mon Feb 5 18:48:50 EST 2001 i686 
unknown
Machine Type: i686-pc-linux-gnu

Bash Version: 2.05a
Patch Level: 0
Release Status: release

Description:

When a script runs many commands together  with a long-term background job,
it may happen that bash get stuck in a waitpid() call, waiting for the
termination of a foreground job which has actually already terminated.


Repeat-By:

The following script is a variation of the one sent in may 2000 by Sampo
Niskanen. The original version, which did not run the "/bin/sh -c .."
command in the background, made bash-2.04 get stuck after 10-30 minutes in
an infinite loop of invocation of waipid(-1, .., WNOHANG) which set
errno=ECHILD. A patch has been introduced in bash-2.05a so that it does not
happen anymore, but the real problem was not fixed. Actually, running this
script, bash-2.05a hangs again after 10-30 minutes, this time on a
blocking waitpid().


--------- begin test script -----------

#!/bin/sh

background () {
    # Use backgrounding.
    if test -z "$OLDPID" -o ! -e "/proc/$OLDPID" ; then
        # Previous version has exited or none has been ever started yet.
        # This is, of course, a bit more useful in the real version..
        (echo "IN-BG"; sleep 1; echo "OUT-BG") &
        OLDPID=$!
        echo "BG-LINE3:$OLDPID"
    fi
    # Always returns success.
    return 0
}

foreground () {
    echo "This might seem very useful" | cat
}

STARTTIME=`date "+%s"`

pid=$$
/bin/sh -c "
        while :; do
                kill -0 $pid 2>/dev/null || exit
                sleep 10
        done
" &

### Do an infinite loop
while : ; do
    echo "BEEN DOING THIS FOR `expr \`date "+%s"\` - $STARTTIME` SECONDS"
    background
    foreground
    usleep 50000
done

--------- end test script -----------

The problem is that bash updates information about jobs locating them
on the basis of pids, and this is not correct since pids can be recycled.
I made bash run this script under gdb, and, after the process hung, hit ^C
to examine the situation. A shortened trace of the session follows:

(Line numbers in stack trace may not correspond to the original source,
since there were some debugging printf inserted)

Program received signal SIGINT, Interrupt.
0x400b38c9 in __wait4 () from /lib/libc.so.6
(gdb) where
#0  0x400b38c9 in __wait4 () from /lib/libc.so.6
#1  0x401121cc in __DTOR_END__ () from /lib/libc.so.6
#2  0x8077ac9 in waitchld (wpid=31855, block=1) at jobs.c:2360
#3  0x8076945 in wait_for (pid=31855) at jobs.c:1764
#4  0x806873f in execute_command_internal (command=0x8105ac8, asynchronous=0,
    pipe_in=5, pipe_out=-1, fds_to_close=0x8105c48) at execute_cmd.c:660

... complete stack trace removed ...


(gdb) print jobs[11][0]
$1 = {wd = 0x80f3d08 "/hardmnt/mascara0/ssi/brugnara/utils/bash-2.05a",
  pipe = 0x80f5e28, pgrp = 31052, state = JDEAD, flags = 0, deferred = 0x0,
  j_cleanup = 0, cleanarg = 0x0}

(gdb) print jobs[11]->pipe[0]
$2 = {next = 0x80f5e28, pid = 31854, status = 0, running = 0,
  command = 0x80f3bc8 "( echo \"IN-BG\"; sleep 1; echo \"OUT-BG\" )"}

(gdb) print jobs[430][0]
$3 = {wd = 0x810d548 "/hardmnt/mascara0/ssi/brugnara/utils/bash-2.05a",
  pipe = 0x810c388, pgrp = 31052, state = JRUNNING, flags = 1, deferred = 0x0,
  j_cleanup = 0, cleanarg = 0x0}

(gdb) print jobs[430]->pipe[0]
$4 = {next = 0x80f52e8, pid = 31854, status = 0, running = 1,
  command = 0x810d848 "echo \"This might seem very useful\""}

(gdb) print jobs[430]->pipe->next[0]
$5 = {next = 0x810c388, pid = 31855, status = 0, running = 0,
  command = 0x80ea378 "cat"}

Notice that process 31854 appears in two jobs, 11 and 430. jobs[11] is an old
one, while jobs[430] is the foreground pipeline that has just exited.
When waitpid() reported termination of process 31854, bash found the first
entry in the job table (already marked as not running), and failed to
update the status of the correct entry, which is in fact still marked as
running. bash therefore keeps wait()ing because the current jobs does
not appear to be terminated.

Fix:

I made a quick fix to partially solve this problem. A flag is added to the
parameters of find_pipeline() and find_job() in order to restrict the
search to running jobs. This is set to 1 when looking for pids just
returned by waitpid(), that is when the functions are invoked by
waitchld(). With this patch applied, one is at least sure that, when
looking for processes that have just exited, the correct entry is found. In
fact, bash does not hang anymore running the test script.

-------- start patch ------------
*** jobs-original.c     Mon Nov  5 15:56:17 2001
--- jobs.c      Mon Jan  7 09:40:01 2002
***************
*** 224,230 ****
  
  static int waitchld __P((pid_t, int));
  
! static PROCESS *find_pipeline __P((pid_t));
  
  static char *current_working_directory __P((void));
  static char *job_working_directory __P((void));
--- 224,230 ----
  
  static int waitchld __P((pid_t, int));
  
! static PROCESS *find_pipeline __P((pid_t, int));
  
  static char *current_working_directory __P((void));
  static char *job_working_directory __P((void));
***************
*** 238,244 ****
  static int job_last_stopped __P((int));
  static int job_last_running __P((int));
  static int most_recent_job_in_state __P((int, JOB_STATE));
! static int find_job __P((pid_t));
  static int print_job __P((JOB *, int, int, int));
  static int process_exit_status __P((WAIT));
  static int job_exit_status __P((int));
--- 238,244 ----
  static int job_last_stopped __P((int));
  static int job_last_running __P((int));
  static int most_recent_job_in_state __P((int, JOB_STATE));
! static int find_job __P((pid_t, int));
  static int print_job __P((JOB *, int, int, int));
  static int process_exit_status __P((WAIT));
  static int job_exit_status __P((int));
***************
*** 792,850 ****
  /* Return the pipeline that PID belongs to.  Note that the pipeline
     doesn't have to belong to a job.  Must be called with SIGCHLD blocked. */
  static PROCESS *
! find_pipeline (pid)
       pid_t pid;
  {
!   int job;
!   register PROCESS *p;
! 
!   /* See if this process is in the pipeline that we are building. */
!   if (the_pipeline)
!     {
!       p = the_pipeline;
!       do
!       {
!         /* Return it if we found it. */
!         if (p->pid == pid)
!           return (p);
  
!         p = p->next;
        }
!       while (p != the_pipeline);
!     }
! 
!   job = find_job (pid);
! 
!   return (job == NO_JOB) ? (PROCESS *)NULL : jobs[job]->pipe;
  }
  
  /* Return the job index that PID belongs to, or NO_JOB if it doesn't
     belong to any job.  Must be called with SIGCHLD blocked. */
  static int
! find_job (pid)
       pid_t pid;
  {
!   register int i;
!   register PROCESS *p;
! 
!   for (i = 0; i < job_slots; i++)
!     {
!       if (jobs[i])
!       {
!         p = jobs[i]->pipe;
! 
!         do
!           {
!             if (p->pid == pid)
!               return (i);
  
!             p = p->next;
!           }
!         while (p != jobs[i]->pipe);
        }
!     }
! 
!   return (NO_JOB);
  }
  
  /* Find a job given a PID.  If BLOCK is non-zero, block SIGCHLD as
--- 792,851 ----
  /* Return the pipeline that PID belongs to.  Note that the pipeline
     doesn't have to belong to a job.  Must be called with SIGCHLD blocked. */
  static PROCESS *
! find_pipeline (pid, running)
       pid_t pid;
+      int running;
  {
!       int job;
!       register PROCESS *p;
  
!       /* See if this process is in the pipeline that we are building. */
!       if (the_pipeline) {
!               p = the_pipeline;
!               do {
!                       if(running) {
!                               if(p->pid==pid
!                                  &&(p->running||WIFSTOPPED(p->status))) {
!                                       return p;
!                                  }
!                       } else {
!                               if (p->pid == pid) return (p);
!                       }
!                       p = p->next;
!               } while (p != the_pipeline);
        }
!       job = find_job (pid, running);
!       return (job == NO_JOB) ? (PROCESS *)NULL : jobs[job]->pipe;
  }
  
  /* Return the job index that PID belongs to, or NO_JOB if it doesn't
     belong to any job.  Must be called with SIGCHLD blocked. */
  static int
! find_job (pid, running)
       pid_t pid;
+      int running;
  {
!       register int i;
!       register PROCESS *p;
  
!       for (i = 0; i < job_slots; i++) {
!               if (jobs[i]) {
!                       p = jobs[i]->pipe;
!                       do {
!                               if(running) {
!                                       if(p->pid==pid
!                                          &&(p->running
!                                             ||WIFSTOPPED(p->status))) {
!                                               return i;
!                                       }
!                               } else {
!                                       if (p->pid == pid) return i;
!                               }
!                               p = p->next;
!                       } while (p != jobs[i]->pipe);
!               }
        }
!       return (NO_JOB);
  }
  
  /* Find a job given a PID.  If BLOCK is non-zero, block SIGCHLD as
***************
*** 859,865 ****
  
    if (block)
      BLOCK_CHILD (set, oset);
!   job = find_job (pid);
    if (block)
      UNBLOCK_CHILD (oset);
  
--- 860,866 ----
  
    if (block)
      BLOCK_CHILD (set, oset);
!   job = find_job (pid, 0);
    if (block)
      UNBLOCK_CHILD (oset);
  
***************
*** 876,882 ****
  
    BLOCK_CHILD (set, oset);
  
!   job = find_job (pid);
  
    if (job != NO_JOB)
      printf ("[%d] %ld\n", job + 1, (long)pid);
--- 877,883 ----
  
    BLOCK_CHILD (set, oset);
  
!   job = find_job (pid, 0);
  
    if (job != NO_JOB)
      printf ("[%d] %ld\n", job + 1, (long)pid);
***************
*** 1507,1513 ****
    int r, job;
  
    BLOCK_CHILD (set, oset);
!   child = find_pipeline (pid);
    UNBLOCK_CHILD (oset);
  
    if (child == 0)
--- 1508,1514 ----
    int r, job;
  
    BLOCK_CHILD (set, oset);
!   child = find_pipeline (pid, 0);
    UNBLOCK_CHILD (oset);
  
    if (child == 0)
***************
*** 1521,1527 ****
    /* POSIX.2: if we just waited for a job, we can remove it from the jobs
       table. */
    BLOCK_CHILD (set, oset);
!   job = find_job (pid);
    if (job != NO_JOB && jobs[job] && DEADJOB (job))
      jobs[job]->flags |= J_NOTIFIED;
    UNBLOCK_CHILD (oset);
--- 1522,1528 ----
    /* POSIX.2: if we just waited for a job, we can remove it from the jobs
       table. */
    BLOCK_CHILD (set, oset);
!   job = find_job (pid, 0);
    if (job != NO_JOB && jobs[job] && DEADJOB (job))
      jobs[job]->flags |= J_NOTIFIED;
    UNBLOCK_CHILD (oset);
***************
*** 1671,1677 ****
  #define FIND_CHILD(pid, child) \
    do \
      { \
!       child = find_pipeline (pid); \
        if (child == 0) \
        { \
          give_terminal_to (shell_pgrp, 0); \
--- 1672,1678 ----
  #define FIND_CHILD(pid, child) \
    do \
      { \
!       child = find_pipeline (pid, 0); \
        if (child == 0) \
        { \
          give_terminal_to (shell_pgrp, 0); \
***************
*** 1731,1737 ****
         We check for JDEAD in case the job state has been set by waitchld
         after receipt of a SIGCHLD. */
        if (job == NO_JOB)
!       job = find_job (pid);
  
        /* waitchld() takes care of setting the state of the job.  If the job
         has already exited before this is called, sigchld_handler will have
--- 1732,1738 ----
         We check for JDEAD in case the job state has been set by waitchld
         after receipt of a SIGCHLD. */
        if (job == NO_JOB)
!       job = find_job (pid, 0);
  
        /* waitchld() takes care of setting the state of the job.  If the job
         has already exited before this is called, sigchld_handler will have
***************
*** 2248,2255 ****
    if (group)
      {
        BLOCK_CHILD (set, oset);
!       p = find_pipeline (pid);
!       job = find_job (pid);
  
        if (job != NO_JOB)
        {
--- 2249,2256 ----
    if (group)
      {
        BLOCK_CHILD (set, oset);
!       p = find_pipeline (pid, 0);
!       job = find_job (pid, 0);
  
        if (job != NO_JOB)
        {
***************
*** 2370,2376 ****
        children_exited++;
  
        /* Locate our PROCESS for this pid. */
!       child = find_pipeline (pid);
  
        /* It is not an error to have a child terminate that we did
         not have a record of.  This child could have been part of
--- 2371,2377 ----
        children_exited++;
  
        /* Locate our PROCESS for this pid. */
!       child = find_pipeline (pid, 1);
  
        /* It is not an error to have a child terminate that we did
         not have a record of.  This child could have been part of
***************
*** 2382,2393 ****
        while (child->pid != pid)
        child = child->next;
  
        /* Remember status, and whether or not the process is running. */
        child->status = status;
        child->running = WIFCONTINUED(status) ? 1 : 0;
  
-       job = find_job (pid);
-   
        if (job == NO_JOB)
        continue;
  
--- 2383,2394 ----
        while (child->pid != pid)
        child = child->next;
  
+       job = find_job (pid, 1);
+ 
        /* Remember status, and whether or not the process is running. */
        child->status = status;
        child->running = WIFCONTINUED(status) ? 1 : 0;
  
        if (job == NO_JOB)
        continue;
  
-------- end patch ------------

However, I think it is the very use of find_job() that should be avoided,
because its correctness relies on the wrong assumption that a pid is
uniquely associated with one job. By looking at the various invocation of
wait_for() within the source, it seems to me that it should be always
possible to use the job identifier as a parameter, so that it would not be
necessary for the function to locate the job by itself. The only exception
coud be when an explicit invocation of the builtin command "wait PID" is
made by the script, in which case ambiguity would persist. It is the
first time I look inside bash sources, so I did not dare to make scattered
changes, and leave the fix to the ones who know better...

I have another observation. It looks like the use of the sigchld and
waiting_for_job variables is superfluous, since waitchld() is called with
SIGCHLD blocked. Is this correct?

I hope you will find the time to fix these issues, and I thank you very much
for your excellent work on bash,

best regards,
Fabio Brugnara




reply via email to

[Prev in Thread] Current Thread [Next in Thread]