[patch] parallel make bug with failing commands

From:	Michael Matz
Subject:	[patch] parallel make bug with failing commands
Date:	Sun, 31 Jul 2005 05:57:57 +0200 (CEST)
Hi,

[please keep me CCed, I'm not subscribed to bug-make]

I noticed this problem with make 3.80 while building GCC.  I can
reproduce it with a small makefile, also with current CVS of GNU make.

First I describe the symptoms, and then the bug.  The former is a bit
long, so you might skip to the description of the bug, which is obvious
once you know where to look.

See this Makefile:
----------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3

fail1 fail2 fail3:
        echo Fail
        exit 1

ok1 ok2 ok3:
        echo Ok
        sleep 2
        echo ok done
----------------------------

So, we have a mixture of failing and succeeding commands, where the
succeeding commands need quite some time to finish.  Running make on the
above in parallel will sometimes result in make not waiting for all
started jobs before exiting.  A multi-CPU machine increases the likelihood
of this happening.  Higher values of N for -jN increase it too (I can
usually reproduce it just fine with -j6, i.e. with the maximum parallelism
for this makefile, but others might have to add more targets).

This is an example of the bug:

% make -r -j5 ; echo "============================="; pp sleep
echo Fail
Fail
exit 1
echo Ok
echo Fail
echo Ok
echo Fail
Ok
sleep 2
Fail
exit 1
make: *** [fail3] Error 1
make: *** Waiting for unfinished jobs....
make: *** [fail1] Error 1
Ok
sleep 2
Fail
exit 1
make: *** [fail2] Error 1
=============================
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
matz     14483  0.0  0.1  7112  736 pts/0    S    06:02   0:00 sleep 2
matz     14485  0.0  0.1  7112  736 pts/0    S    06:02   0:00 sleep 2

Note how even after 'make' stopped there are still two sleeps running on
the system.

The above example is of course harmless.  But this also happens if the
commands are sub-makes, which then hang around without a controlling parent
make.  Worse, a make can return to the shell (with an error) while
some sub-makes are still building things in some directories.  If one
carries on working after the top make has returned, one might see confusing
effects from those sub-makes (e.g. files magically appearing in subdirs,
command output in the terminal, and other annoyances).  Killing all these
sub-makes by hand can be cumbersome if there are many (I have machines
where I can build GCC with a parallelism of 32, and something like the
above happened to me.  I chose to wait until the sub-makes were done
on their own, instead of hunting them down).

To demonstrate the above effect with sub-makes involved, just change the 
top-level Makefile to:

----------------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3

ok1 ok2 ok3 fail1 fail2 fail3:
        $(MAKE) -C $@
----------------------------------

Here each */Makefile contains the same commands as above, split up
appropriately between the ok* and fail* subdirs.  An example output would
look like:

% ./make/make/make -r -j6 ; pp sleep
/tmp/par-make/./make/make/make -C fail1
/tmp/par-make/./make/make/make -C ok1
/tmp/par-make/./make/make/make -C fail2
/tmp/par-make/./make/make/make -C ok2
/tmp/par-make/./make/make/make -C fail3
/tmp/par-make/./make/make/make -C ok3
make[1]: Entering directory `/tmp/par-make/ok1'
make[1]: Entering directory `/tmp/par-make/fail2'
make[1]: Entering directory `/tmp/par-make/fail1'
make[1]: Entering directory `/tmp/par-make/ok2'
make[1]: Entering directory `/tmp/par-make/fail3'
make[1]: Entering directory `/tmp/par-make/ok3'
Fail /tmp/par-make/fail2
exit 1
Ok /tmp/par-make/ok3
Ok /tmp/par-make/ok1
Fail /tmp/par-make/fail3
exit 1
Fail /tmp/par-make/fail1
Ok /tmp/par-make/ok2
exit 1
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail2'
make: *** [fail2] Error 2
make: *** Waiting for unfinished jobs....
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail3'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail1'
make: *** [fail3] Error 2
make: *** [fail1] Error 2
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
matz      9765  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
matz      9766  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
matz      9769  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
address@hidden % Ok /tmp/par-make/ok3 done
make[1]: Leaving directory `/tmp/par-make/ok3'
Ok /tmp/par-make/ok1 done
Ok /tmp/par-make/ok2 done
make[1]: Leaving directory `/tmp/par-make/ok1'
make[1]: Leaving directory `/tmp/par-make/ok2'

Note how the prompt is already there, followed by output from the
sub-makes still working in ok[123].  I'll spare us the output of running
make with the -d option; what happens is that make suddenly exits although
there are still job slots in use.

I know why this happens.  The problem is the interaction between die() and
reap_children() when multiple failing jobs are in the queue and the user
does not use -k.  Let's suppose there are five job slots in use (all three
failing jobs and two of the ok jobs).  The first failing one will trigger
"reap_children(0, 0)" at some point, and then the chain of events goes like so:

reap_children (0, /*err= */ 0)
  # reap the failing child fail1
  # if (!err && child_failed && !keep_going_flag)
  #   die (2);
    
die (2)
  # this is the first call, hence dying is 0, ergo it does:
  # dying = 1
  # for (err = (status != 0); job_slots_used > 0; err = 0)
  #      reap_children (1, err);
  # status == 2, hence err will be 1 in the first call

reap_children (1, 1)
  # suppose this will get the second failing job, fail2
  # if (!err && child_failed && !keep_going_flag)
  #   die (2);
  # as err == 1, this will not call die(2).  Instead it sets block = 0,
  # repeats the loop, and exits it, as no other children are dead,
  # so we return to the above die (2) activation

# We are in this loop again:
  # for (err = (status != 0); job_slots_used > 0; err = 0)
  #      reap_children (1, err);
  # right now job_slots_used is 3 (the last fail job, and the two ok jobs)
  # this time, the second iteration, i.e. err is now 0, so we do:

reap_children (1, 0)
  # We now reap the third failing child, fail3
  # err is 0, hence we do this:
  # if (!err && child_failed && !keep_going_flag)
  #   die (2);

die (2)
  # as dying is set, we jump over the cleanup
  # and just do:
  exit (2)

Voila.  We don't wait for the two last jobs ok1 and ok2.  Note that the 
timing here is critical.  If in the second reap_children invocation both 
remaining fail jobs are done, then they will be reaped by that activation 
already, and hence don't lead to a recursive die() call in the last 
reap_children() invocation.
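
The chain of events above can be replayed in a small toy model.  This is
only a sketch of the control flow, not GNU make's code: the names
exit_failed, batch and jobs_left_at_exit are made up for illustration,
longjmp() stands in for exit(), -k is assumed to be off, and the "timing"
is modelled by stating how many dead children each reap_children() call
happens to find.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the control flow only; none of this is GNU make's code. */
static jmp_buf exit_point;           /* longjmp here stands in for exit() */
static unsigned int job_slots_used;
static int dying;
static const bool *exit_failed;      /* children, in the order they exit */
static const int *batch;             /* dead children found by each call */
static size_t next_child, next_call;

static void die (int status);

static void reap_children (int block, int err)
{
  (void) block;                      /* blocking itself is not modelled */
  int found = batch[next_call++];
  for (int i = 0; i < found; i++)
    {
      bool child_failed = exit_failed[next_child++];
      assert (job_slots_used > 0);
      job_slots_used--;
      if (!err && child_failed)      /* keep_going_flag assumed to be 0 */
        die (2);
    }
}

static void die (int status)
{
  if (!dying)
    {
      dying = 1;
      /* Same loop shape as in the trace: err is 1 only the first time. */
      for (int err = (status != 0); job_slots_used > 0; err = 0)
        reap_children (1, err);
    }
  longjmp (exit_point, 1);           /* "exit (status)" */
}

/* Returns how many jobs are still running when the model "exits". */
unsigned int jobs_left_at_exit (const bool *failed, unsigned int njobs,
                                const int *batches)
{
  exit_failed = failed;
  batch = batches;
  job_slots_used = njobs;
  dying = 0;
  next_child = next_call = 0;
  if (setjmp (exit_point) == 0)
    reap_children (0, 0);
  return job_slots_used;
}
```

With three failing children followed by two ok ones, batches of 1,1,1,1,1
(each call finds exactly one dead child) reproduce the unlucky timing and
leave two jobs running, while batches of 1,2,1,1 (the second call finds both
remaining failures) let the loop in die() run to completion.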

The problem is that the 'err' variable is used to control two things,
namely whether the 'Waiting for unfinished jobs....' warning should be
printed, _and_ whether die() should be called recursively.  As the warning
should be printed only once, 'err' is reset after the first iteration.  But
that leads to a recursive invocation of die(), which just exits the whole
make and fails to complete the waiting loop in the outer die() activation.
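
To make the double role concrete, here is a minimal sketch that spells out
the two decisions as separate flags for each iteration of the waiting loop.
The struct and both function names are hypothetical and exist only for this
illustration; make itself passes a single int.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; GNU make has no such struct. */
struct reap_flags
{
  bool announce_wait;   /* print "*** Waiting for unfinished jobs...." */
  bool suppress_die;    /* do not call die() again for a failed child */
};

/* Current loop: one variable, cleared after the first iteration,
   so both decisions flip together.  */
struct reap_flags original_flags (int status, int iteration)
{
  assert (iteration >= 0);
  bool err = (status != 0) && (iteration == 0);
  return (struct reap_flags) { err, err };
}

/* Decoupled: announce only once, but keep suppressing the nested
   die() for as long as we are waiting.  */
struct reap_flags decoupled_flags (int status, int iteration)
{
  assert (iteration >= 0);
  bool failing = (status != 0);
  return (struct reap_flags) { failing && iteration == 0, failing };
}
```

From the second iteration on, original_flags() no longer suppresses die(),
which is exactly the window in which a late failing child ends the wait.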

I used the patch below to fix this problem.  It produces no regressions in
the testsuite.  It might perhaps be a good idea to test that
job_slots_used is 0 right before doing the exit() in die().  That would
have caught this bug.
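
A minimal sketch of such a check, assuming it would be called at the end of
die() with make's job_slots_used counter.  Both helper names are
hypothetical and not part of make or of the patch:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of GNU make.  Returns 1 when no job
   slot is in use; otherwise reports the problem and returns 0.  */
int exit_is_safe (unsigned int slots_in_use)
{
  if (slots_in_use == 0)
    return 1;
  fprintf (stderr, "BUG: exiting with %u job slot(s) still in use\n",
           slots_in_use);
  return 0;
}

/* Intended to run right before exit (status) in die(), e.g. as
   check_before_exit (job_slots_used);  aborts in debug builds.  */
void check_before_exit (unsigned int slots_in_use)
{
  assert (exit_is_safe (slots_in_use));
}
```

In the failing scenario described above this would trip with two slots
still in use instead of silently leaving the jobs behind.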

I hope this makes sense.


Ciao,
Michael.
-- 

Index: job.c
===================================================================
RCS file: /cvsroot/make/make/job.c,v
retrieving revision 1.166
diff -u -p -r1.166 job.c
--- job.c       26 Jun 2005 03:31:30 -0000      1.166
+++ job.c       31 Jul 2005 03:50:43 -0000
@@ -475,9 +475,12 @@ reap_children (int block, int err)
 
       if (err && block)
        {
+         static int printed = 0;
          /* We might block for a while, so let the user know why.  */
          fflush (stdout);
-         error (NILF, _("*** Waiting for unfinished jobs...."));
+         if (!printed)
+           error (NILF, _("*** Waiting for unfinished jobs...."));
+         printed = 1;
        }
 
       /* We have one less dead child to reap.  As noted in
Index: main.c
===================================================================
RCS file: /cvsroot/make/make/main.c,v
retrieving revision 1.210
diff -u -p -r1.210 main.c
--- main.c      12 Jul 2005 04:35:13 -0000      1.210
+++ main.c      31 Jul 2005 03:50:44 -0000
@@ -2990,7 +2990,7 @@ die (int status)
        print_version ();
 
       /* Wait for children to die.  */
-      for (err = (status != 0); job_slots_used > 0; err = 0)
+      for (err = (status != 0); job_slots_used > 0;)
        reap_children (1, err);
 
       /* Let the remote job module clean up its state.  */