bug-make

From: Florent Viard
Subject: [bug #40725] Make could completely freeze during a parallel build in some particular conditions
Date: Wed, 27 Nov 2013 19:16:15 +0000
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0

URL:
  <http://savannah.gnu.org/bugs/?40725>

                 Summary: Make could completely freeze during a parallel build
in some particular conditions
                 Project: make
            Submitted by: fviard
            Submitted on: Wed 27 Nov 2013 19:16:14 GMT
                Severity: 3 - Normal
              Item Group: Bug
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
       Component Version: 3.81
        Operating System: POSIX-Based
           Fixed Release: None
           Triage Status: None

    _______________________________________________________

Details:

I have a particular setup where I run make to build a simple package using
Scratchbox2, through fakeroot, inside a chroot.
I wasn't able to reproduce this issue outside of this setup, but it may be
only a question of timing, since make runs more slowly in my special setup.

In my case, I am trying to build an old version of "dosfstools" with
"make -j 8".
The makefile looks like this:
-----------------
DESTDIR =
PREFIX = /usr/local
SBINDIR = $(PREFIX)/sbin
DOCDIR = $(PREFIX)/share/doc
MANDIR = $(PREFIX)/share/man

OPTFLAGS = -O2 -fomit-frame-pointer $(shell getconf LFS_CFLAGS)
WARNFLAGS = -Wall -Wextra -Wno-sign-compare -Wno-missing-field-initializers \
            -Wmissing-prototypes -Wstrict-prototypes
DEBUGFLAGS =
CFLAGS += $(OPTFLAGS) $(WARNFLAGS) $(DEBUGFLAGS)

VPATH = src

all: build

build: dosfsck dosfslabel mkdosfs

dosfsck: boot.o check.o common.o fat.o file.o io.o lfn.o dosfsck.o

dosfslabel: boot.o check.o common.o fat.o file.o io.o lfn.o dosfslabel.o

mkdosfs: mkdosfs.o

...
-----------------


The issue does not occur if I replace this line:
OPTFLAGS = -O2 -fomit-frame-pointer $(shell getconf LFS_CFLAGS)
with this one:
OPTFLAGS = -O2 -fomit-frame-pointer -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64


When make is frozen, I can see the following process tree:
...  |-sh---fakeroot---sb2-monitor-+-bash---bash---make---qemu-arm
Here, qemu-arm is what actually runs the "getconf" command.

The interesting point is that "qemu-arm" is in the zombie state.
So it has already completed, but make hasn't yet called waitpid for it.
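
To illustrate what "zombie" means here, the following is a minimal sketch
(not code from make) of a child that has terminated but stays visible until
its parent reaps it. The function name zombie_until_reaped and the exit
status 7 are made up for the example; it assumes a POSIX system that provides
waitid() with WNOWAIT.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a finished child stays around as a
   zombie until the parent reaps it, and is gone afterwards.  */
int zombie_until_reaped (void)
{
  pid_t pid = fork ();
  assert (pid >= 0);

  if (pid == 0)
    _exit (7);                  /* Child: terminate immediately.  */

  /* Wait until the child has terminated WITHOUT reaping it (WNOWAIT).
     From this point on the child is a zombie.  */
  siginfo_t info;
  if (waitid (P_PID, (id_t) pid, &info, WEXITED | WNOWAIT) != 0)
    return 0;

  /* A zombie still has a process entry; signal 0 only checks existence.  */
  int still_there = (kill (pid, 0) == 0);

  /* Reap it.  This is the step make had not yet performed.  */
  int status = 0;
  if (waitpid (pid, &status, 0) != pid)
    return 0;

  /* Once reaped, there is nothing left to wait for.  */
  int gone = (waitpid (pid, NULL, 0) == -1 && errno == ECHILD);

  return still_there && gone
         && WIFEXITED (status) && WEXITSTATUS (status) == 7;
}
```

Calling zombie_until_reaped() from a small main() should return 1 on a
typical Linux system, assuming SIGCHLD is not being ignored by the caller.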


I did my work with Make 3.81, but I see no change to the following parts
of the code in the latest versions of Make.

In job.c, in the new_job() function, there is the following piece of
code inside a while loop:
        /* Make sure we have a dup'd FD.  */
        if (job_rfd < 0)
          {
            DB (DB_JOBS, ("Duplicate the job FD\n"));
            job_rfd = dup (job_fds[0]);
          }

        [...]

        /* Reap anything that's currently waiting.  */
        reap_children (0, 0);

        /* Kick off any jobs we have waiting for an opportunity that
           can run now (i.e., waiting for load). */
        start_waiting_jobs ();

        /* If our "free" slot has become available, use it; we don't need an
           actual token.  */
        if (!jobserver_tokens)
          break;

        /* There must be at least one child already, or we have no business
           waiting for a token. */
        if (!children)
          fatal (NILF, "INTERNAL: no children as we go to sleep on read\n");

        [...]

        /* Set interruptible system calls, and read() for a job token.  */
        set_child_handler_action_flags (1, waiting_jobs != NULL);
        got_token = read (job_rfd, &token, 1);
        saved_errno = errno;
        set_child_handler_action_flags (0, waiting_jobs != NULL);


Basically, set_child_handler_action_flags() changes whether system calls
interrupted by a signal are restarted: as the comment says, it makes them
interruptible before the read, and it reverts that after the read.
It also installs the following signal handler for the rest of the
execution of the process:

    RETSIGTYPE
    child_handler (int sig UNUSED)
    {
      ++dead_children;

      if (job_rfd >= 0)
        {
          close (job_rfd);
          job_rfd = -1;
        }
      [...]
    }


The idea here is to be able to interrupt the blocking read if something
happens to a child.
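
The effect of the handler flags on a blocking read can be sketched as
follows. This is only an illustration under stated assumptions, not make's
code: the names on_sigchld, wake_fd and read_while_child_exits are made up,
and the handler writes a byte to the pipe (instead of closing a descriptor
as child_handler does) so that the SA_RESTART case has something to read
once the call is restarted.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static int wake_fd = -1;        /* Write end of the pipe, for the handler.  */

static void on_sigchld (int sig)
{
  int saved = errno;
  ssize_t ignored = write (wake_fd, "x", 1);   /* async-signal-safe */
  (void) ignored;
  (void) sig;
  errno = saved;
}

/* Block in read() on an empty pipe while a child exits.  SA_FLAGS is 0
   (interruptible) or SA_RESTART.  Returns what read() returned (or -2 on
   setup failure) and stores read()'s errno in *ERR.  */
int read_while_child_exits (int sa_flags, int *err)
{
  int fds[2];
  struct sigaction sa, old;
  char byte;

  if (pipe (fds) != 0)
    return -2;
  wake_fd = fds[1];

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = on_sigchld;
  sigemptyset (&sa.sa_mask);
  sa.sa_flags = sa_flags;
  if (sigaction (SIGCHLD, &sa, &old) != 0)
    return -2;

  pid_t pid = fork ();
  if (pid < 0)
    return -2;
  if (pid == 0)
    {
      /* Child: give the parent time to block in read(), then exit.
         The delay is a timing assumption, not a guarantee.  */
      struct timespec delay = { 0, 200 * 1000 * 1000 };
      nanosleep (&delay, NULL);
      _exit (0);
    }

  errno = 0;
  ssize_t n = read (fds[0], &byte, 1);
  *err = errno;

  sigaction (SIGCHLD, &old, NULL);
  waitpid (pid, NULL, 0);
  close (fds[0]);
  close (fds[1]);
  return (int) n;
}
```

With sa_flags set to 0, the read is expected to fail with EINTR when the
child exits; with SA_RESTART, the kernel restarts the read, which then
returns the byte written by the handler.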

Later, if the process is able to acquire a "work slot", the shell command is
executed through the "func_shell" function of "function.c"
(or the "func_shell_base" function in Make 4.0).

      /* First the pipedes pipe is created, and the shell command is run
         by a child process that keeps the write side of the pipe.
         Then the make (parent) process does a blocking read of pipedes[0]
         (read side, child output) until the child process completes.  */
      for (i = 0; ; i += cc)
        {
          [...]
          EINTRLOOP (cc, read (pipedes[0], &buffer[i], maxlen - i));
          if (cc <= 0)
            break;
        }
      buffer[i] = '\0';

      /* Close the read side of the pipe.  */
      [...]
      (void) close (pipedes[0]);

As far as I understand, the blocking read here relies on interrupted system
calls not being restarted automatically, so that it is interrupted when the
program receives an early SIGCHLD and does not risk blocking forever.
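
The read-until-EOF pattern described above can be sketched in isolation like
this. It is a simplified illustration, not the func_shell code: the function
name read_child_output is made up, the child writes a fixed message instead
of running a shell command, and the inner do/while is my reading of what a
macro named EINTRLOOP does (retry while the call fails with EINTR).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that writes MSG to a pipe and collect
   its output into BUF (capacity CAP, always NUL-terminated).
   Returns the number of bytes read, or -1 on setup failure.  */
long read_child_output (const char *msg, char *buf, size_t cap)
{
  int pipedes[2];
  size_t i = 0;
  ssize_t cc;

  if (cap == 0 || pipe (pipedes) != 0)
    return -1;

  pid_t pid = fork ();
  if (pid < 0)
    {
      close (pipedes[0]);
      close (pipedes[1]);
      return -1;
    }
  if (pid == 0)
    {
      /* Child: keep only the write side of the pipe.  */
      close (pipedes[0]);
      ssize_t ignored = write (pipedes[1], msg, strlen (msg));
      (void) ignored;
      _exit (0);
    }

  /* Parent: close the write side, otherwise read() never sees EOF.  */
  close (pipedes[1]);

  while (i < cap - 1)
    {
      do
        cc = read (pipedes[0], buf + i, cap - 1 - i);
      while (cc == -1 && errno == EINTR);   /* retry if interrupted */
      if (cc <= 0)
        break;                              /* EOF or real error */
      i += (size_t) cc;
    }
  buf[i] = '\0';

  /* Close the read side of the pipe, then reap the child.  */
  close (pipedes[0]);
  waitpid (pid, NULL, 0);
  return (long) i;
}
```

The close of pipedes[0] at the end corresponds to the call where make was
observed to hang in this report.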

So we arrive at the issue. Sometimes, make freezes at the following "close"
call:
(void) close (pipedes[0]);

However, if I put a "printf" just before the close, the issue is no longer
reproducible.
Looking at make in gdb when it is stuck, I can see the following backtrace:

----------------------------
0xb76557c4 in accept () at ../sysdeps/unix/sysv/linux/i386/socket.S:57
57  in ../sysdeps/unix/sysv/linux/i386/socket.S
(gdb) bt
#0  0xb76557c4 in accept () at ../sysdeps/unix/sysv/linux/i386/socket.S:57
#1  0xb783cec8 in ?? ()
#2  0xb77648c6 in rcmd_af (ahost=0xb783cec8, rport=0, locuser=0x0,
remuser=0x0, cmd=0xb780d8a9 "\201\303\377\340\002", fd2p=0xb783b9a8,
    af=<value optimized out>) at rcmd.c:236
#3  0xb780d8da in ?? ()
#4  0xb780db1a in ?? ()
#5  0xb77fe415 in ?? ()
#6  0x08054430 in child_handler (sig=17) at job.c:436
#7  <signal handler called>
#8  0xb77648db in rcmd_af (ahost=0xb783cec8, rport=1, locuser=0x0,
remuser=0xb783b9a8 "h\250\005", cmd=0xb780d919 "\201Ï\340\002",
fd2p=0xb783b9a8,
    af=<value optimized out>) at rcmd.c:286
#9  0xb780d949 in ?? ()
#10 0xb77fe415 in ?? ()
#11 0x0805106e in func_shell (o=0x949f849 "ssing-field-initializer",
argv=0xbf93f560, funcname=0x8067158 "shell") at function.c:1737
----------------------------

It may not be obvious from the trace, but what happens is that we enter the
close function to close pipedes[0] in frame #11, and during its execution the
"child_handler" signal handler is triggered by the termination of the
child process. Inside this signal handler, since "job_rfd" holds a valid
descriptor, another close is called to close the pipe file descriptor
identified by job_rfd.
So make is stuck executing a close inside the signal handler, which itself
interrupted another close on a different file descriptor.

I have two theories: 1) something in my libc or environment is not
signal-safe; 2) my particular setup often produces exactly the right
timing: the read terminates because the child's output reaches end of file,
and the SIGCHLD signal then arrives just as the close function is being
called. (Bad luck :p)



Anyway, I think there is something wrong in the current code that should be
fixed, even if the issue is hard to reproduce on other setups.
I have two proposed solutions, both of which work correctly with the current
code in "job.c":

1) Always close job_rfd after the read, so that the close in the signal
handler can never run later than this read call.
-------------
    set_child_handler_action_flags (1, waiting_jobs != NULL);
    got_token = read (job_rfd, &token, 1);
    saved_errno = errno;
-> +if (job_rfd >= 0){
-> +    close (job_rfd);
-> +    job_rfd = -1;
-> +}
    set_child_handler_action_flags (0, waiting_jobs != NULL);
-------------
(I don't know whether it is really useful to preserve job_rfd for the next
iteration in order to avoid calling dup() again.)


2) Add a new variable that acts as a kind of "lock", ensuring that
child_handler only calls close during the read call of interest.
-------------
-    RETSIGTYPE
-    child_handler (int sig UNUSED)
-    {
-      ++dead_children;
-
-      if (job_rfd >= 0)
-        {
-          close (job_rfd);
-          job_rfd = -1;
-        }
-     [...]
-    }

+    int should_handler_close_rfd = 0;
+    RETSIGTYPE
+    child_handler (int sig UNUSED)
+    {
+      ++dead_children;
+
+      if (job_rfd >= 0 && should_handler_close_rfd == 1)
+        {
+          close (job_rfd);
+          job_rfd = -1;
+        }
+      [...]
+    }

...

   /* Set interruptible system calls, and read() for a job token.  */
-    set_child_handler_action_flags (1, waiting_jobs != NULL);
-    got_token = read (job_rfd, &token, 1);
-    saved_errno = errno;
-    set_child_handler_action_flags (0, waiting_jobs != NULL);

+    set_child_handler_action_flags (1, waiting_jobs != NULL);
+    should_handler_close_rfd = 1;
+    got_token = read (job_rfd, &token, 1);
+    saved_errno = errno;
+    should_handler_close_rfd = 0;
+    set_child_handler_action_flags (0, waiting_jobs != NULL);



(Option 2 is my favorite one)
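
The guard idea of option 2 can be tried in isolation with a small sketch.
This is not a patch for make: the names rfd, handler_may_close and
fd_survives_signal are made up, the flag is declared volatile sig_atomic_t
(an assumption about how such a flag would best be typed), and SIGUSR1 with
raise() stands in for SIGCHLD so that the signal arrives at a known point.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t rfd = -1;              /* stands in for job_rfd */
static volatile sig_atomic_t handler_may_close = 0; /* the proposed "lock"   */

static void guarded_handler (int sig)
{
  (void) sig;
  /* Only close the descriptor while the guard is set.  */
  if (rfd >= 0 && handler_may_close)
    {
      int saved = errno;
      close (rfd);
      rfd = -1;
      errno = saved;
    }
}

/* Deliver a signal with the guard set to GUARD.  Returns 1 if the
   descriptor is still open afterwards, 0 if the handler closed it,
   -1 on setup failure.  */
int fd_survives_signal (int guard)
{
  int fds[2];
  struct sigaction sa, old;

  if (pipe (fds) != 0)
    return -1;
  rfd = fds[0];
  handler_may_close = guard;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = guarded_handler;
  sigemptyset (&sa.sa_mask);
  if (sigaction (SIGUSR1, &sa, &old) != 0)
    return -1;

  raise (SIGUSR1);              /* The handler runs before raise() returns.  */

  handler_may_close = 0;
  sigaction (SIGUSR1, &old, NULL);

  /* F_GETFD fails with EBADF on a closed descriptor.  */
  int open_now = (fcntl (fds[0], F_GETFD) != -1);
  if (open_now)
    close (fds[0]);
  close (fds[1]);
  rfd = -1;
  return open_now;
}
```

With the guard cleared, the handler leaves the descriptor alone, so a signal
arriving during an unrelated close() would no longer trigger a nested
close(); with the guard set, the handler closes it as before.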

I hope you will be able to make sense of this long bug report :-)




    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?40725>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



