info-cvs

CVS server and SIGPIPE hang (1.11.2)


From: Ed Santiago
Subject: CVS server and SIGPIPE hang (1.11.2)
Date: Mon, 23 Sep 2002 19:04:00 -0600

Greetings,

========
Synopsis
========

Under some circumstances a CVS server process can hang.  This may
be more common on SMP servers.  A description of the failure mode
is enclosed, along with my analysis of the flaw and two possible
solutions.  No proper (suitable for checkin) patch is supplied.


=======
Details
=======

Many moons ago (1998), Xeno <address@hidden> wrote [1]:

    There's a problem with the flow-control done by the cvs server process I
    thought people might be interested to know about.  Workaround is easy,
    just comment out the #define SERVER_FLOWCONTROL in src/options.h and put
    a rebuilt executable on the server side.

    [1] http://www.cvshome.org/cyclic/cvs/dev-sigpipe.txt

At some point between 1.10.5 and 1.11.2, the failure mode worsened.
Instead of the client acknowledging the SIGPIPE and aborting, the
server process now hangs forever, and so does the client.  This
happens quite frequently on a Linux 2.4.19 2-CPU SMP server.

An effective way to reproduce the problem is to run something like:

    client$ for i in `seq 1 1000`;do echo -ne "\r$i...";nice -19 cvs -n status -v platforms/m20t1.mk >| /tmp/trashme;done;echo

(the 'nice' is helpful in causing the server's output pipe to fall
behind).  At some point, one of those commands will hang, and the
server will show one zombie cvs process:

    server$ ps auxww|grep cvs
    esm      16262  0.2  0.1  5316 3732 ?        S    15:09   0:00 /home/esm/cvs-1.11.2 server
    esm      16276  0.1  0.0     0    0 ?        Z    15:09   0:00 [cvs-1.11.2 <defunct>]

A typical bt shows the server parent process (16262, above) hung in:

    #0  0x420daca4 in read () from /lib/i686/libc.so.6
    #1  0x4213030c in __DTOR_END__ () from /lib/i686/libc.so.6
    #2  0x420757cf in _IO_new_file_underflow () from /lib/i686/libc.so.6
    #3  0x42077d09 in _IO_default_uflow_internal () from /lib/i686/libc.so.6
    #4  0x42076f87 in __uflow () from /lib/i686/libc.so.6
    #5  0x420724e3 in getc () from /lib/i686/libc.so.6
    #6  0x0804e68e in stdio_buffer_shutdown (buf=0x80d4648) at buffer.c:1381
    #7  0x0804e497 in buf_shutdown (buf=0x80d4648) at buffer.c:1207
    #8  0x0808879f in server_cleanup (sig=13) at server.c:4889
    #9  0x080a41a9 in SIG_handle (sig=13) at sighandle.c:158
    #10 <signal handler called>
    #11 0x420dace4 in write () from /lib/i686/libc.so.6

(all line numbers apply to pure & unsullied cvs 1.11.2)


==========
Root Cause
==========

This is a pretty definite race condition caused by the fact that
there's no negotiation between the server parent & child for the
child to terminate.  If the child emits one final burst of output
then exits, and the parent then tries to throttle the child using
the flow control pipe, the parent will get a well-deserved SIGPIPE.
As the backtrace shows, the SIGPIPE handler calls server_cleanup(),
which ends up in a getc() that never returns.


====================
Possible Solution #1
====================

One solution might be to define a new part of the protocol:

  child (to parent): may I die?  [pauses until it gets a response]

  parent (to child): yes, you may.  I will immediately close
                     my flowcontrol_pipe, and trouble you no more.

  child: GAAAaaaahhhhhh... [croaks]

IMHO this is not the ideal solution: first, it requires piggybacking
the "may I die" question onto protocol_pipe, and the "yes you may"
onto flowcontrol_pipe.  Not bad, but tricky (for me).  Second, as
a new protocol extension, it requires negotiation.  *Shudder*.

====================
Possible Solution #2
====================

SIGPIPE is actually an exception, not an asynchronous interrupt.
At least, I can find no situation in which a SIGPIPE will come
floating in out of the blue.  Hence it should be safe to ignore
SIGPIPE temporarily while the parent writes to flowcontrol_pipe;
the write then simply fails with EPIPE.

The patch below (ugly; highly nonportable; for reference only;
some restrictions may apply; see store for details) is one possible
way to implement this.  In over ten thousand runs, I was unable
to get a cvs server with this patch to hang.


=======
Summary
=======

To play it safe, we're running with SERVER_FLOWCONTROL turned
off at our site (requires post-1.11.2 patch in order for server.c
to compile).

This message is intended as an informational heads-up to someone
who may be in a better position to analyze and fix the race
condition.  If there is any further information I can provide,
please let me know.

Thanks for your time,
^E
-- 
Ed Santiago                 Toolsmith                 address@hidden


Index: server.c
===================================================================
RCS file: /home/cvsroot/tools/cvs/src/server.c,v
retrieving revision 1.14
diff -u -r1.14 server.c
--- server.c    2002/09/14 21:35:55     1.14
+++ server.c    2002/09/23 20:49:04
@@ -2927,13 +2927,21 @@
            bufmemsize = buf_count_mem (buf_to_net);
            if (!have_flowcontrolled && (bufmemsize > SERVER_HI_WATER))
            {
+               void *preserved_sigpipe_handler;
+
+               preserved_sigpipe_handler = (void*)signal(SIGPIPE, SIG_IGN);
                if (write(flowcontrol_pipe[1], "S", 1) == 1)
                    have_flowcontrolled = 1;
+               (void)signal(SIGPIPE, preserved_sigpipe_handler);
            }
            else if (have_flowcontrolled && (bufmemsize < SERVER_LO_WATER))
            {
+               void *preserved_sigpipe_handler;
+
+               preserved_sigpipe_handler = (void*)signal(SIGPIPE, SIG_IGN);
                if (write(flowcontrol_pipe[1], "G", 1) == 1)
                    have_flowcontrolled = 0;
+               (void)signal(SIGPIPE, preserved_sigpipe_handler);
            }
 #endif /* SERVER_FLOWCONTROL */
 
