From: Roland McGrath
Subject: Re: Reboots?
Date: Sun, 1 Apr 2001 03:46:03 -0400 (EDT)

> I have reproduced exactly the crash Jeff reported. I have collected the data.
> I used a ring buffer of 16 entries (can increase if needed), and the full

I gather from your data that what you mean is a buffer of the last 16
messages handled by proc's demuxer?  In the general case one would
want to track the reply messages too, and see if there is interleaving of
the RPCs, i.e. a second RPC beginning processing before another has
finished.  But in this case we know that this is the sequence of calls done
by fork, which does them all serially.

Where did you put your code to write into your buffer?  You want it to be
the very first thing in libports's internal_demuxer.  If you put it in
proc's demuxer, then ports_lookup_port and so forth happen before you make the
record--so we would miss the final message if it's in the libports code
where it crashes.

> gdb log is attached. Here are the three ports on which RPCs were logged
> immediately before the crash (in interleaved order, see left column). 

This sequence of calls is clearly fork.  Are you sure you have the global
ordering of those messages right?  Even though different server threads may
handle each request, all these calls should be serialized in the caller
(fork) so that proc doesn't get each RPC until it's replied to the last.
The sequence of RPCs from fork should look like:

dostop -> parent proc
proc_task2proc -> parent proc, arg of child task
{ either order of:
 proc_setmsgport -> child proc
 task2proc -> parent proc, arg of child // just noticed we can get rid of this
}
proc_get_arg_locations -> parent proc
proc_set_arg_locations -> child proc
proc_task2pid -> parent proc, arg of child task
proc_child -> parent proc, arg of child task
-- child task starts running, now it is the sender of these msgs: 
proc_getpids -> child proc
proc_handle_exceptions -> child proc

After the child starts running, the parent side of fork makes no more RPCs
to proc, though it might return in parallel with the child running and the
program might do something else that makes a proc RPC.

Oh, I see.  This ordering is perfectly right for a fork where the child
then does another fork.  If these are the only proc calls going on here
then I don't know what funny interaction there could be.

>   port 218:
> real-
> order bits            size    seqno   id
> 1.    2147488018      32      1246    24021 dostop
> 2.                            1247    24031 task2proc
> 3.                            1248    24031
> 5.                            1249    24018 get_arg_locations
> 7.                            1250    24030 task2pid
> 8.                            1251    24012 child
>   port 229:
> order bits            size    seqno   id
> 4.    2147488018      32      0       24013 setmsgport
> 6.    4370            40      1       24017 set_arg_locations
> 9.                    24      2       24016 getpids
> 10.   2147488018      120     3       24022 handle_exceptions
> 11.                   32      4       24021 dostop
> 12.                           5       24031 task2proc
> 13.                           6       24031
> 15.   4370            24      7       24018 get_arg_locations
>   port 279:
> order bits            size    seqno   id
> 14.   2147488018      32      0       24013 setmsgport
> 16.   4370            40      1       24017 set_arg_locations
>  *** crash ***
> Of course, one data point is not very much. I can run this a few more times,
> and we can see if a pattern emerges. We can insert assertions etc.
> We can probably log whole messages.

Just the headers ought to be enough to understand what's going on.
I don't know what assertions to suggest inserting.  

> Can we run proc single threaded, so that we know where exactly it crashed?

We can't make proc single-threaded, because wait works using condition
variables.  However, multithreading is not what is preventing us from
seeing the crash location.  It is because the thread jumps off into
nowhere, and/or clobbers its stack, that we have trouble figuring out
where it went bad.

proc is essentially totally serialized by its global_lock.  I think the
problem is probably a stack clobberation.  Just the right kind of
corruption of the stack during an RPC server function could cause that RPC
to complete fine and send its reply, but leave a little time bomb that will
cause this thread to crash whenever it happens to be the one to dequeue a
message from the portset.  In such a scenario, it could have been a totally
unrelated RPC much earlier that left a thread waiting to crash, and just
this flurry of RPCs happened to run through other request threads so that
the one with the corrupted stack came up as ready for the next portset msg.
