Re: Reboots?

From: Roland McGrath
Subject: Re: Reboots?
Date: Sun, 1 Apr 2001 18:14:27 -0400 (EDT)

> Ok. I put it in proc's demuxer, so we miss anything that is processed
> earlier. I will move the code into libproc before I do another test (and
> increase the buffer a bit).

That will have a better chance of catching the crash location, but 
it still could easily be obscured if the corruption is sufficient.

> Yes, because the code is in the global lock. Thanks for the RPC fork
> sequence; it is useful.

It's also useful to think about the methodology by which I was able to
point to this quickly.  I maintain glimpse indexes of all my source trees.
When looking at this kind of trace, I can say "hmm, proc_set_arg_locations
is rarely used", and then an easy glimpse search shows me all the uses in
two seconds, so I can quickly see that fork is the only place this could be.
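For a tree without a glimpse index handy, the same kind of search can be approximated with grep. This is a hypothetical sketch; the file names and contents below are invented for illustration:

```shell
# Invented mini source tree, just to make the search demonstrable.
mkdir -p /tmp/tree/hurd && cd /tmp/tree
echo 'err = proc_set_arg_locations (p, loc1, loc2);' > hurd/fork.c
echo 'kern_return_t proc_set_arg_locations ();'      > hurd/proc.h

# Find every file mentioning a rarely-used symbol:
grep -rl proc_set_arg_locations .

# With a glimpse index (built once with: glimpseindex ~/src),
# the equivalent query is just:  glimpse proc_set_arg_locations
```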

> I thought of somehow verifying that the stack is ok. I don't have any
> concrete ideas, though.

In the general case there isn't really anything to do but unwind the stack
(i.e. have it return all the way up to the thread's root function) and see
that there is no clobberation.  In the case of request threads, the stack
is not very deep and should look the same in every thread.  So you could
probably come up with some kludges for this particular case to check the
stack for what it ought to contain.

> This sounds horrible.  Is one consequence to increase the log buffer, so we
> see what the thread that comes next has done last time?  Is the thread that
> would handle the next request predictable or at least log-able?  Can we
> assign a thread to watch out for stack clobberations?  I really have no idea
> how to attack such a nasty problem.  How can stack clobberation occur? 

The simplest way for stack corruption to happen is by writing off the end
of a local array.  e.g.:

void foo (const char *ptr) { char x[17]; memcpy (x, ptr, 25); }

This will clobber 8 bytes of stack above where the x array space was
pushed.  If those two words are e.g. the saved sp and pc values for this
frame, then the return from this function will set the sp to garbage and
then jump to garbage.

Obviously, it can be a lot more complicated than this.  But something bad
happening relative to a pointer to a local variable is a common paradigm.
It could also be storing through a wholly stray pointer (uninitialized
variable, etc) whose old value happens to be a pointer into the stack.

If you made your buffer really big, it might be that the trace of messages
would show us something particularly unusual that we could suspect.  But I
would not hold out a lot of hope that the problem will suddenly become
apparent just because we can see more of the past RPCs.

There isn't any reliable way to predict which thread will wake up to take
the next message from the portset.  The best thing I can think of is to
arrange that there only be one thread waiting in the portset at a time, and
that each thread completely unwind its stack and die after handling one
request.  Then it should crash immediately during that unwind if the
clobberation is of the sort I have been describing.  This will slow the
server down a lot.

ports_manage_port_operations_multithread creates a new request thread
whenever the last one is going into an RPC server work function.  So all
you should need to do is hack it so that each thread finishes after sending
its reply rather than going back into the receive loop.  Instead of
internal_demuxer incrementing NREQTHREADS and returning to
mach_msg_server_timeout, it can just send the reply message with a
send-only mach_msg call of its own and then bail out.  Probably what you
want to do is make a special version of mach_msg_server_timeout (from
libc/mach) and use that in manage-multithread.c:thread_function where the
call is now; modify it to just do a single receive-demux-reply and then
finish.  You might want to stick that function at the end of the file to
make sure it doesn't get inlined into thread_function, since that might
substantially change the stack layout and alter how the whole crash
scenario behaves.
