dotgnu-general
RE: [DotGNU]Hanging problem


From: Thong (Tum) Nguyen
Subject: RE: [DotGNU]Hanging problem
Date: Fri, 7 May 2004 15:10:50 +1200

Hi Russell,

If you throw a test case my way I can give debugging it a go.  This isn't
the first time weird things like that have happened.  There were some bugs
in wakeup.c that sometimes caused threads to miss condition variable
signals.
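
A minimal sketch of how a missed signal of this kind is usually avoided,
assuming POSIX threads.  The wakeup_t type and the function names are
hypothetical and are not taken from wakeup.c.  The point is that the signal
is recorded in a flag under the mutex, so a signal sent before the waiter
blocks is still seen:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wakeup object; names are illustrative, not from wakeup.c. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signalled;  /* records a signal that is still pending */
} wakeup_t;

void wakeup_init(wakeup_t *w)
{
    assert(w != NULL);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->signalled = false;
}

void wakeup_signal(wakeup_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->signalled = true;              /* remember the signal ...          */
    pthread_cond_signal(&w->cond);    /* ... and wake a waiter if present */
    pthread_mutex_unlock(&w->lock);
}

void wakeup_wait(wakeup_t *w)
{
    pthread_mutex_lock(&w->lock);
    /* Re-check the flag in a loop: this copes with spurious wakeups and
       with a signal that arrived before this thread started waiting. */
    while (!w->signalled)
        pthread_cond_wait(&w->cond, &w->lock);
    w->signalled = false;             /* consume the signal */
    pthread_mutex_unlock(&w->lock);
}
```

Calling wakeup_signal() before wakeup_wait() makes the wait return at once
instead of blocking forever, which is the case a bare pthread_cond_wait()
without a flag gets wrong.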

BTW, my monitors are now faster than yours again ;-).  I've tidied up a lot
of code and simplified the algorithm tremendously (having GC-allocated
monitors is a godsend -- should have listened to you earlier ;)).  Even
with a global lock on the CAX call, the algorithm is faster.  With
assembly-optimised CAX, the algorithm is obviously even faster.
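
A rough sketch of what a CAX guarded by a global lock could look like,
assuming CAX means compare-and-exchange and assuming POSIX threads.  The
name cax_locked and its signature are hypothetical, not the actual pnet
code:

```c
#include <assert.h>
#include <pthread.h>

/* One process-wide lock serialises every compare-and-exchange. */
static pthread_mutex_t cax_lock = PTHREAD_MUTEX_INITIALIZER;

/* If *dest equals comparand, store value in *dest.
   Always returns the value *dest held before the call. */
void *cax_locked(void **dest, void *value, void *comparand)
{
    void *old;

    assert(dest != NULL);
    pthread_mutex_lock(&cax_lock);
    old = *dest;
    if (old == comparand)
        *dest = value;
    pthread_mutex_unlock(&cax_lock);
    return old;
}
```

The caller knows the exchange succeeded when the returned value equals the
comparand.  An assembly version would replace the lock with a single atomic
instruction but keep the same contract.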

I think my algorithm is even simpler than yours now.  I'll have something
for you to look at soon.  There are so many other changes (like to the
interrupt/abort code) that it'll take me a bit longer to put it all together
and document it.

All the best,

^Tum

> -----Original Message-----
> From: Russell Stuart [mailto:address@hidden]
> Sent: Friday, 7 May 2004 1:53 p.m.
> To: Thong (Tum) Nguyen
> Cc: DotGnu-Develop
> Subject: [DotGNU]Hanging problem
> 
> This hanging problem is still with me.  I still have no idea what causes
> it.  I have spent the last few hours in gdb. I am telling what I have
> found in the hope you can spot something I haven't.
> 
> 1.  The program has 6 threads - at least that is what I can see in
>     gdb.  I can only guess at what they are from looking at the gdb
>     backtraces - it is a bit hard to tell as I haven't figured out
>     how to get a PNet C# backtrace from gdb.
> 
>       a.  The pthread manager thread.  I don't know what it does,
>           but I presume it does not figure in this problem.
> 
>       b.  The PNet GC thread.  Ditto.
> 
>       c.  The System.Threading.Timer thread.  I re-wrote this class
>           when I found it had a lot of bugs.  The patch is currently
>           in the savannah patch manager as I haven't got around to
>           writing tests for it, so you can look at it if you want to.
>           It is sitting in a Monitor.Wait(Object, int), as it should
>           be.  I.e., it holds no locks.  The reason I am fairly sure
>           this is the timer thread is that Timer.cs is the only place
>           that does a Monitor.Wait(), AFAICT.
> 
>       d.  A thread sitting in a WaitHandle.WaitOne().  There is only one
>           possibility, as there is only one place that does this sort of
>           call - a background thread of mine that sends packets. Its
>           code looks roughly like this:
>             for (;;) {
>               autoResetEvent.WaitOne();
>               for (;;) {
>                 lock (this) packet = getPacketOffQueue();
>                 if (packet == null) break;
>                 socket.send(packet);
>               }
>             }
>           So it holds no locks.
> 
>       e.  A thread blocked on a socket read.  This is in my code.  It
>           is a background thread that roughly does this:
>             for (;;) {
>                lock (this) check for exit;
>                socket.receive_from(packet, ...);
>                lock (this) processPacket = this.processPacketDelegate;
>                if (processPacket != null) processPacket(packet);
>             }
>           So it also holds no locks.
> 
>       f.  Finally, we come to the thread that is hung.  It is the main
>           thread, actually.  It is sitting in a Monitor.Enter(),
>           blocked.  Given that none of the other threads are holding a
>           lock this is wrong, obviously.
> 
> 2.  The question that does spring to mind is how can I be sure no
>     other thread holds a lock on a monitor.  Well, nowhere in my code
>     do I use anything other than "lock (..) ...".  Nowhere do I call
>     Thread.Interrupt() or Thread.Abort().  In other words, there is
>     nowhere that a Monitor.Enter() can happen without a matching
>     Monitor.Exit().
> 
> 3.  It now reliably fails on every machine I run it on.  Single CPU.
>     Multi CPU.  Hyper-threaded.  Various kernels.  RH 7.2 and 8.0.
> 
> 4.  In trying to figure out why the Monitor.Enter has blocked, I tried
>     a few things.  Firstly, I altered ilrun to throw an exception when
>     it blocked, thus giving me a C# back trace.  I now know that the
>     thread holds no other locks.
> 
>     Secondly, with gdb I looked at the internal ilrun structures.  This
>     is what I found:
> 
>       - My monitor's enterCount was 2.  It can only be 2 if there is
>         another unmatched call to Monitor.Enter().  There aren't any,
>         as I have shown.
> 
>       - The monitor->waitHandle->parent.owner is not 0, which would
>         have to be the case since ILWaitMonitorTryEnter is blocking.
>         The owning thread is thread (e) above.  This makes some small
>         degree of sense as thread (e) would grab the monitor in
>         question from time to time as packets are processed.
> 
> So what I have now is two independent sources (my enterCount and
> your "owner" field) telling me the monitor is currently locked.
> Surely this must mean that Monitor.Exit() was not called, or if
> it was called it didn't work.  One argument against the "didn't
> work" theory is that I have two different implementations of
> Monitor.Exit() written by two programmers - you and me.  And it
> fails with both of them.
> 
> However, I put a call to the Unix "abort()" function on every
> possible route through _IL_Monitor_Exit that did not unlock the
> monitor.  It was never hit.  Ergo I can only conclude that every
> call to Monitor.Exit() successfully decremented enterCount and
> unlocked the underlying mutex.
> 
> So then I decided that perhaps an exception was being thrown
> while this object was locked, and somehow the Monitor.Exit()
> wasn't being executed.  So, I added a "locked object count"
> to each thread (the ILExecThread structure, actually).  When
> an object successfully called _IL_Monitor_InternalTryEnter it
> was incremented, and when it successfully called
> _IL_Monitor_Exit it was decremented.  So it was only 0 when
> no locks were held.  Then I altered engine/throw.c to contain
> this code:
>   void ILExecThreadSetException(ILExecThread *thread, ILObject *obj)
>   {
>     if (thread->lockCount != 0)  // @@@
>       abort(); // @@@
>     thread->thrownException = obj;
>   }
> 
> The abort() call was never hit.  Ergo, an exception was never thrown
> while a monitor lock was held, so an exception could not be the
> cause of the problem.
> 
> I am now at a total loss.  I have no idea how what I am seeing could
> be possible, and can see no way forward.  Any ideas?