
[Wesnoth-dev] threading issues and Mac OS X


From: ott
Subject: [Wesnoth-dev] threading issues and Mac OS X
Date: Thu, 1 Sep 2005 22:55:30 +0200
User-agent: Mutt/1.5.10i

Since the IRC channel seems to have become less effective as a way
to discuss dev issues due to timezone disconnects and busy schedules,
here is my summary of the "hang on connecting to server" issue.

The original bug, filed for Debian unstable and BfW 0.8.11 in April 2005
    http://savannah.nongnu.org/bugs/?func=detailitem&item_id=12614
with title "wesnoth may lock when connecting to server" was apparently
fixed by Sirp in June, so I closed it.  It might be worth reopening this,
although it might be a completely different bug.

According to Ivanovic, Mac OS X has been having problems connecting to
the MP server for a while.  Apparently, around 10% of connection attempts
would hang on the "Connecting to Server..." dialog, a figure that has
crept up to about 30% recently.  I did some testing with 0.9.4 and later
builds, and can confirm that the 30% estimate is quite accurate.
Note that this also applies to connecting to the campaign server.
Forcemaster changed the networking code during the last month so that
at least cancelling out of hung connections is now possible, although
some of the changes may have made the connection failure rate even worse.
When I see a "hung connection" it will stay hung until I quit out of it;
I have left it for up to 3h and can confirm that netstat shows no open
sockets at this stage while the code is still waiting.

Doing some debugging with gdb, it seems that the connection hang is
always happening in the same place, and the most plausible suspicion we
have (or rather, that forcemaster has advanced) is that one thread is
waiting for an event that has already happened.  With 0.9.4+cvs (see the
IRC log for 07 Aug 2005), placing a breakpoint just before the call to
notify_finished() in network.cpp made my connection problems go away.
More recently, the latest round of network code changes seems to have
caused the hang-on-connect rate to approach 100%; sanna reported 1
success in 20 tries.  Sanna later also reported
    one failure to connect, out of 10 tries...  [...]  all I personally
    changed was that I added a LOG_G logstream to thread.cpp, and in
    async_operation::execute, I added LOG_G << "WAIT_TIMEOUT_RESULT:
    " << res << "\n"; in the while(wait.process() == waiter::WAIT) loop
which sounds quite similar to the add-a-breakpoint "technique" of
resolving the deadlock (if this is indeed a deadlock).
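
For illustration only: the suspected failure mode, one thread waiting for
an event that has already happened, is often called a lost wakeup.  The
following is a minimal sketch in C with plain pthreads, not code taken
from network.cpp or thread.cpp; the names completion, worker,
wait_finished and run_once are hypothetical.  It shows the usual
predicate-guarded pattern, in which the waiter checks a flag under the
lock, so a notification sent before the wait begins is not missed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical state shared by a worker and the thread waiting on it. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             finished;   /* predicate, only touched under the lock */
};

/* Worker: records completion first, then signals. */
static void *worker(void *arg)
{
    struct completion *c = arg;

    pthread_mutex_lock(&c->lock);
    c->finished = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Waiter: loops on the predicate, so an early signal is not lost. */
static void wait_finished(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->finished)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Lets the worker finish BEFORE waiting; returns 1 if the wait returned. */
int run_once(void)
{
    struct completion c;
    pthread_t t;

    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.cond, NULL);
    c.finished = 0;

    if (pthread_create(&t, NULL, worker, &c) != 0)
        return 0;
    pthread_join(t, NULL);   /* the event has already happened by now */
    wait_finished(&c);       /* returns at once because the flag is set */

    pthread_cond_destroy(&c.cond);
    pthread_mutex_destroy(&c.lock);
    return 1;
}
```

Without the finished flag, calling pthread_cond_wait() after the signal
has been sent would block forever, which is the kind of hang that a
breakpoint or an extra log line could hide by changing the timing.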

Forcemaster tried to reduce the mutual exclusion part of the code to a
simple C program; this runs without a hitch on my machine (I ran it 5
million times without any problems).

I looked at the SDL code to see if it does some Mac-specific things, and
found that it does.  It forces the use of a bunch of very dubious-looking
semaphore code, based on what seems to have been a broken Mac OS X beta
released in late 2000.  I hacked around these "fixes" as per this patch,
and have seen the hang only once in 50 connection attempts:

diff -ur SDL-1.2.8/configure.in SDL-1.2.8.mod/configure.in
--- SDL-1.2.8/configure.in      Mon Dec 13 11:02:08 2004
+++ SDL-1.2.8.mod/configure.in  Mon Aug 15 11:39:22 2005
@@ -1337,7 +1337,7 @@
             # Some systems have broken recursive mutex implementations
             case "$target" in
                 *-*-darwin*)
-                    has_recursive_mutexes=no
+                    has_recursive_mutexes=yes
                     ;;
                 *-*-solaris*)
                     has_recursive_mutexes=no
diff -ur SDL-1.2.8/src/thread/linux/SDL_syssem.c SDL-1.2.8.mod/src/thread/linux/SDL_syssem.c
--- SDL-1.2.8/src/thread/linux/SDL_syssem.c     Wed Feb 18 19:22:03 2004
+++ SDL-1.2.8.mod/src/thread/linux/SDL_syssem.c Mon Aug 15 11:47:55 2005
@@ -62,7 +62,8 @@
 #ifdef MACOSX
 #define USE_NAMED_SEMAPHORES
 /* Broken sem_getvalue() in MacOS X Public Beta */
-#define BROKEN_SEMGETVALUE
+/* don't break Mac OS X just to support some 4 year old beta! */
+/* #define BROKEN_SEMGETVALUE */
 #endif /* MACOSX */
 
 struct SDL_semaphore {
diff -ur SDL-1.2.8/src/thread/linux/SDL_systhread.c SDL-1.2.8.mod/src/thread/linux/SDL_systhread.c
--- SDL-1.2.8/src/thread/linux/SDL_systhread.c  Wed Feb 18 19:22:03 2004
+++ SDL-1.2.8.mod/src/thread/linux/SDL_systhread.c      Mon Aug 15 11:49:04 2005
@@ -61,13 +61,11 @@
 
 #include <signal.h>
 
-#if !defined(MACOSX) /* pthread_sigmask seems to be missing on MacOS X? */
 /* List of signals to mask in the subthreads */
 static int sig_list[] = {
        SIGHUP, SIGINT, SIGQUIT, SIGPIPE, SIGALRM, SIGTERM, SIGCHLD, SIGWINCH,
        SIGVTALRM, SIGPROF, 0
 };
-#endif /* !MACOSX */
 
 #ifdef SDL_USE_PTHREADS
 
@@ -102,7 +100,6 @@
 
 void SDL_SYS_SetupThread(void)
 {
-#if !defined(MACOSX) /* pthread_sigmask seems to be missing on MacOS X? */
        int i;
        sigset_t mask;
 
@@ -112,7 +109,6 @@
                sigaddset(&mask, sig_list[i]);
        }
        pthread_sigmask(SIG_BLOCK, &mask, 0);
-#endif /* !MACOSX */
 
 #ifdef PTHREAD_CANCEL_ASYNCHRONOUS
        /* Allow ourselves to be asynchronously cancelled */

The SDL code has a bunch of hardcoded hacks.  These are identical in
1.2.9, by the way.

First, it forces any Mac to use the PTHREAD_NO_RECURSIVE_MUTEX kludge
(see the first snippet of the patch), which seems unnecessary on my
Mac OS X 10.3.9 system.  It is possible that there was a broken early
Mac OS X system for which this was necessary, but surely the test case
demonstrating the broken behaviour could have been put into configure.in
instead of hardcoding this.
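
As a rough idea of what such a test case might look like, here is a
minimal sketch in C.  It is not SDL's actual configure check, and the
function name recursive_mutex_works is hypothetical.  It uses only the
standard POSIX calls, and it takes the second lock with
pthread_mutex_trylock() so that a broken implementation makes the probe
fail rather than hang.

```c
#define _XOPEN_SOURCE 700   /* request PTHREAD_MUTEX_RECURSIVE from the headers */
#include <pthread.h>

/* Returns 1 if one thread can take a recursive mutex twice, else 0. */
int recursive_mutex_works(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int ok = 0;

    if (pthread_mutexattr_init(&attr) != 0)
        return 0;

    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE) == 0 &&
        pthread_mutex_init(&mutex, &attr) == 0) {
        if (pthread_mutex_lock(&mutex) == 0) {
            /* Second acquisition by the same thread; trylock cannot block. */
            if (pthread_mutex_trylock(&mutex) == 0) {
                ok = 1;
                pthread_mutex_unlock(&mutex);
            }
            pthread_mutex_unlock(&mutex);
        }
        pthread_mutex_destroy(&mutex);
    }

    pthread_mutexattr_destroy(&attr);
    return ok;
}
```

A configure run test could wrap this in a main() that exits with status
0 when the function returns 1, and set has_recursive_mutexes from the
result instead of from the target triplet.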

Second, MACOSX is defined for any system for which config.guess returns
*-*-darwin* (which is the case for any Mac OS X system).  This is
used in several locations in the code for conditional compilation.
Most problematic of the conditional sections is the forced definition
of BROKEN_SEMGETVALUE in thread/linux/SDL_syssem.c (yes, this is the
version that is chosen on Mac OS X, since this system uses pthreads).
This substitutes a quick-and-dirty non-atomic hack instead of using
the standard semaphore code.  Note that I also tried without the
USE_NAMED_SEMAPHORES flag, but the resulting code deadlocked every time,
so I left that in.  Finally, MACOSX forces the use of some workarounds
based on the premise that pthread_sigmask() is missing on Mac OS X,
which is certainly not the case for 10.3.9.
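
The pthread_sigmask() premise can also be probed at build time rather
than assumed.  The following is a minimal sketch in C, not part of SDL;
the names probe and sigmask_works_in_subthread are hypothetical.  It
blocks SIGPIPE inside a subthread, much as SDL_SYS_SetupThread() does for
its signal list, and then reads the mask back to confirm that the call
took effect.

```c
#define _XOPEN_SOURCE 700   /* request the POSIX signal and thread APIs */
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Thread body: block SIGPIPE, then read the mask back into *arg. */
static void *probe(void *arg)
{
    int *blocked = arg;
    sigset_t mask, current;

    sigemptyset(&mask);
    sigaddset(&mask, SIGPIPE);
    if (pthread_sigmask(SIG_BLOCK, &mask, NULL) != 0)
        return NULL;

    /* A NULL new set leaves the mask unchanged and only queries it. */
    sigemptyset(&current);
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) != 0)
        return NULL;

    *blocked = (sigismember(&current, SIGPIPE) == 1);
    return NULL;
}

/* Returns 1 if a subthread could block SIGPIPE for itself, else 0. */
int sigmask_works_in_subthread(void)
{
    pthread_t t;
    int blocked = 0;

    if (pthread_create(&t, NULL, probe, &blocked) != 0)
        return 0;
    pthread_join(t, NULL);
    return blocked;
}
```

If the function is missing altogether, the probe fails to link, which is
itself the answer a configure script would need.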

That's enough of a rant about unnecessary SDL brokenness on Mac OS X.

The original bug #12614 was filed against Debian, and the forum is
replete with complaints about 0.9.6 on Windows (campaign server issues,
but also some comments that sound very similar to the hang-on-connect
problem for MP).  This leads me to believe that either:

1. bug #12614 and its offspring are still around, and represent long
standing problems on all platforms, with some platforms being more
susceptible due to e.g. ordering of code, optimization quirks, or subtle
threading implementation issues -- the issues reported in the forums
further show that the problem is widespread and not platform specific,

OR

2. bug #12614 is unrelated (and hopefully fixed), and Mac OS X is having
some threading issues, possibly related to a broken implementation
of semaphores in SDL for Mac OS X, while the forum issues reported on
Windows and Linux are a different problem altogether.

I would have been all over the SDL people about this if I hadn't seen
that one hang in my testing.  Even one out of 50 is a problem, if this
is a threading issue.  My best guess right now is that some slight
reordering of some of the critical events or changes to the timing of
various actions has exposed a potential deadlock, which is made much
worse on Mac OS X due to some of the kludges which SDL imposes for this
platform.  My memory of dealing with operating system deadlocks is quite
ancient and doesn't tie in well with how we use threading in our code,
so I now leave this issue in the capable hands of someone who actually
knows what they are doing.  I hope this summary is useful, in any case.

I have joined the SDL mailing list and will attempt to help to improve
SDL threading on Mac OS X in the next release.

-- address@hidden



