gnugo-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [gnugo-devel] Crash found in 3.4


From: Gunnar Farneback
Subject: Re: [gnugo-devel] Crash found in 3.4
Date: Tue, 23 Dec 2003 12:44:11 +0100
User-agent: EMH/1.14.1 SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.3 Emacs/20.7 (sparc-sun-solaris2.7) (with unibyte mode)

I wrote:
> I've tracked the bug down. More details tomorrow. For now I'll just
> say that the best fix is to rewrite the owl stack management to be
> less fragile.

At the bottom of the problem is the fact that the local owl data are
so large that we don't want to allocate more of it than necessary and
that we can't be sure beforehand how deep stack will be needed during
the recursive owl reading. It's also too large for handling on the C
stack on some platforms, which was what we once did. Thus we need to
dynamically allocate memory for it.

The next level of the problem is that the push_owl() implementation is
written so that in case the owl stack needs to be enlarged, it's done
through realloc(). This is reasonable per se, but has the unfortunate
side effect that existing owl data pointers may be invalidated if the
realloc() call is forced to move the data in memory. New valid
pointers are then obtained when calling pop_owl(). Also the push_owl()
call modifies the current pointer if needed, but other copies of that
pointer may become invalid.

This is where things become fragile since C doesn't have any support
mechanisms for invalidated pointers and we have to be very careful
with our code. What we missed is that do_owl_attack() and
do_owl_defend() are written in a way which such that we can't trust
the owl pointer sent to them after returning. Usually this is fine
because in the main usage pop_owl() will be called shortly afterwards
and we once more get a good pointer. However, near line 1594 of owl.c
in 3.4 (I'm off a few lines after having added debug output) there's a
call to do_owl_defend() after which we continue to use the now
possibly invalid owl pointer. This is definitely what caused the
assertion failure when run on Linux under valgrind. I'm not sure it
matches the Windows backtrace sent by Shimmie but it's fairly clear
that the moving around of the owl stack is the problem also in that
case.

There are several plausible solutions to this problem, including:

1. Rewriting do_owl_attack() and do_owl_defend() to take a pointer to
   the owl pointer.
2. Allocate all memory we may ever need for the owl stack at once.
3. Rewrite push_owl()/pop_owl() not to move memory around.
4. Redesign the owl data structures to be smaller or be incrementally
   updated. (Possibly combined with 2.)

The simplest solution is 2, but I don't consider it a reasonable waste
of memory. 1 is also reasonably easy but it doesn't solve the real
problem and we will almost certainly get into trouble with it again
later. The best solution is probably 4 but also the hardest one.

For now I'll propose solution 3, along the following lines:

* Set up a static array of owl data pointers, of a size which is
guaranteed to be sufficient. Initialize all pointers to NULL.
* Allocate each stack entry with malloc() when needed and leave them
around.

This way we don't have to move existing data around and don't waste
memory (more than at most a few kB for the static array).

/Gunnar




reply via email to

[Prev in Thread] Current Thread [Next in Thread]