guile-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help needed debugging segfault with Guile 1.8.7


From: Peter Brett
Subject: Re: Help needed debugging segfault with Guile 1.8.7
Date: Thu, 11 Nov 2010 14:22:23 +0000
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

Thien-Thi Nguyen <address@hidden> writes:

> () Peter Brett <address@hidden>
> () Thu, 11 Nov 2010 10:52:41 +0000
>
>>    stupid logic error in some weak ref management code
>
> Could you please describe this error?
>

Sure.  libgeda uses direct management of memory, and the structures used
in its document object model need to be explicitly deleted when finished
with.  I decided to use a Guile smob to represent these structures for
access from Scheme code, with the pointer to the actual structure in
SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating which
type of DOM structure the smob references.

This would have been sufficient if Scheme code had only been working
with libgeda DOMs created and managed entirely via Scheme code.
However, here Guile is being used simply to provide extensibility to
electronics engineering applications based on libgeda, such as gschem.
It would theoretically be possible for the following sequence of events
to occur:

 1. In a Scheme function called from the schematic editor, a transistor
    is instantiated, added to the current page, and also stashed
    somewhere in the Guile environment.

 2. A bit later, the user closes the page.  It is destroyed from C code,
    and so is the transistor instance.

 3. Finally, a Scheme function is called that unstashes the transistor
    instance, and tries to use it, leading to a segfault.

There were two main design considerations taken into account when
looking for a solution to this problem.  Firstly, I wanted it to be
impossible to make libgeda leak memory from Scheme code, so that doing
something like

   (do ((i 1 (1+ i)) ((> i 1000000)))
       (make-transistor))

would be safe.  That meant that it had to be possible to destroy DOM
structures from the smob_free() function.  On the other hand, I wanted
to find a solution that avoided adding explicit Guile dependencies to
the core of libgeda (since I hope to split off the Scheme binding into a
separate library at some point).

The solution was to add weak reference facilities to the libgeda DOM
data structures.  A weak reference is added by calling a function
similar to:

  object_weak_ref (OBJECT *object, void (*notify_func)(void *, void *),
                   void *user_data);

The notify_func and user_data are prepended to a singly-linked list in
the OBJECT structure.  When the OBJECT is deleted via the C API, each
entry in the list is alerted by calling:

  notify_func (object, user_data)

The notification function I use for the weak references held by smobs
looks like this:

  static void
  smob_weakref_notify (void *target, void *smob) {
    SCM s = (SCM) smob;
    SCM_SET_SMOB_DATA (s, NULL);
  }

I've provided wrapper functions & macros for checking smob validity
(i.e. non-null smob data) before allowing any dereference of
e.g. (OBJECT *) SCM_SMOB_DATA (smob).

To allow garbage collection of DOM element smobs where possible, I use
bit 4 of the SCM_SMOB_FLAGS as a `GC allowed' flag.  Any API functions
that put a DOM element in a state where it may be destroyed from C code
rather than the smob_free() function are required to clear the flag (and
vice versa).

So, where was the bug?  When a smob is GC'd, and if the pointer it
contains hasn't already been cleared, it calls:

  object_weak_unref (SCM_SMOB_DATA (smob), smob_weakref_notify, smob);

Before I fixed the bug, object_weak_unref() contained code that looked
something like this:

  for (iter = weak_refs; iter != NULL; iter = list_next (iter)) {
    struct WeakRef *entry = iter->data;
    if ((entry->notify_func == notify_func) &&
        (entry->user_data != user_data)) { // ERROR: != should be ==
      free (entry);
      iter->data = NULL;
    }
  }
  weak_refs = list_remove_all (weak_refs, NULL);

Now, how does this result in Guile GC freelist corruption?  It requires
two smobs to be created for the same DOM structure S, let's say A and B.
(This can only occur if S is being managed from C code, so we know that
the `GC allowed' flag will be cleared).

            smob_addr      cell CAR         cell CDR
     A      0x1000         0x0              <ptr to S>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1008, 0x1000

Now, smob A is garbage collected.  Because we've told smob_free that
it's not allowed to destroy S, the smob_free() function calls
object_weak_unref().  Since the latter is broken, it clears the wrong
weak reference.  Now things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   <ptr to next free cell>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1000

Some time later, S is destroyed from C code, and this results in the
smob_weakref_notify() function described earlier being called thus:

   smob_weakref_notify (<pointer to S>, 0x1000)

After S is destroyed, things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   0x0 (OH NOES)
     B      0x1008         0x0              <ptr to S>

At this stage, the freelist has become corrupted, and will result in a
segfault in scm_cell() at some indeterminate future time.

With the fix in this commit:

  http://git.gpleda.org/?p=gaf.git;a=commit;h=41ea61b2f156

this memory corruption does not occur.

I hope that explained things reasonably precisely!

Regards,

                                 Peter

-- 
Peter Brett <address@hidden>
Remote Sensing Research Group
Surrey Space Centre




reply via email to

[Prev in Thread] Current Thread [Next in Thread]