From: Evan Berggren Daniel
Subject: Re: [gnugo-devel] engine/influence.c (and DFA)
Date: Thu, 5 Sep 2002 21:53:59 -0400 (EDT)

On Thu, 5 Sep 2002, Arend Bayer wrote:

>
> On 5 Sep 2002, Dave Denholm wrote:
>
> > Gunnar Farneback <address@hidden> writes:
> >
> > > Dave wrote:
> > > >    I was just having a look at influence.c, since that is where the
> > > > profile shows the most time being spent.
> > >
> > > On what platform and what exactly did you run? I just did a profile on
> > > an AMD Athlon running neurogo.tst and then the influence code doesn't
> > > take much time. There the pattern matching is in the top.
> > >
> > >   %   cumulative   self              self     total
> > >  time   seconds   seconds    calls   s/call   s/call  name
> > >  16.05     37.36    37.36 142878688     0.00     0.00  scan_for_patterns
> > >   6.27     51.96    14.60 104763600     0.00     0.00  check_pattern_light
> > >   5.08     63.79    11.83   179092     0.00     0.00  compute_primary_domains
> > >   3.85     72.76     8.97 64742873     0.00     0.00  fastlib
> > >   3.05     79.87     7.11 30039637     0.00     0.00  do_play_move
> > >   2.94     86.72     6.85    97175     0.00     0.00  do_push_owl
> (...)
> > >   1.46    128.20     3.39   157757     0.00     0.00  accumulate_influence
> > >   [...]
> > >
> > > What we see here is probably that the neurogo test suite is rather owl
> > > intensive. For more representative profiling, one should probably
> > > replay a couple of entire games against varied opponents (self play
> > > just doesn't suffice). The most straightforward way to do that would
> > > be to have a GTP file simulating replay of a selection of games.
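
(A file of plain GTP commands would do the trick -- something along
these lines, fed to "gnugo --mode gtp", much as the regression suite
does; the SGF path and move numbers below are only illustrative:

  # replay a stored game, regenerating the move at each position
  loadsgf games/somegame.sgf 20
  reg_genmove white
  loadsgf games/somegame.sgf 21
  reg_genmove black

and so on through a selection of games.)
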
>
> Also we see that memory is indeed becoming more of a bottleneck (no
> surprise, of course). This is also an AMD CPU, but a K6-2 at 400 MHz
> instead of Gunnar's Athlon (1.6 GHz?):
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  10.88    111.74   111.74 136937536     0.00     0.00  scan_for_patterns
>   5.53    168.53    56.79 100769412     0.00     0.00  check_pattern_light
>   4.65    216.32    47.79   174590     0.00     0.00  compute_primary_domains
>   3.44    251.63    35.31    94773     0.00     0.00  do_push_owl
>   3.18    284.33    32.70 18600234     0.00     0.00  order_moves
>   3.17    316.87    32.54 63926889     0.00     0.00  fastlib
>   2.47    342.19    25.32 29608725     0.00     0.00  do_play_move
>   2.28    365.56    23.37  2783429     0.00     0.00  search_persistent_reading_cache
>   2.23    388.43    22.87   156910     0.00     0.00  accumulate_influence
>   2.18    410.77    22.34 17404893     0.00     0.00  assimilate_string
> (...)
>
> That even makes me wonder whether the DFA is still a big speedup at
> all on CPUs as fast as Gunnar's.
>
> > But one new thought... do_push_owl() just pushes state?  Looking at
> > the structure, there is a lot of state to remember.
> >
> > The other owl functions don't appear anywhere significant in the profile.
> Well, compute_primary_domains is an owl function.
>
> > So what I'm thinking is : maybe the balance between saving state
> > and recalculating state is too far towards saving state. Maybe
> > if we saved less state, or spent more effort saving only what we need,
> > or made the owl stuff work slightly harder to get hold of its data,
> > then we'd get an overall speedup by making do_push_owl faster ?
>
> I recently realized that pushing the eye state (which is the biggest
> part of sizeof(local_owl_data)) is unnecessary, as it is completely
> recalculated from scratch anyway. I just haven't gotten around to
> doing anything about it yet.
>
> However, I wouldn't be sure that this would be a big speedup. Note that
> we cannot overwrite the current state (we still need the information
> for selecting the next move to try when we return from deeper recursion
> levels). A side effect of do_push_owl might be that the memory segment
> to which we copy the data is now in the CPU cache, where we need it
> anyway.
>
> (Sometime around 3.1.2x I killed pop_owl, which unnecessarily did the
> same amount of copying as do_push_owl is doing now. It turned out to be
> only a <1% speedup, although pop_owl had some 3-4% CPU time before.)

Sounds like what we have is a memory allocation issue -- i.e., most of
the time spent in that function is really spent allocating memory.
Therefore, if we know at runtime how many owl_nodes we could possibly
have, would it make sense to allocate that many initially and keep
track of them ourselves, instead of using malloc() and free()?  Just a
thought; feel free to ignore it if the fact that I haven't looked at
the code means I have no clue ;)
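
To make that concrete, here is roughly the sort of thing I mean -- a
fixed pool handed out in LIFO order as the search recurses.  (owl_node,
its contents, and MAX_OWL_NODES are all invented for illustration; I
don't know what the real data structures look like.)

/* Sketch only: preallocate the maximum number of owl nodes up front
 * and hand them out stack-fashion, instead of calling malloc() and
 * free() inside the search. */
#define MAX_OWL_NODES 1000      /* assumed bound, known at startup */

struct owl_node {
  int dummy;    /* stand-in for whatever state a node really holds */
};

static struct owl_node node_pool[MAX_OWL_NODES];
static int nodes_in_use = 0;

static struct owl_node *
alloc_owl_node(void)
{
  /* Caller guarantees nodes_in_use < MAX_OWL_NODES. */
  return &node_pool[nodes_in_use++];
}

static void
free_owl_node(void)
{
  /* Nodes are released strictly in LIFO order as the search unwinds,
   * so a single counter is all the bookkeeping we need. */
  nodes_in_use--;
}
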


> Instead, one could substantially reduce sizeof(local_owl_data) (and
> thus both speed up do_push_owl and gain overall cache benefits) by
> - maintaining eyes in a separate list, i.e. not keeping a copy of the
>   eye information at every push
> - further compressing the eye data (e.g. the 3-bit information of the
>   eye color doesn't need a whole "int", etc.)

Sounds good.  We might want to make things #define-able, because in some
cases I think it is faster to move around and do math on full-size values.
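
For example, something like this (struct and field names invented, not
the real eye_data layout):

/* Pack the per-eye data into a few bits by default, but let a #define
 * switch back to full ints on machines where bitfield extraction costs
 * more than the extra memory traffic. */
#ifdef OWL_UNPACKED_EYE_DATA
struct packed_eye {
  int color;                 /* full int: cheapest to load and store */
  int esize;
  int msize;
};
#else
struct packed_eye {
  unsigned int color : 3;    /* 3 bits are enough for the eye color */
  unsigned int esize : 6;
  unsigned int msize : 6;
};
#endif
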

Does this sound at all on-target?

Thanks

Evan Daniel




