gnugo-devel

Re: [gnugo-devel] engine/influence.c (and DFA)


From: Arend Bayer
Subject: Re: [gnugo-devel] engine/influence.c (and DFA)
Date: Thu, 5 Sep 2002 20:41:28 +0200 (CEST)

On 5 Sep 2002, Dave Denholm wrote:

> Gunnar Farneback <address@hidden> writes:
>
> > Dave wrote:
> > >    I was just having a look at influence.c, since that is where the
> > > profile shows the most time being spent.
> >
> > On what platform and what exactly did you run? I just did a profile on
> > an AMD Athlon running neurogo.tst and then the influence code doesn't
> > take much time. There the pattern matching is in the top.
> >
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls   s/call   s/call  name
> >  16.05     37.36    37.36 142878688     0.00     0.00  scan_for_patterns
> >   6.27     51.96    14.60 104763600     0.00     0.00  check_pattern_light
> >   5.08     63.79    11.83   179092     0.00     0.00  compute_primary_domains
> >   3.85     72.76     8.97 64742873     0.00     0.00  fastlib
> >   3.05     79.87     7.11 30039637     0.00     0.00  do_play_move
> >   2.94     86.72     6.85    97175     0.00     0.00  do_push_owl
(...)
> >   1.46    128.20     3.39   157757     0.00     0.00  accumulate_influence
> >   [...]
> >
> > What we see here is probably that the neurogo test suite is rather owl
> > intensive. For a more representative profiling one should probably
> > replay a couple of entire games against varied opponents (self play
> > just doesn't suffice). The most straightforward way to do that would
> > be to have a GTP file simulating replay of a selection of games.
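(As an aside, such a replay file could reuse the same GTP commands as GNU Go's regression suite, i.e. loadsgf to set up a position from a stored game and reg_genmove to ask the engine for its move without changing the board. The file name and move numbers below are just placeholders:)

```
# Replay positions from a stored game and ask the engine
# for its move at each point of interest.
loadsgf games/somegame.sgf 20
10 reg_genmove white
loadsgf games/somegame.sgf 21
11 reg_genmove black
```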

Also we see that memory is indeed becoming more of a bottleneck (not
a surprise, of course). This is also an AMD CPU, but a K6-2 at 400 MHz
instead of Gunnar's Athlon (1.6 GHz?):

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 10.88    111.74   111.74 136937536     0.00     0.00  scan_for_patterns
  5.53    168.53    56.79 100769412     0.00     0.00  check_pattern_light
  4.65    216.32    47.79   174590     0.00     0.00  compute_primary_domains
  3.44    251.63    35.31    94773     0.00     0.00  do_push_owl
  3.18    284.33    32.70 18600234     0.00     0.00  order_moves
  3.17    316.87    32.54 63926889     0.00     0.00  fastlib
  2.47    342.19    25.32 29608725     0.00     0.00  do_play_move
  2.28    365.56    23.37  2783429     0.00     0.00  search_persistent_reading_cache
  2.23    388.43    22.87   156910     0.00     0.00  accumulate_influence
  2.18    410.77    22.34 17404893     0.00     0.00  assimilate_string
(...)

That even makes me wonder whether the DFA is still a big speedup at all
on CPUs as fast as Gunnar's.

> But one new thought... do_push_owl() just pushes state ?  But looking
> at the structure, there is a lot of state to remember.
>
> The other owl functions don't appear anywhere significant in the profile.
Well, compute_primary_domains is an owl function.

> So what I'm thinking is : maybe the balance between saving state
> and recalculating state is too far towards saving state. Maybe
> if we saved less state, or spent more effort saving only what we need,
> or made the owl stuff work slightly harder to get hold of its data,
> then we'd get an overall speedup by making do_push_owl faster ?

I recently realized that pushing the eye state (which is the biggest
part of sizeof(local_owl_data)) is unnecessary, since it is completely
recalculated from scratch anyway; I just haven't gotten around to doing
anything about it yet.

However, I wouldn't be sure that this would be a big speedup. Note that
we cannot overwrite the current state (we still need the information for
selecting the next move to try when we return from deeper recursion
levels). A side effect of do_push_owl might be that the memory segment
to which we copy the data is then in the CPU cache, where we need it
anyway.

(Sometime around 3.1.2x I killed pop_owl, which unnecessarily did the
same amount of copying as do_push_owl does now. It turned out to be
only a <1% speedup, although pop_owl had accounted for some 3-4% of CPU
time before.)
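The push-with-no-copying-pop scheme described above can be sketched roughly like this. (This is a simplified illustration, not GNU Go's actual code: the struct contents, sizes, and names are made up, and the real local_owl_data holds far more state.)

```c
#include <assert.h>
#include <string.h>

#define MAX_OWL_DEPTH 64
#define BOARDMAX 441            /* hypothetical board array size */

/* Hypothetical slimmed-down owl state: only what deeper recursion
   levels actually modify, instead of the whole local_owl_data. */
struct owl_state {
    char goal[BOARDMAX];
    char boundary[BOARDMAX];
};

static struct owl_state stack[MAX_OWL_DEPTH];
static int sp = 0;              /* stack[sp] is the current state */

/* Push: copy the current level so deeper recursion can scribble on
   its own copy while the parent's data stays intact for move
   selection after we return. */
static struct owl_state *do_push_owl(void)
{
    assert(sp + 1 < MAX_OWL_DEPTH);
    memcpy(&stack[sp + 1], &stack[sp], sizeof(struct owl_state));
    return &stack[++sp];
}

/* "Pop" needs no copying at all: the parent's state is still sitting
   one level down in the stack, so just move the pointer back. */
static void pop_owl(void)
{
    assert(sp > 0);
    sp--;
}
```

The point of the copy-free pop is exactly the pop_owl removal mentioned above: the smaller sizeof(struct owl_state) is, the cheaper the one remaining memcpy in the push becomes.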

Instead, one could substantially reduce sizeof(local_owl_data) (and thus
both speed up do_push_owl and gain overall cache benefits) by
- maintaining the eyes in a separate list, i.e. not keeping a copy of
  the eye information at every stack level;
- further compressing the eye data (e.g. the 3-bit eye color
  information doesn't need a whole "int", etc.).
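The second point could be done with bitfields, roughly like this (a sketch only; the field names and widths are invented for illustration and don't match GNU Go's real eye_data):

```c
#include <assert.h>

/* Uncompressed (hypothetical) layout: a whole int per small value. */
struct eye_data_fat {
    int color;                  /* only 3 bits of information */
    int esize;
    int msize;
};

/* Packed layout: bitfields squeeze the same values into one word.
   Copying an array of these at every owl stack level is a third of
   the memory traffic of the fat version. */
struct eye_data_packed {
    unsigned color : 3;         /* eye color, 0-7 */
    unsigned esize : 6;         /* assumed enough for any eye space */
    unsigned msize : 6;
};
```

The exact sizes are implementation-defined, but on common ABIs the packed struct occupies a single 4-byte word versus 12 bytes for the three-int version.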

Just a matter of doing it.

Arend





