
[gnugo-devel] statistical regression


From: Douglas Ridgway
Subject: [gnugo-devel] statistical regression
Date: Tue, 2 Mar 2004 16:26:44 -0700 (MST)

Hi all!

After reading some of the discussion on r.g.g. about whether --level 15 
is any improvement over --level 10, I did some work on the statistics. The 
question is: given the results of a series of games, is player A 
stronger than player B? From the point of view of setting up the test, the 
question is how many games are needed to detect a difference in 
strength of a given size. I think people here have also run such tests.

I constructed [1] a table using KGS's formula for converting a strength
difference in stones to a probability of victory, allowing a 5% chance of
falsely identifying a difference when there is none, and a 10% chance of
missing a real difference at the stated mismatch. p is the stronger
player's probability of winning a single game, N is the number of games
that need to be played, and Nw is the number of games that the stronger
player must win to be declared stronger.

Stones  p       N       Nw
0.5     0.60    264     148
1.0     0.69    67      42
1.5     0.77    30      21
2.0     0.83    18      14
2.5     0.88    12      10
3.0     0.92    9       8

The results are interesting. For a short series, <=10 games, nothing less 
than a complete blowout is statistically significant, and we wouldn't 
expect to see that without a major difference in strength, perhaps 3 
stones. Identifying a substantial strength difference, 1.5-2.0 stones, 
requires 20 or 30 games, with the stronger player winning 70% or more of 
them (21 of 30, or 14 of 18). Being sure of a strength difference of half 
a stone requires hundreds of games.

One idea is to check that a change at least hasn't made the program worse.  
The short series are so dominated by noise that they may not be worth
running at all. A run of 20 or 30 games, on the other hand, with a
required winning share of roughly 70%, makes some sense. That at least
gives about a 90% chance of catching a mistake that costs 1.5 to 2.0
stones, and some chance of identifying smaller changes, positive or
negative.
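
The error rates of such a rule can be checked with an exact binomial sum
rather than the normal approximation. A minimal Python sketch, assuming the
1.5-stone row of the table (30 games, 21 wins required, p = 0.77); the
helper name prob_at_least is made up for illustration:

```python
from math import comb

def prob_at_least(wins, games, p):
    """Exact probability of winning at least `wins` of `games`
    when each game is won independently with probability p."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

games, needed = 30, 21   # 1.5-stone row of the table
p_strong = 0.77          # win probability at 1.5 stones, from the table

# Chance that an equal-strength player reaches the threshold anyway.
false_alarm = prob_at_least(needed, games, 0.5)
# Chance that a player who really is 1.5 stones stronger reaches it.
power = prob_at_least(needed, games, p_strong)

print(f"false alarm (one side): {false_alarm:.3f}")
print(f"power at p={p_strong}: {power:.3f}")
```

Changing games, needed and p_strong to another row of the table shows how
close that row comes to the 5% and 10% targets.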

I tried 3.5.3 at --level 15 (always white, receiving 6.5 komi) against
--level 10.  Assuming I did it right [2], they split the series 10-10, 
indicating a strength difference of a stone or less, and no clue which one 
is stronger.
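
One way to put a number on "a stone or less" is to take a 95% confidence
interval for the win probability and convert its ends back to stones by
inverting the logistic formula from [1]. A rough Python sketch; the Wilson
score interval is a choice made here for illustration, not something used
in the calculation above, and the function names are hypothetical:

```python
from math import log, sqrt

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    phat = wins / games
    denom = 1 + z * z / games
    centre = (phat + z * z / (2 * games)) / denom
    half = z * sqrt(phat * (1 - phat) / games
                    + z * z / (4 * games * games)) / denom
    return centre - half, centre + half

def stones_from_p(p, k=0.8):
    """Invert p = 1/(1+exp(-k*stones)); k=0.8 is the constant used in [1]."""
    return log(p / (1 - p)) / k

lo, hi = wilson_interval(10, 20)   # the 10-10 split
print(f"p between {lo:.2f} and {hi:.2f}")
print(f"stones between {stones_from_p(lo):+.2f} and {stones_from_p(hi):+.2f}")
```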

doug.

[1] For people who'd like to check the math, here's the Matlab code:

p = 1./(1+exp(-0.8*[0.5:0.5:3.0]))
Ns = ceil(((1.96*sqrt(p.*(1-p))+1.28*sqrt(.5*(1-.5)))./(p-.5)).^2)
Nw = binoinv(0.975, Ns, 0.5)+1
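
For anyone without Matlab's Statistics Toolbox, the same three lines can be
approximated in plain Python. This is a sketch under assumptions: the
binoinv below is a hand-written stand-in meant to mimic Matlab's function
(smallest x whose binomial CDF reaches the given level), and the other
function names are invented for illustration.

```python
from math import ceil, comb, exp, sqrt

def win_probability(stones, k=0.8):
    """Logistic conversion from stones to win probability, as in line 1."""
    return 1 / (1 + exp(-k * stones))

def games_needed(p, z_alpha=1.96, z_beta=1.28):
    """Normal-approximation sample size, term for term as in line 2."""
    return ceil(((z_alpha * sqrt(p * (1 - p))
                  + z_beta * sqrt(0.5 * (1 - 0.5))) / (p - 0.5)) ** 2)

def binoinv(q, n, p):
    """Smallest x with P(X <= x) >= q for X ~ Binomial(n, p)."""
    cdf = 0.0
    for x in range(n + 1):
        cdf += comb(n, x) * p**x * (1 - p)**(n - x)
        if cdf >= q:
            return x
    return n

print("Stones\tp\tN\tNw")
for stones in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    p = win_probability(stones)
    n = games_needed(p)
    nw = binoinv(0.975, n, 0.5) + 1   # wins needed, as in line 3
    print(f"{stones:.1f}\t{p:.2f}\t{n}\t{nw}")
```

The printed rows can be compared directly against the table above.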


[2] Does the command line

 perl twogtp --white '/usr/local/bin/gnugo --mode gtp --level 15' \
   --black '/usr/local/bin/gnugo --mode gtp --level 10' \
   --komi 6.5 --games 20 --sgffile filename.sgf

look about right?






