[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Optimisation of statistic calculations
From: 
John Darrington 
Subject: 
Re: Optimisation of statistic calculations 
Date: 
Thu, 4 Nov 2004 07:25:05 +0800 
Useragent: 
Mutt/1.3.28i 
On Wed, Nov 03, 2004 at 09:55:08AM 0800, Ben Pfaff wrote:
Redundant time, i.e. calculating the same thing more than once
for a single variable or data set, seems to me like a separate
issue that needs to be addressed separately. I haven't thought
about this much at all. Your DAG sounds like a pretty general
scheme for addressing this. I assume that it's basically a
dependency graph: to calculate C we need to know A and B. But
it's not obvious to me why we need two DAGs; a Makefile is the
same kind of dependency graph as far as I can tell and there's
only one DAG involved in that case. Can you explain further?
Consider the case of calculating variance. Variance is defined as
var = Sum (x_i  x_bar)^2 / n
where:
x_bar = Sum(x_i)/n
(We know there's a more efficient formula, but I'll ignore that for
the benefit of this explaination)
So var depends upon x_bar and n. However it doesn't need to known n
until it has iterated through all the cases, whereas it needs to know
x_bar before it even starts the computation. Therefore, we have a
dependency graph which looks like:
var x_bar
 /\
n n x_sum
(it happens to be disconnected in this case)
AND a pre_dependency graph:
var

x_bar
Like I said, there are ways to calculate variance that don't need any
pre_dependencies, but there are other stats for which there isn't.
The Levene statistic is one of them. Thus, you will notice that
levene.c makes two passes through the data.
Does this make sense?

PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://wwwkeys.pgp.net or any PGP keyserver for public key.
pgpac30u22ixQ.pgp
Description: PGP signature