pspp-dev
[Top][All Lists]

## Re: Optimisation of statistic calculations

 From: John Darrington Subject: Re: Optimisation of statistic calculations Date: Thu, 4 Nov 2004 07:25:05 +0800 User-agent: Mutt/1.3.28i

```On Wed, Nov 03, 2004 at 09:55:08AM -0800, Ben Pfaff wrote:

Redundant time, i.e. calculating the same thing more than once
for a single variable or data set, seems to me like a separate
issue that needs to be addressed separately.  I haven't thought
scheme for addressing this.  I assume that it's basically a
dependency graph: to calculate C we need to know A and B.  But
it's not obvious to me why we need two DAGs; a Makefile is the
same kind of dependency graph as far as I can tell and there's
only one DAG involved in that case.  Can you explain further?

Consider the case of calculating variance.  Variance is defined as

var = Sum (x_i - x_bar)^2 / n

where:
x_bar = Sum(x_i)/n

(We know there's a more efficient formula, but I'll ignore that for
the benefit of this explaination)

So var depends upon x_bar and n.  However it doesn't need to known n
until it has iterated through all the cases, whereas it needs to know
x_bar before it even starts the computation.    Therefore, we have a
dependency graph which looks like:

var        x_bar
|          /\
n         n  x_sum

(it happens to be disconnected in this case)
AND a pre_dependency graph:

var
|
x_bar

Like I said, there are ways to calculate variance that don't need any
pre_dependencies, but there are other stats for which there isn't.
The Levene statistic is one of them.  Thus, you will notice that
levene.c makes two passes through the data.

Does this make sense?

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://wwwkeys.pgp.net or any PGP keyserver for public key.

```

pgpac30u22ixQ.pgp
Description: PGP signature