pspp-dev
[Top][All Lists]

## Re: moments.c

 From: Ben Pfaff Subject: Re: moments.c Date: Wed, 21 Mar 2007 20:43:37 -0700 User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

```Jason Stover <address@hidden> writes:

> On Wed, Mar 21, 2007 at 02:27:22PM -0700, Ben Pfaff wrote:
>> If correlations and moments are separate, you can compute them in
>> a single pass like this:
>>
>>         for each case C do:
>
> But correlations_add() would need to know what happened in
> moments1_add(). Otherwise its computations would be redundant because
> it would have to compute the mean on its own.
>
> Correlations depend on the means. Computing correlations requires
> either computing the means first, then the correlation (meaning an
> extra data pass), or using a special algorithm that computes both
> simultaneously. Since the computations of correlations and means are
> intertwined in a one-pass algorithm, I thought the computation of
> correlation should be in moments.c. If it isn't in moments.c, then
> anyone who wants to compute correlations would have to do so by
> writing their own functions to manipulate the moments struct in the
> loop, or forgoing the use of moments.c and re-writing their own
> functions to compute means.
>
> An algorithm for computing correlations with one pass was recently
> added to the GSL cvs repository. You can see it at
> http://sourceware.org/cgi-bin/cvsweb.cgi/gsl/?cvsroot=gsl in
> statistics/covariance.c.

OK, here's the core of the correlation code from that file:

long double sum_xsq = 0.0;
long double sum_ysq = 0.0;
long double sum_cross = 0.0;
...
mean_x = data1[0 * stride1];
mean_y = data2[0 * stride2];

for (i = 1; i < n; ++i)
{
ratio = i / (i + 1.0);
delta_x = data1[i * stride1] - mean_x;
delta_y = data2[i * stride2] - mean_y;
sum_xsq += delta_x * delta_x * ratio;
sum_ysq += delta_y * delta_y * ratio;
sum_cross += delta_x * delta_y * ratio;
mean_x += delta_x / (i + 1.0);
mean_y += delta_y / (i + 1.0);
}

r = sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));

We can break this into three functions, one for initializing
variables, one for processing each case, one for producing the
output, pretty easily:

correlation_create():
long double sum_xsq = 0.0;
long double sum_ysq = 0.0;
long double sum_cross = 0.0;
i = 0;

i++;
ratio = i / (i + 1.0);
delta_x = data1[i * stride1] - moments1_mean(m1_x);
delta_y = data2[i * stride2] - moments1_mean(m1_y);
sum_xsq += delta_x * delta_x * ratio;
sum_ysq += delta_y * delta_y * ratio;
sum_cross += delta_x * delta_y * ratio;

correlation_calculate():
return sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));

(I'm assuming that we write a function moments1_mean() that does
the same as moments1_calculate() except that it just returns the
mean.)

And then the main loop becomes:

moments1_create()
correlation_create()
for each case C do:
get x and y from case
moments1_calculate()
correlation_calculate()

This works, no?  And then correlations doesn't have a dependency
on moments1 except as a user.  We could even get rid of that
minor dependency by changing the interface from taking the
moments1 objects to taking the value of the mean.

Let me know what you think.
--
Ben Pfaff