pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: moments.c


From: Ben Pfaff
Subject: Re: moments.c
Date: Wed, 21 Mar 2007 20:43:37 -0700
User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Jason Stover <address@hidden> writes:

> On Wed, Mar 21, 2007 at 02:27:22PM -0700, Ben Pfaff wrote:
>> If correlations and moments are separate, you can compute them in
>> a single pass like this:
>> 
>>         for each case C do:
>>             moments1_add(C)
>>             correlations_add(C)
>
> But correlations_add() would need to know what happened in
> moments1_add(). Otherwise its computations would be redundant because
> it would have to compute the mean on its own.
>
> Correlations depend on the means. Computing correlations requires
> either computing the means first, then the correlation (meaning an
> extra data pass), or using a special algorithm that computes both
> simultaneously. Since the computations of correlations and means are
> intertwined in a one-pass algorithm, I thought the computation of 
> correlation should be in moments.c. If it isn't in moments.c, then
> anyone who wants to compute correlations would have to do so by
> writing their own functions to manipulate the moments struct in the
> loop, or forgoing the use of moments.c and re-writing their own
> functions to compute means.
>
> An algorithm for computing correlations with one pass was recently
> added to the GSL cvs repository. You can see it at 
> http://sourceware.org/cgi-bin/cvsweb.cgi/gsl/?cvsroot=gsl in
> statistics/covariance.c.

OK, here's the core of the correlation code from that file:

  long double sum_xsq = 0.0;
  long double sum_ysq = 0.0;
  long double sum_cross = 0.0;
...
  mean_x = data1[0 * stride1];
  mean_y = data2[0 * stride2];

  for (i = 1; i < n; ++i)
    {
      ratio = i / (i + 1.0);
      delta_x = data1[i * stride1] - mean_x;
      delta_y = data2[i * stride2] - mean_y;
      sum_xsq += delta_x * delta_x * ratio;
      sum_ysq += delta_y * delta_y * ratio;
      sum_cross += delta_x * delta_y * ratio;
      mean_x += delta_x / (i + 1.0);
      mean_y += delta_y / (i + 1.0);
    }

  r = sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));

We can break this into three functions, one for initializing
variables, one for processing each case, one for producing the
output, pretty easily:

  correlation_create():
      long double sum_xsq = 0.0;
      long double sum_ysq = 0.0;
      long double sum_cross = 0.0;
      i = 0;

  correlation_add(x, m1_x, y, m1_y):
      i++;
      ratio = i / (i + 1.0);
      delta_x = data1[i * stride1] - moments1_mean(m1_x);
      delta_y = data2[i * stride2] - moments1_mean(m1_y);
      sum_xsq += delta_x * delta_x * ratio;
      sum_ysq += delta_y * delta_y * ratio;
      sum_cross += delta_x * delta_y * ratio;

  correlation_calculate():
      return sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));

(I'm assuming that we write a function moments1_mean() that does
the same as moments1_calculate() except that it just returns the
mean.)

And then the main loop becomes:

  moments1_create()
  correlation_create()
  for each case C do:
      get x and y from case
      moments1_add(&m1_x, x)
      moments1_add(&m1_y, y)
      correlations_add(x, &m1_x, y, &m1_y)
  moments1_calculate()
  correlation_calculate()

This works, no?  And then correlations doesn't have a dependency
on moments1 except as a user.  We could even get rid of that
minor dependency by changing the interface from taking the
moments1 objects to taking the value of the mean.

Let me know what you think.
-- 
Ben Pfaff 
address@hidden
http://benpfaff.org




reply via email to

[Prev in Thread] Current Thread [Next in Thread]