pspp-dev
[Top][All Lists]

Re: Covariance Matrices

 From: Jason Stover Subject: Re: Covariance Matrices Date: Wed, 20 Aug 2008 10:57:33 -0400 User-agent: Mutt/1.5.18 (2008-05-17)

On Wed, Aug 20, 2008 at 08:45:00AM +0800, John Darrington wrote:
> I've been thinking how to implement factor analysis.
>
> One thing that's clearly required is a covariance matrix. We currently
> have src/math/covariance-matrix.[ch] which should do the job, but I
> think it can be improved.  The current interface has the function
>
>   void covariance_pass_two (struct design_matrix *cov,
>                           double mean1, double mean2,
>                           double ssize,
>                           const struct variable *v1,
>                           const struct variable *v2,
>                           const union value *val1,
>                           const union value *val2);

Yep, too many arguments.

> 1.  As I see it, mean1 and mean2 are not necessary.  The traditional
>     definition of cov(x,y) is
>      \sum{(x_i - \overbar{x})(y_i - \overbar{y})},
>       but this can be expanded to
>      \sum x_i y_i - \overbar{x}\sum y_i - \overbar{y}\sum x_i +
>      n\overbar{x}\overbar{y},
>      which doesn't have any \overbar components inside the \sum, so
>      the mean can be calculated as we go, and applied post hoc.

I had planned to add a one-pass algorithm, but used a two-pass algorithm
first because, "usually", two-pass algorithms have lower relative errors than
one-pass algorithms, according to

"Algorithms for Computing the Sample Variance: Analysis and Recommendations"
TF chan, G. Golub, R. LeVeque. American Statistician, v37 n3, 1983, pp. 242-247.

So I had planned to add a one-pass algorithm, based on the algorithms in that
paper, but never got around to it.

> 2.  I'm not sure what the ssize parameter is for.  It's only used in
>     the case where at least one variable is alpha (and I've not yet
>     worked out in my mind what meaning a covariance matrix for string
>     variables can have).  But in glm.q I see that it's being passed
>     the same value in all invocations.  Is this going to be generally
>     true?  If so, then we can make ssize a member of cov, and set it
>     at construction time.

That's a good idea. Here is why we need ssize in there (somewhere):

If a variable is alpha, it gets encoded as a bit vector, like this:

a ---> (1 0)
b ---> (0 1)
c ---> (0 0)

then this bit vector is used in computation of the covariance. That's
necessary for fitting a linear model with any factors.

The ssize argument is necessary since there is no "mean" stored for each
of the components of these vectors. The "mean" in this case is just the
proportion of 1's in each position of the vector. So I passed ssize to
covariance_pass_two() to compute those proportions.

> 3.  Rather than passing in val1 and val2, I suggest that we pass in
>     a (struct ccase *), and index into the values inside the function,
>     using v1 and v2.

OK.

> If these suggestions are implemented, then this function becomes
>
>   void covariance_accumulate (struct design_matrix *cov,
>                           const struct variable *v1,
>                           const struct variable *v2,
>                           const struct ccase *c);

That looks better.

> Typically, this will be called in two nested loops, thus:
>
>
>    {
>     for (i = 0; i < n_all_vars; ++i)
>     {
>       const struct variable *v = all_vars[i];
>       for (j = i; j < n_all_vars; j++)
>         {
>           const struct variable *w = all_vars[j];
>           covariance_accumulate (X, v, w, &c);
>         }
>     }
>  }
>
> But all all_vars is also passed in at construction time, so we might as well
> put the two loops inside the function, and not bother with v1 and v2
> as parameters.   Then we have simply:
>
>
>   void covariance_accumulate (struct design_matrix *cov,
>                           const struct ccase *c);
>
>
> I can see that there are a lot of procedures which will need
> covariance matrices, so the simpler we make their creation, the better.
>
>
> How many of these suggestions make sense?

I like the suggestion. The only reasons I never thought of adding
the struct ccase to the function's arguments were:

1. I didn't know about struct ccase, except in the rudimentary way I used
it in a procedure, and

2. Since I didn't know about it, I was afraid to make any code in src/math

My only suggestion is that the code in covariance-matrix.[ch] should have
both one- and two-pass algorithms. So maybe add covariance_accumulate()
and change covariance_pass_two () to incorporate your changes, but passing
an argument like double *means to use the means.

-Jason