pspp-dev

## Covariance Matrices

From: John Darrington
Subject: Covariance Matrices
Date: Wed, 20 Aug 2008 08:45:00 +0800
User-agent: Mutt/1.5.13 (2006-08-11)

I've been thinking how to implement factor analysis.

One thing that's clearly required is a covariance matrix. We currently
have src/math/covariance-matrix.[ch] which should do the job, but I
think it can be improved.  The current interface has the function

    void covariance_pass_two (struct design_matrix *cov,
                              double mean1, double mean2,
                              double ssize,
                              const struct variable *v1,
                              const struct variable *v2,
                              const union value *val1,
                              const union value *val2);

1.  As I see it, mean1 and mean2 are not necessary.  Up to the
normalising factor 1/(n-1), the traditional definition of cov(x,y) is
\sum (x_i - \bar{x})(y_i - \bar{y}),
but this can be expanded to
\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y},
which, since \sum x_i = n\bar{x}, simplifies to
\sum x_i y_i - n\bar{x}\bar{y}.
This doesn't have any \bar components inside the \sum, so
the mean can be calculated as we go, and applied post hoc.

2.  I'm not sure what the ssize parameter is for.  It's only used in
the case where at least one variable is alpha (and I've not yet
worked out in my mind what meaning a covariance matrix for string
variables can have).  But in glm.q I see that it's being passed
the same value in all invocations.  Is this going to be generally
true?  If so, then we can make ssize a member of cov, and set it
at construction time.

3.  Rather than passing in val1 and val2, I suggest that we pass in
a (struct ccase *), and index into the values inside the function,
using v1 and v2.
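
To illustrate point 1, here is a minimal sketch of accumulating the sums for a single pair of variables and applying the means only at the end. The names struct pair_sums, pair_sums_add and pair_sums_covariance are hypothetical and are not part of PSPP; the sketch also assumes the sample covariance (division by n - 1) is wanted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for one pair of variables (not PSPP code). */
struct pair_sums
  {
    double n;       /* Number of cases seen so far. */
    double sum_x;   /* \sum x_i */
    double sum_y;   /* \sum y_i */
    double sum_xy;  /* \sum x_i y_i */
  };

/* Add one case.  No mean is needed at this point. */
static void
pair_sums_add (struct pair_sums *s, double x, double y)
{
  s->n += 1.0;
  s->sum_x += x;
  s->sum_y += y;
  s->sum_xy += x * y;
}

/* Apply the means post hoc: \sum x_i y_i - n \bar{x} \bar{y},
   divided by n - 1 to give the sample covariance. */
static double
pair_sums_covariance (const struct pair_sums *s)
{
  assert (s->n > 1.0);
  double mean_x = s->sum_x / s->n;
  double mean_y = s->sum_y / s->n;
  double cross = s->sum_xy - s->n * mean_x * mean_y;
  return cross / (s->n - 1.0);
}
```

One caveat with this single-pass form: when the means are large compared with the spread of the data, the final subtraction can lose precision through cancellation, so a real implementation may want to guard against that.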

If these suggestions are implemented, then this function becomes

    void covariance_accumulate (struct design_matrix *cov,
                                const struct variable *v1,
                                const struct variable *v2,
                                const struct ccase *c);

Typically, this will be called in two nested loops, thus:

    {
      /* The matrix is symmetric, so visit only pairs (i, j) with j >= i. */
      for (i = 0; i < n_all_vars; ++i)
        {
          const struct variable *v = all_vars[i];
          for (j = i; j < n_all_vars; ++j)
            {
              const struct variable *w = all_vars[j];
              covariance_accumulate (X, v, w, &c);
            }
        }
    }

But all_vars is also passed in at construction time, so we might as well
put the two loops inside the function, and not bother with v1 and v2
as parameters.   Then we have simply:

    void covariance_accumulate (struct design_matrix *cov,
                                const struct ccase *c);
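
As a rough sketch of how the two loops might live inside the accumulate function, the following uses a hypothetical struct cov_sketch in place of struct design_matrix, and a plain array of doubles in place of struct ccase. None of these names exist in PSPP; they are assumptions made only so that the example is self-contained. It also assumes numeric variables only and a sample covariance (division by n - 1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct design_matrix (not PSPP code).
   The number of variables is fixed at construction time. */
struct cov_sketch
  {
    size_t n_vars;  /* Number of variables. */
    double n;       /* Cases accumulated so far. */
    double *sum;    /* sum[i] = \sum x_i. */
    double *cross;  /* cross[i * n_vars + j] = \sum x_i x_j, for j >= i. */
  };

static void
cov_sketch_destroy (struct cov_sketch *cov)
{
  if (cov != NULL)
    {
      free (cov->sum);
      free (cov->cross);
      free (cov);
    }
}

/* Returns a new accumulator, or NULL if allocation fails.
   calloc's all-bits-zero is 0.0 on IEEE 754 systems. */
static struct cov_sketch *
cov_sketch_create (size_t n_vars)
{
  struct cov_sketch *cov = malloc (sizeof *cov);
  if (cov == NULL)
    return NULL;
  cov->n_vars = n_vars;
  cov->n = 0.0;
  cov->sum = calloc (n_vars, sizeof *cov->sum);
  cov->cross = calloc (n_vars * n_vars, sizeof *cov->cross);
  if (cov->sum == NULL || cov->cross == NULL)
    {
      cov_sketch_destroy (cov);
      return NULL;
    }
  return cov;
}

/* Both loops are inside; the caller passes only the case, here an
   array of n_vars doubles standing in for struct ccase. */
static void
cov_sketch_accumulate (struct cov_sketch *cov, const double *c)
{
  size_t i, j;

  cov->n += 1.0;
  for (i = 0; i < cov->n_vars; ++i)
    {
      cov->sum[i] += c[i];
      for (j = i; j < cov->n_vars; ++j)
        cov->cross[i * cov->n_vars + j] += c[i] * c[j];
    }
}

/* Sample covariance of variables i and j, with the means applied post hoc. */
static double
cov_sketch_get (const struct cov_sketch *cov, size_t i, size_t j)
{
  assert (cov->n > 1.0 && i < cov->n_vars && j < cov->n_vars);
  if (i > j)
    {
      /* Only the upper triangle is stored. */
      size_t tmp = i;
      i = j;
      j = tmp;
    }
  return (cov->cross[i * cov->n_vars + j]
          - cov->sum[i] * cov->sum[j] / cov->n) / (cov->n - 1.0);
}
```

With this shape, a procedure would create the accumulator once, call cov_sketch_accumulate for each case it reads, and query entries at the end.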

I can see that there are a lot of procedures which will need
covariance matrices, so the simpler we make their creation, the better.

How many of these suggestions make sense?

J'

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.


