pspp-dev

Re: Next step in covariance matrix


From: John Darrington
Subject: Re: Next step in covariance matrix
Date: Tue, 27 Oct 2009 18:25:32 +0000
User-agent: Mutt/1.5.18 (2008-05-17)

So it sounds as if the next step is simply to drop one column per categorical 
variable.
That should be quite simple.
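
Roughly like this, I imagine (just a sketch to check my understanding; the
function and its names are made up, not taken from the existing code):

#include <stddef.h>

/* Sketch: expand one categorical variable into (n_categories - 1)
   indicator columns, dropping category 0 as the reference.  DESIGN is a
   row-major matrix with N_COLS columns and N_CASES rows; CATEG gives the
   0-based category of each case.  Returns the number of columns written. */
static size_t
encode_dropping_reference (double *design, size_t n_cases, size_t n_cols,
                           const size_t *categ, size_t n_categories,
                           size_t first_col)
{
  size_t i, c;

  for (i = 0; i < n_cases; i++)
    for (c = 1; c < n_categories; c++)     /* category 0 gets no column */
      design[i * n_cols + first_col + (c - 1)]
        = (categ[i] == c ? 1.0 : 0.0);

  return n_categories - 1;
}

With the numeric variables passed through unchanged, that gives the
n + m - p columns of the example below.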

Will that be enough to allow a subset of GLM to be implemented?

J'

On Tue, Oct 27, 2009 at 11:47:23AM -0400, Jason Stover wrote:
     On Tue, Oct 27, 2009 at 06:38:19AM +0000, John Darrington wrote:
     > Just to make sure I understand things correctly, consider the following example,
     > where x and y are numeric variables and A and B are categorical ones:
     > 
     > x y A B
     > =======
     > 3 4 x v
     > 5 6 y v
     > 7 8 z w
     > 
     > We replace the categorical variables with bit_vectors:
     > 
     > x y A_0 A_1 A_2  B_0 B_1
     > ========================
     > 3 4  1   0   0    1   0
     > 5 6  0   1   0    1   0
     > 7 8  0   0   1    0   1
     > 
     > and arbitrarily drop the (say zeroth) subscript:
     > 
     > x y  A_1 A_2   B_1
     > ==================
     > 3 4   0   0     0
     > 5 6   1   0     0
     > 7 8   0   1     1
     > 
     > That will produce a 5x5 matrix. 5 is calculated from n + m - p,  where 
     > n is the number of numeric variables, m is the total number of categories,
     > and p is the number of categorical variables.  
     
     This is correct. 
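
For the example, if I've counted right, that works out as:

   n = 2 (x, y),   m = 3 + 2 = 5,   p = 2   =>   n + m - p = 2 + 5 - 2 = 5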
     
     > However I don't see how such a matrix can be very useful. A better one
     > would involve
     > the  products of the categorical and numeric variables:
     > 
     > x y  x*A_1 x*A_2  y*A_1 y*A_2   x*B_1 y*B_1
     > ===========================================
     > 3 4     0   0        0     0       0     0
     > 5 6     5   0        6     0       0     0
     > 7 8     0   7        0     8       7     8
     > 
     > This makes an 8x8 matrix, where 8 is calculated from n + n * (m - p) , 
     > which happens to be identical to n * (1 + m - p).  But this involves
     > a whole lot more calculations.
     
     This second choice would give you the covariance of x and y, and the
     covariances of the *interactions* between x and A, x and B, y and A,
     and y and B, but not the covariance between (say) x and A. The
     covariance between x and A would be stored in the first matrix you
     mentioned, in elements (0,2), (0,3), (2,0) and (3,0) assuming we kept
     both upper and lower triangles.
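
So for the example, if I follow, the layout of that first (5x5) matrix would be:

   row/column order:  0 = x,  1 = y,  2 = A_1,  3 = A_2,  4 = B_1

   cov(x, A_1) -> (0,2), (2,0)     cov(x, A_2) -> (0,3), (3,0)
   cov(y, A_1) -> (1,2), (2,1)     cov(y, A_2) -> (1,3), (3,1)
   cov(x, B_1) -> (0,4), (4,0)     cov(y, B_1) -> (1,4), (4,1)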
     
     You mention that matrix not being very useful, and in a sense it
     isn't: No human would care about the covariance between x and the
     column corresponding to the first bit vector of A. But in another
     sense, that matrix is absolutely necessary: It's used to solve the
     least squares problem, whose solution we use to tell us if A and our
     dependent variable are related. That relation is shown via analysis of
     variance, whose p-value is many computations away from the covariance
     matrix, but depends on it nevertheless.
     
     This matrix is unnecessary for a one-way ANOVA, whose computations from
     the matrix above can be simplified into the simple sums used in
     oneway.q.  But for a bigger model, with many factors and interactions
     and covariates, we need that first matrix because we can't reduce the
     problem to a few easy-to-read summations.
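
Just so I have the connection straight (my notation, nothing from the code):
if S is that first covariance matrix, partitioned into the block S_pp over the
predictor columns and the vector s_pd of covariances between the predictors
and the dependent variable, then the least squares coefficients come from
solving

   S_pp b = s_pd

and the analysis of variance and its p-values are computed from that fit.
Is that roughly the shape of it?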
     
     

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.



