[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: GLM/ANOVA
From: |
Jason Stover |
Subject: |
Re: GLM/ANOVA |
Date: |
Sat, 7 Jun 2008 16:18:42 -0400 |
User-agent: |
Mutt/1.5.10i |
On Sat, Jun 07, 2008 at 10:09:13AM +0100, Ed wrote:
> 2008/6/5 Jason Stover <address@hidden>:
> > This is a nice way to understand the problem, but I don't think it's
> > the right way to code it. I would like to keep lib/linreg as a
> > univariate library. That way, it is useful to other procedures in an
> > obvious way ("lib/linreg: solve the normal equations for me now!")
> > Also, it's complex enough dealing with Level 2 linear algebraic
> > operations. Making it deal with Level 3 to solve systems of matrices
> > could be done more easily by writing another, higher-level library
> > that uses lib/linreg to do the lower-level solving of systems.
>
> I didn't totally understand this. I would have thought that either
> reusing linreg, or writing a separate routine that made all its calls
> to matrix algebra would be best. Calling a vector routine multiple
> times to perform matrix algebra must be slow, and surely defeats all
> the blocking code for fast matrix multiply. So I guess my suggestion
> is to have a mlinreg.{c,h} that just does linreg in matrices if
> reusing linreg is problematic.
I don't know what the best way is on this. I would use these guiding
principles, which may conflict with each other occasionally:
0. Level n linear algebra can be easily coded using already-existing
Level n-1 code. This is the lesson of engineering of making modules
out of other, more detailed, pre-existing modules. This allows us to
build on what's already there, saving us the time of re-creating
anything.
1. "Optimization" is over-rated. Calling vector routines to implement
matrix routines may not be fast, but it's easy to code and easy to
judge as being correct or incorrect. How much slower is it for most
purposes anyway? Say using the vector routines (the slow way) gets you
a final algorithm that runs in O(n^3) time, but using the faster
algorithm runs in O(n^2.999). If that faster algorithm takes many
hours to code, and is hard to maintain because it's abstruse, I'd say
it's a waste of time (unless you have nothing better to do).
2. Someone else may have to read your code and maintain it, and they
may not be as smart as you. Wouldn't it be nice to have a program that
even a monkey can maintain? One that works even if only second-rate
programmers keep it going? Ben or John are first-rate programmers, but
there aren't many first-rate programmers out there, so why lower the
chance of your program's success by making it hard to understand? Now,
writing clever code may be necessary to get that O(n^2) algorithm down
to O(log(n)), but if you can't get that extreme an improvement, then
writing dumb code makes life easier for other programmers, and for the
program. (I mean "dumb" as in "obvious", not "incorrect".)
3. Sometimes it's easier just to "roll your own." Using linreg.c may
take longer than just writing your own functions. Using GSL's matrix
multiplication may take longer than writing your own.
4. It's good to make procedures with a slightly bigger "mouth" than...
whatever makes the output. By that I mean that a procedure should accept
multiple kinds of output and still do what's expected. That way it can
be used later to solve unanticipated problems without being changed
much.
5. A big improvement (like being able to use a data set much larger
than the memory, or a O(log(n)) algorithm instead of a O(n^2) one) is
something worth having, even if it means writing code that isn't so
dumb.
It may be easy to change linreg.c. I guess I would have to see the
final code to decide. What I know is that I personally don't often
need the matrix expression, and can just stick to the usual
vector-version, so the extra overhead in the more general case is
usually a burden. If I were going to code MANOVA, I would add a
specialized file for it. But maybe that isn't the best way. Time will
tell. Or maybe it won't.
> > Also, most users aren't going to need the full matrix systems since
> > most users are just doing the usual linear regression with qualitative
> > variables. So unless the user specifically asks for MANOVA,
> > lib/linreg can already handle the job.
>
> I don't want to put words into your mouth, I just want to check if I
> got this right. You seem to be saying that linreg is basically the
> core of GLM univariate from SPSS already, which is true except for the
> handling of random factors. I guess since the solution for those is
> just postmultiplication by another matrix, this is fine.
>
> In which case, perhaps the sensible thing for me to aim for first is
> to provide the GLM Univariate functionality using linreg. This makes
> the first steps support for more complex models in the design matrix,
> the various types of sums of squares, and the unianova commands, which
> seems reasonable.
Yes, that's what I think.
> > There is another reason not to change lib/linreg: lib/linreg currently
> > assumes it has been passed a matrix consisting of all the data. I
> > would like to change this to make PSPP better able to handle large
> > data sets. This will mean that lib/linreg is going to have to know
> > whether it has been passed all the data, or products of the design
> > matrix and the vector of the dependent variable. That change to
> > lib/linreg will make it more complex than it already is.
>
> I guess my argument for reuse is just that one might imagine most of
> the code to handle large datasets would be shared between the two
> routines. If its simply two execution paths in one routine, then
> there's no replica to maintain; essentially, one routine gets it
> "free". The more sophisticated and complicated the code is, the more
> benefit to not coding it twice.
Yes.
> > We will have to look up the more abstruse sums of squares and
> > statistics as we need them. I don't know if all that stuff should be
> > in a linear regression library, or outside in the code of a
> > procedure. It should be outside in a procedure if it's not something
> > that other procedures might need; inside a library otherwise.
>
> I was just going to have a sumofsquares.{c,h} somewhere that can take
> a convert a set of regression results into sums of squares of the
> various types. Presumably it would be called by whatever drives the
> regression stuff, probably the UNIANOVA command or something similar.
That sounds good.
-Jason