Re: parametric models [Was: Re: musings on performance]
From: Jason Stover
Subject: Re: parametric models [Was: Re: musings on performance]
Date: Wed, 10 May 2006 21:55:31 -0400
User-agent: Mutt/1.5.10i
On Wed, May 10, 2006 at 07:43:41PM +0800, John Darrington wrote:
> On Tue, May 09, 2006 at 05:36:16PM -0400, Jason Stover wrote:
> > I would like to make PSPP able to:
> >
> > 1. Save models for later use within PSPP. 'Later uses' include
> > combining them into other models, and assessing by comparing many
> > models, mostly by checking their performance on 'scratch' data.
> > 'Later uses' might also include fitting other models that could use
> > some of the sufficient statistics (like sample means and covariance
> > matrices). Saving models would not take much work if I can use the pool
> > allocator to do so.
>
> Here's where I'm going to start showing my ignorance of statistical
> methods. What exactly do you mean by a "model"? How is it different
> (or similar) to the data saved by SPSS's MATRIX subcommand?
The MATRIX subcommand saves some pieces of what I would call the
estimated "model", like correlations, means and standard
deviations. Other procedures can make use of this information, but
there is a lot more that could be saved and used later. Much of what
could be saved and used later isn't anything a human would want to
read, but could be used by a machine.
I didn't define "model" specifically. I'll try to clarify what I'm
thinking without giving an exact definition. I am thinking of defining
the term inside PSPP to let the machine do with "models" what
statisticians and mathematicians do with them on paper, but can't
currently do in a machine.
In most discussions about modeling, the word "model" refers to the
mathematical description of the probabilistic behavior of the data.
This description is usually in the form of, or equivalent to, a
distribution function that must be estimated via the data and some
estimation algorithm. For example, the usual model for linear regression
with one explanatory variable is
  1. dependent variable = (unknown intercept) + (unknown slope) *
     (explanatory variable) + (random noise)
  2. the random noise is normally distributed with zero mean and
     common variance V.
Those two statements are usually referred to as the "model." They
completely determine the probability distribution of the dependent
variable given the explanatory variable: It must have a Gaussian
distribution, with a mean of (intercept) + (slope) * (explanatory
variable), and a variance of V. Those two statements also tell us how
to estimate the unknown slope, intercept and V.
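
A minimal sketch in C of how those three unknowns could be estimated,
assuming ordinary least squares for the slope and intercept and the
customary residual-sum-of-squares / (n - 2) estimate for V. The names
simple_fit and fit_simple_regression are made up for illustration;
they are not PSPP functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three unknowns of the model above. */
struct simple_fit
{
  double intercept;   /* estimate of the unknown intercept */
  double slope;       /* estimate of the unknown slope */
  double variance;    /* estimate of V: residual sum of squares / (n - 2) */
};

/* Least-squares fit of y = intercept + slope * x + noise.
   Requires n >= 3 and at least two distinct x values. */
struct simple_fit
fit_simple_regression (const double *x, const double *y, size_t n)
{
  struct simple_fit fit;
  double mean_x = 0.0, mean_y = 0.0;
  double sxx = 0.0, sxy = 0.0, rss = 0.0;
  size_t i;

  assert (n >= 3);

  /* Sample means of both variables. */
  for (i = 0; i < n; i++)
    {
      mean_x += x[i];
      mean_y += y[i];
    }
  mean_x /= n;
  mean_y /= n;

  /* Centered sums of squares and cross products. */
  for (i = 0; i < n; i++)
    {
      double dx = x[i] - mean_x;
      sxx += dx * dx;
      sxy += dx * (y[i] - mean_y);
    }
  assert (sxx > 0.0);

  fit.slope = sxy / sxx;
  fit.intercept = mean_y - fit.slope * mean_x;

  /* Residual sum of squares, used to estimate the noise variance V. */
  for (i = 0; i < n; i++)
    {
      double r = y[i] - (fit.intercept + fit.slope * x[i]);
      rss += r * r;
    }
  fit.variance = rss / (n - 2);

  return fit;
}
```

Note that the sums computed along the way (means, sxx, sxy) are the
kind of sufficient statistics that could be saved and reused later.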
I want to make the saved "model" an object (I don't mean to use C++)
that contains all the information from the MATRIX subcommand, as well
as pointers to functions to do common tasks like prediction and
finding residuals. I would also like to make a type called "model",
every instance of which stores at least the following information and
relevant accessor functions:
* Parameter estimates and their standard errors.
* Pointers to variables used to fit the models.
* Functions to return predicted values and prediction intervals.
* Functions to return residuals.
* Maybe more
Each model object would look a bit like the definition of
pspp_linreg_cache. Or perhaps
  struct model
  {
    void *m; /* points to a specific model type,
                like pspp_linreg_cache */
    double (*predict) (struct variable **vars, ...);
    double (*residual) (struct variable **vars, ...);
    /* ...whatever else */
  };
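
To make the idea concrete, here is a simplified, compilable sketch of
such a wrapper. It is only an illustration under assumptions: the
specific model type linear_model stands in for something like
pspp_linreg_cache, and the function pointers take a plain double
instead of PSPP variables. None of these names exist in PSPP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a specific model type. */
struct linear_model
{
  double intercept;
  double slope;
};

/* Generic wrapper: callers only go through the function pointers. */
struct model
{
  void *m;  /* points to the specific model */
  double (*predict) (const struct model *, double x);
  double (*residual) (const struct model *, double x, double y);
};

/* Prediction for the linear case: intercept + slope * x. */
static double
linear_predict (const struct model *mod, double x)
{
  const struct linear_model *lm = mod->m;
  return lm->intercept + lm->slope * x;
}

/* Residual = observed value minus predicted value. */
static double
linear_residual (const struct model *mod, double x, double y)
{
  return y - mod->predict (mod, x);
}

/* Wrap a fitted linear model in the generic type. */
struct model
model_from_linear (struct linear_model *lm)
{
  struct model mod;

  assert (lm != NULL);
  mod.m = lm;
  mod.predict = linear_predict;
  mod.residual = linear_residual;
  return mod;
}
```

Code that receives a struct model can then call mod.predict (&mod, x)
without knowing which kind of model is behind the pointer.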
R already makes model objects. If you estimate any model in R, it returns an
object that can predict, produce residuals or standard errors, etc.
I would like to make the model type in a way to allow different models
to be combined together. There is a lot of data mining literature
about this idea (Friedman, Hastie and Tibshirani's book is a good
reference). The ideas have names like boosting and bagging. A lot of
these methods lend themselves to the kind of parallelization Ben
mentioned. I don't know of any statistical software that can combine
models in a nice way. One reason the software doesn't exist is that
the guts of large models aren't directly interpretable by humans, but
statistical software is designed to create output that is
human-interpretable. But models that are aggregates of other models
often predict quite well, so I (and maybe other people) would like to
have a program that could aggregate models, and export them in some
useful format (like C).
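
As a rough sketch of what aggregating could look like in C, the code
below treats an aggregate as just another predictor whose prediction
is the average of its members' predictions. This is only the averaging
step used in bagging (the resampling and refitting are not shown), and
every name here (predictor, line, ensemble) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the generic model type. */
struct predictor
{
  void *m;  /* points to the specific model */
  double (*predict) (const struct predictor *, double x);
};

/* A trivial member model: a fitted line. */
struct line
{
  double intercept;
  double slope;
};

double
line_predict (const struct predictor *p, double x)
{
  const struct line *l = p->m;
  return l->intercept + l->slope * x;
}

/* An aggregate of other predictors. */
struct ensemble
{
  const struct predictor *members;
  size_t n_members;
};

/* Average the predictions of all members. */
static double
ensemble_predict (const struct predictor *p, double x)
{
  const struct ensemble *e = p->m;
  double sum = 0.0;
  size_t i;

  assert (e->n_members > 0);
  for (i = 0; i < e->n_members; i++)
    sum += e->members[i].predict (&e->members[i], x);
  return sum / e->n_members;
}

/* The aggregate is itself a predictor, so aggregates can be nested. */
struct predictor
predictor_from_ensemble (struct ensemble *e)
{
  struct predictor p;

  assert (e != NULL);
  p.m = e;
  p.predict = ensemble_predict;
  return p;
}
```

Because the aggregate has the same interface as its members, the
caller does not need to know whether it holds one model or many.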
> > 2. Export models in some external formats so they can be used by another
> > program later. The first format I was thinking of was compilable C. I
> > suppose other formats like XML ought to be supported too, since SPSS
> > can export some models as XML. Right now, REGRESSION has some ugly
> > functions that let it write little C programs. I'd like to clean that
> > code up and move it to a place where other procedures could use
> > it.
> >
> > To learn how to do numbers 1 and 2, I should write a modeling procedure
> > that fits a model quite different from that fit by REGRESSION, but one
> > whose purpose is, like regression, to find a function f(input) that
> > predicts some output. I was thinking of a neural network. Another
> > possibility is a regression tree. I don't want this next procedure to
> > resemble linear regression too closely, lest I inadvertently write
> > model-shuffling procedures closely tailored to manipulation of one
> > particular type of model.
>
>
> If you go down the neural net path, then I would suggest that a radial
> basis function net would be the thing to use.
Agreed. (Mostly because a multilayer perceptron has so many settings.)
-Jason