Re: data sets and caching

pspp-dev

From:	Jason Stover
Subject:	Re: data sets and caching
Date:	Mon, 31 Oct 2005 20:50:15 +0000
User-agent:	Mutt/1.4.2.1i

On Mon, Oct 31, 2005 at 10:25:20AM -0800, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
> 
> > I need to be able to append residuals to the active file
> > with a 'save' subcommand. How should I go about this?
> 
> Would you like to save them for a single session only,
> or should it be possible to save them to disk and retrieve them
> in later sessions?

Good question. I had intended to save them to the working data file,
as the SPSS SAVE subcommand does in its regression procedure.  Users
mostly like to look at residuals and run tests on them after the model
has been fit. But if this working data file is written to disk, the
residuals are written with it, and can be used later. 

> 
> [...]
> 
> > Here is an example of syntax that shows what users would want
> > to be able to do (I'm using hypothetical syntax to illustrate
> > the idea):
> >
> > regression /data=train_data /variables=v0 v1 v2 /statistics default
> >      /dependent=v2 /method=enter /name=model1.
> >
> > nlr /data=train_data /variables=v0 v1 v2 /statistics default
> >      /dependent=v2 /method=enter /name=model2.
> >
> > model_compare /data=test_data model1 model2 /criteria ssresid absdev.
> 
> Let me see if I understand this.  Please correct me if I am
> wrong: REGRESSION and NLR take the same input data (training
> data) and fit its structure according to different models.
> REGRESSION's model is saved as model1, NLR's model as model2.
> Then MODEL_COMPARE compares the effectiveness of these models on
> a second set of data (test data), using the saved models.

Yes.

> 
> > This syntax illustrates two design changes that would make pspp more
> > flexible for users.
> >
> > 1. The user can name the output from any procedure.  [...]
> 
> This looks good to me.  Do you have a good idea for syntax?  It
> would be nice if the syntax were uniform across procedures, so
> we'd want a keyword that wasn't already used (much) and ideally
> one unique in its first three letters.  "name" seems a little too
> generic for that purpose.

I do not have a good idea for syntax, but will look into it. If 'name'
is too generic, then the subcommand should indicate that we are naming
a model, so maybe 'modname' or 'mname'? (I'll think about it more.)

[...]

> 
> When would the cache no longer be needed?  i.e. do models ever
> become invalid?
> 

Right now, I can think of only two types of situations that
unequivocally necessitate freeing a model cache: The user creates a
new model with the name of an old one, or the memory/disk space fills
up. Sometimes a model may become invalid, but we cannot know when that
will happen while we are writing PSPP. There should probably be some
way to allow a user to free a cache via syntax, if the user decides to
do so. I have accidentally used a model object in R after I should
have destroyed it, and R has either crashed or produced garbage as a
result. The same has happened to me with Clementine and SAS.
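
A minimal sketch of the cache behaviour described above, written in
Python only for brevity (PSPP itself is written in C). Everything in it
is hypothetical: the ModelCache class, its method names, and the
max_models limit, which stands in for "the memory/disk space fills up".
Evicting the oldest model first is an assumption made for the sketch,
not something decided in this thread.

```python
class ModelCache:
    """Hypothetical registry of fitted models, keyed by a user-chosen name."""

    def __init__(self, max_models=None):
        self._models = {}              # dicts keep insertion order
        self._max_models = max_models  # stand-in for a memory/disk limit

    def store(self, name, model):
        # Case 1: a new model reuses an old name, so the old one is freed.
        self._models.pop(name, None)
        # Case 2: space is exhausted, so the oldest models are evicted
        # (assumed policy) until there is room for the new one.
        if self._max_models is not None:
            while self._models and len(self._models) >= self._max_models:
                oldest = next(iter(self._models))
                del self._models[oldest]
        self._models[name] = model

    def get(self, name):
        # Fail loudly instead of handing back a destroyed model.
        if name not in self._models:
            raise LookupError(f"no cached model named {name!r}")
        return self._models[name]

    def free(self, name):
        # Explicit release, as a user might request through syntax.
        self.get(name)
        del self._models[name]

    def names(self):
        return list(self._models)
```

The point of get() raising an error is to avoid the failure mode
mentioned above, where a model is used after it should have been
destroyed and the program breaks or outputs garbage.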

> > 2. Users can name data sets to be used in a procedure. Then PSPP could
> > fit models to different data sets and evaluate them using a 'test'
> > data set. PSPP could also be made to manipulate multiple data sets
> > (such as merging them). SAS users spend a lot of time sorting,
> > merging, concatenating and de-duplicating data sets. SPSS does not
> > allow this, and that is one reason for SAS' popularity. PSPP's
> > inability to do this makes it less attractive to users. I know
> > this functionality lies beyond cloning SPSS, but it is functionality
> > users find important, and other free statistical software can't do it
> > (as far as I know). R names each data set, and it can sort, but users
> > cannot combine and de-duplicate data sets as easily as they can with
> > SAS. R cannot work with the large data sets that SAS can use, either.
> 
> This is the "data" keyword above?  Would this simply be a matter
> of supporting multiple, named "active files"?  I think that would
> not be a huge amount of work, although it would be kind of tricky
> to verify it was correct.  Most of the representation of the
> active file is encapsulated in the `dictionary' object, and it
> would be possible to add support for multiple instances of other
> objects (e.g. the virtual file manager) as necessary.
> 
> The work needed is partly clean-ups in the code base that I want
> to do anyway.

Yes, that is what I was thinking of with the 'data' keyword. 
It would be a matter of supporting multiple, named active files.

> 
> I don't know whether a "name" keyword on procedures would be
> sufficient for this purpose, because transformations that precede
> procedure invocation need to know what active file they're
> working out of.  That's assuming that the different active files
> can have different dictionaries; if their dictionaries are
> identical and they just have different data sets, then it
> wouldn't be necessary as far as I can tell.

Now I have a question: Do you mean that the 'name' keyword would be
insufficient because just naming the cache of a procedure tells
it nothing of the data set used to create the cache? So if a data
set is modified, that cache may believe incorrect things about that
data set? 
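
One way to picture the concern in that question, as a sketch only: if
each named data set carried a modification counter, a cached model could
record which data set and which version it was fitted to, and a later
check could tell whether the data has changed since. The DataSet and
CachedModel classes, the version field, and fit_mean (a stand-in for a
real procedure such as REGRESSION) are all hypothetical and written in
Python for brevity; nothing like this exists in PSPP as far as this
thread shows.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """Hypothetical named data set with a modification counter."""
    name: str
    rows: list = field(default_factory=list)
    version: int = 0

    def append(self, row):
        self.rows.append(row)
        self.version += 1  # any modification bumps the version


@dataclass(frozen=True)
class CachedModel:
    """Hypothetical cached result that remembers where it came from."""
    name: str
    source_name: str
    source_version: int
    summary: float


def fit_mean(model_name, data):
    # Stand-in for a real procedure: the "model" is just the mean.
    mean = sum(data.rows) / len(data.rows)
    return CachedModel(model_name, data.name, data.version, mean)


def is_stale(model, datasets):
    # Stale if the source data set is gone or has changed since the fit.
    source = datasets.get(model.source_name)
    return source is None or source.version != model.source_version
```

Applying a model to a different data set, as MODEL_COMPARE does with
test_data, would not make it stale under this check; only changes to the
data set it was fitted on would.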

Do you think it would be beneficial to use a garbage collector
for cache allocation? (Like Boehm's, which is not entirely GPL'd?)

-Jason

-- 
address@hidden
SDF Public Access UNIX System - http://sdf.lonestar.org

Current Thread

data sets and caching, Jason Stover, 2005/10/31
- Re: data sets and caching, Ben Pfaff, 2005/10/31
  - Re: data sets and caching, Jason Stover <=
    - Re: data sets and caching, Ben Pfaff, 2005/10/31
