Re: regression lib

From: Ben Pfaff
Subject: Re: regression lib
Date: Mon, 02 May 2005 08:24:57 -0700
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> Currently there's no caching of statistics.  Each procedure
> calculates them for itself, which is less than ideal because it leads
> to a lot of duplication.  For example, group.c largely duplicates
> factor_stats.c.

Hmm.  If so, I think that's probably orthogonal to the caching
problem.  Is there some reason those files can't share some
common code to perform their common functionality?  For example,
DESCRIPTIVES and FREQUENCIES both output descriptive statistics,
but those have been factored into the `moments' framework in
moments.[ch] so there's little redundant code.
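The appeal of that factoring is that each procedure feeds cases into a shared accumulator and reads the statistics out afterward, so the summation logic lives in one place.  A rough sketch of that shape (the names here are illustrative, not the actual moments.[ch] interface):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical one-pass accumulator in the spirit of moments.[ch]:
   procedures add weighted cases, then derive mean and variance from
   the accumulated sums. */
struct moments1 {
    double n;       /* total weight */
    double sum;     /* sum of w*x */
    double sum_sq;  /* sum of w*x*x */
};

static void moments1_init(struct moments1 *m) {
    m->n = m->sum = m->sum_sq = 0.0;
}

static void moments1_add(struct moments1 *m, double x, double weight) {
    m->n += weight;
    m->sum += weight * x;
    m->sum_sq += weight * x * x;
}

static double moments1_mean(const struct moments1 *m) {
    return m->sum / m->n;
}

/* Sample variance derived from the sums (n - 1 in the denominator). */
static double moments1_variance(const struct moments1 *m) {
    double mean = moments1_mean(m);
    return (m->sum_sq - m->n * mean * mean) / (m->n - 1);
}
```

Any procedure that needs descriptives then only decides *which* cases to feed in; the arithmetic is shared.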

> --- I think there should be some framework for caching these
> values like Jason suggests.  The problem is coming up with a
> model which is flexible enough to suit our purpose and yet
> simple enough to understand.
> It's not only mean and stddev.  I can foresee dozens of procedures
> which need to calculate SST, SSE, etc.  It would be good if
> applications could just look these values up in a cache.  But there
> are a lot of issues to consider:
> * The cache would have to be invalidated every time a transformation
>   is done.

This is something we'll just have to deal with.  I don't think
it's too hard.  We just add a `statcache_invalidate(variable)'
function and call it for the modified variables from every
transformation that modifies variables, plus a
`statcache_invalidate_all()' function that invalidates everything
for procedures that modify the entire file (e.g. MATCH FILES).
(SORT might be an interesting special case--it wouldn't, for
example, disturb descriptive statistics or frequency tables.)
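A minimal sketch of what those two entry points might look like, assuming a flat per-variable cache keyed by variable name (everything here is hypothetical, not an existing PSPP interface):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical statcache: one slot per cached statistic, plus the two
   invalidation entry points described above. */
#define MAX_ENTRIES 32

struct stat_entry {
    const char *var;   /* variable the statistic belongs to */
    double mean;       /* the cached value (just mean, for brevity) */
    int valid;
};

static struct stat_entry cache[MAX_ENTRIES];
static size_t n_entries;

static void statcache_put(const char *var, double mean) {
    cache[n_entries++] = (struct stat_entry) { var, mean, 1 };
}

/* Called for each variable modified by a transformation. */
static void statcache_invalidate(const char *var) {
    for (size_t i = 0; i < n_entries; i++)
        if (!strcmp(cache[i].var, var))
            cache[i].valid = 0;
}

/* Called by procedures that modify the entire file, e.g. MATCH FILES. */
static void statcache_invalidate_all(void) {
    for (size_t i = 0; i < n_entries; i++)
        cache[i].valid = 0;
}

/* Returns nonzero and fills *mean on a hit; on a miss the caller
   recomputes and re-caches. */
static int statcache_get(const char *var, double *mean) {
    for (size_t i = 0; i < n_entries; i++)
        if (cache[i].valid && !strcmp(cache[i].var, var)) {
            *mean = cache[i].mean;
            return 1;
        }
    return 0;
}
```

The SORT special case would just mean *not* calling `statcache_invalidate_all()` from SORT, since reordering cases leaves per-variable statistics intact.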

> * Caching would be useful not only on complete variables, but also on
>   subsets of cases.  E.g. variable X, factored by variable Y.  So how
>   does one define all the possibilities?

I have two ideas:

        1. Ignore the problem.  Only cache statistics on complete
           variables.

        2. Try to handle some special cases as special cases.
           For example, if FILTER BY <VAR> is in effect, then we
           could cache those values as long as FILTER BY <VAR>
           remained in effect and <VAR> was unmodified.
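Idea 2 amounts to widening the cache key: a cached value records both the analysis variable and the filter variable that was in effect, and a lookup only hits while the same filter is still active.  A sketch of such a key (hypothetical, with NULL standing for "no FILTER BY"):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cache key for idea 2: statistics cached under FILTER BY <VAR> are
   reusable only while that same filter remains in effect. */
struct subset_key {
    const char *var;         /* variable the statistic describes */
    const char *filter_var;  /* FILTER BY variable, or NULL if none */
};

static int subset_key_equal(const struct subset_key *a,
                            const struct subset_key *b) {
    if (strcmp(a->var, b->var))
        return 0;
    /* Both unfiltered, or both filtered by the same variable. */
    if (a->filter_var == NULL || b->filter_var == NULL)
        return a->filter_var == b->filter_var;
    return !strcmp(a->filter_var, b->filter_var);
}
```

Modifying the filter variable itself would still have to invalidate any entries keyed on it, same as for the analysis variable.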

> * Each statistic (eg: mean, stddev) will be different depending upon
>   the specification of the procedure's /MISSING subcommand.

The most common case is "itemwise" missing with user-missing
values removed.  We can ignore other cases if we want to.  When
you're caching, you want to save time in the most common cases.
If you can save time in other cases, too, that's great, but it's
not as valuable because they don't come up as much.
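Concretely, the common case to cache is per-variable: drop any case whose value is user-missing for that variable, then compute from the rest.  A toy version of that treatment (`is_missing` here is a stand-in for the real missing-value test, not PSPP's):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Stand-in for a user-missing test: true if x is one of the variable's
   declared missing values. */
static int is_missing(double x, const double *missing, size_t n_missing) {
    for (size_t i = 0; i < n_missing; i++)
        if (x == missing[i])
            return 1;
    return 0;
}

/* Itemwise mean: each variable independently skips its own missing
   cases, rather than dropping a case from every variable (listwise). */
static double itemwise_mean(const double *x, size_t n,
                            const double *missing, size_t n_missing) {
    double sum = 0.0, valid = 0.0;
    for (size_t i = 0; i < n; i++)
        if (!is_missing(x[i], missing, n_missing)) {
            sum += x[i];
            valid += 1.0;
        }
    return valid > 0 ? sum / valid : NAN;
}
```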

> All these things complicate the implementation and would mean that the
> potential cache space would be quite large.

But you don't reserve space for all of them on each variable.
You just allocate space as you need it.  Furthermore, because the
cache is just an optimization, you can throw it, or part of it,
away if it gets too large.
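That disposability is the key property: because a miss only costs a recomputation, entries can be allocated lazily and the whole cache flushed when it grows past a budget.  A toy illustration of that policy (not a real eviction scheme):

```c
#include <assert.h>
#include <stdlib.h>

/* The cache as a disposable optimization: space is allocated only when
   a statistic is actually computed, and everything is thrown away when
   the budget is exceeded.  Correctness never depends on the cache. */
struct disposable_cache {
    double *entries;
    size_t n;
    size_t budget;  /* maximum number of entries to keep */
};

static void cache_add(struct disposable_cache *c, double value) {
    if (c->n >= c->budget) {
        /* Over budget: discard everything.  The values will simply be
           recomputed on demand if they are needed again. */
        free(c->entries);
        c->entries = NULL;
        c->n = 0;
    }
    c->entries = realloc(c->entries, (c->n + 1) * sizeof *c->entries);
    c->entries[c->n++] = value;
}
```

A real implementation would evict selectively (e.g. least recently used) rather than flushing wholesale, but the principle is the same.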

I think this came up before and I raised some of these same
objections.  They are problems, sure.  But they are problems we
can deal with and I think we should, sometime post-0.4.0.
Ben Pfaff 
email: address@hidden
