pspp-dev

address@hidden: Re: category.c]


From: Jason Stover
Subject: address@hidden: Re: category.c]
Date: Mon, 20 Mar 2006 10:51:15 -0500
User-agent: Mutt/1.5.10i

(I forgot to reply to the list.)

----- Forwarded message from Jason Stover <address@hidden> -----

Date: Mon, 20 Mar 2006 10:03:27 -0500
From: Jason Stover <address@hidden>
To: John Darrington <address@hidden>
Subject: Re: category.c
In-Reply-To: <address@hidden>
User-Agent: Mutt/1.5.10i

On Mon, Mar 20, 2006 at 09:03:21AM +0800, John Darrington wrote:
> I've been thinking about re-implementing T-TEST, ONEWAY and EXAMINE,
> using category.c and thus retiring the rather ad hoc group.c and
> factor-stats.c files.
> 
> Several questions about category.c :
> 
> 
> 1. cat_value_find uses a linear search.  Might it not be better to use
>    a hash instead?

Yes. category.c is my first attempt at caching the information
related to categorical variables, and there is probably a lot
of room for improvement.
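
For illustration, here is a minimal sketch of what a hashed lookup
could look like in place of the linear search. Everything in it is an
assumption made for the example, not the actual category.c code: the
struct cat_table layout, the fixed CAT_SLOTS capacity, the cat_table_*
names, and the use of plain C strings as category values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CAT_SLOTS 64                  /* hypothetical maximum number of categories */
#define CAT_HASH_SIZE (CAT_SLOTS * 2) /* kept half empty so probing always ends */
#define CAT_EMPTY ((size_t) -1)

/* Hypothetical stand-in for the stored category values. */
struct cat_table
  {
    const char *vals[CAT_SLOTS];      /* observed values, in insertion order */
    size_t n_vals;
    size_t slot[CAT_HASH_SIZE];       /* open-addressing index into vals[] */
  };

/* djb2 string hash. */
static size_t
hash_string (const char *s)
{
  size_t h = 5381;
  while (*s != '\0')
    h = h * 33 + (unsigned char) *s++;
  return h;
}

static void
cat_table_init (struct cat_table *t)
{
  size_t i;
  t->n_vals = 0;
  for (i = 0; i < CAT_HASH_SIZE; i++)
    t->slot[i] = CAT_EMPTY;
}

/* Returns the index of S in vals[], or CAT_EMPTY if S has not been seen. */
static size_t
cat_table_find (const struct cat_table *t, const char *s)
{
  size_t i = hash_string (s) % CAT_HASH_SIZE;
  while (t->slot[i] != CAT_EMPTY)
    {
      if (strcmp (t->vals[t->slot[i]], s) == 0)
        return t->slot[i];
      i = (i + 1) % CAT_HASH_SIZE;    /* linear probing */
    }
  return CAT_EMPTY;
}

/* Adds S if absent.  Returns its index, or CAT_EMPTY if the table is full. */
static size_t
cat_table_update (struct cat_table *t, const char *s)
{
  size_t i;
  assert (t != NULL && s != NULL);
  i = hash_string (s) % CAT_HASH_SIZE;
  while (t->slot[i] != CAT_EMPTY)
    {
      if (strcmp (t->vals[t->slot[i]], s) == 0)
        return t->slot[i];
      i = (i + 1) % CAT_HASH_SIZE;
    }
  if (t->n_vals == CAT_SLOTS)
    return CAT_EMPTY;
  t->vals[t->n_vals] = s;             /* stores the pointer, not a copy */
  t->slot[i] = t->n_vals;
  return t->n_vals++;
}
```

The sketch keeps vals[] in insertion order, so each category keeps a
stable index, while lookups take expected constant time. It stores
the caller's pointers rather than copies, so the strings must outlive
the table.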

> 2. Do we really need cat-routines.h ? Can it not be merged into
>    category.h ?

Separating them was a hack to prevent a build break, and the need to
do so may no longer exist. My memory is vague here, but there was an
email discussion that I can no longer find. The problem was something
like this: Most routines do not need to know about anything in
category.h or cat-routines.h, but variable.h includes category.h. When
cat-routines.h and category.h were in the same file, they caused some
compile-time errors when files that included variable.h did not also
know about everything related to category.h. I *think* the trouble may
have been a *.h file that referred to struct design_matrix. Whatever
the cause, I split category.h into two files, which may not have been
the best solution. And now, any need to keep them apart may no longer
exist.

> 3. cat_value_update seems to do nothing for numeric variables.  Why is
>    this?  A numeric variable can be used as a categorical variable
>    just as easily as an alpha one.

Good point. Encoding numeric data as categorical is usually a mistake
from a statistical standpoint, but there are circumstances when
treating a numeric variable as categorical makes perfect sense, so
maybe cat_value_update() shouldn't care what type of variable it is
looking at. This is where the question 'should we protect the user?'
comes up. Someone who inadvertently treats a numeric variable with,
say, 10^5 distinct values as categorical could wind up running a
procedure with 0 or negative degrees of freedom; slowing the machine
to a crawl; or, worst of all, finding bugs we'd rather not know
about. But users should probably have the ability to treat numeric
data as categorical if they want to.
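
One way to frame the 'protect the user' question is a check that
reports the problem and leaves the decision to the caller. The sketch
below is only an illustration: enum cat_check, CAT_MAX_LEVELS and
cat_check_levels() are hypothetical names, the cap of 1000 is
arbitrary, and the degrees-of-freedom test assumes a one-way layout
in which the error degrees of freedom are n_obs - n_levels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; nothing here is part of PSPP. */
enum cat_check
  {
    CAT_OK,                 /* safe to proceed */
    CAT_NO_ERROR_DF,        /* error degrees of freedom would be <= 0 */
    CAT_TOO_MANY_LEVELS     /* suspiciously many distinct values */
  };

#define CAT_MAX_LEVELS 1000 /* hypothetical, arbitrary cap */

/* Checks a variable with N_LEVELS distinct values observed over N_OBS
   cases.  In a one-way layout the error degrees of freedom are
   N_OBS - N_LEVELS, so N_LEVELS >= N_OBS leaves none. */
static enum cat_check
cat_check_levels (size_t n_obs, size_t n_levels)
{
  assert (n_levels <= n_obs);
  if (n_levels >= n_obs)
    return CAT_NO_ERROR_DF;
  if (n_levels > CAT_MAX_LEVELS)
    return CAT_TOO_MANY_LEVELS;
  return CAT_OK;
}
```

A caller could turn CAT_TOO_MANY_LEVELS into a warning rather than an
error, which would still let users treat numeric data as categorical
when they really mean to.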

> 4. If I'm reading the code right, cat_stored_values_destroy is leaky.
>    It frees obs_vals, but doesn't tidy up obs_vals->vals .
>    Also, shouldn't it set v->obs_vals to NULL after freeing?

You're right. That's a problem. I'll fix it soon if no one else fixes
it first. 
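
A minimal sketch of the fix described in question 4, assuming a
simplified layout: the struct definitions below are stand-ins invented
for the example (only the names obs_vals, vals and
cat_stored_values_destroy come from the discussion), and the element
type of vals is a guess.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real PSPP structures. */
struct cat_vals
  {
    double *vals;           /* dynamically allocated observed values */
    size_t n_vals;
  };

struct variable
  {
    struct cat_vals *obs_vals;
  };

/* Frees the stored values of V, including the inner array, and clears
   the pointer so that a second call is harmless. */
static void
cat_stored_values_destroy (struct variable *v)
{
  assert (v != NULL);
  if (v->obs_vals != NULL)
    {
      free (v->obs_vals->vals);   /* free (NULL) is a no-op */
      free (v->obs_vals);
      v->obs_vals = NULL;
    }
}
```

Setting v->obs_vals to NULL makes the function idempotent and turns a
later use-after-free into an immediate null dereference, which is
easier to diagnose.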

While we're on the topic, is anyone in favor of using a garbage
collector in PSPP?

-Jason


----- End forwarded message -----

-- 
Jason Stover
Assistant Professor
Mathematics Department
Georgia Kung Fu & State University
"Georgia's public martial arts university"
On the web at www.gksu.edu



