Quick Cluster and Casereaders

pspp-dev

From:	John Darrington
Subject:	Quick Cluster and Casereaders
Date:	Sat, 26 Mar 2011 09:19:27 +0000
User-agent:	Mutt/1.5.18 (2008-05-17)

Like you said, I think the time has come to see if we can avoid
copying the entire dataset into a gsl_matrix, since this wastes
both memory and cpu cycles.

Rather than attempting to do this all at once, I suggest that initially we
leave the kmeans->data matrix in place, but try to avoid using 
it.  I think this approach will make the development easier, 
although we shan't see any benefit until the exercise is finished.

There are a couple of ground rules about using casereaders:

1. In general, only sequential access to the data is possible.
   Random access like one can do with an array or gsl_matrix
   is prohibited.

2. A casereader can be used once only.  It is not possible to
   re-read the same case from a casereader once it's been read.
   If you want to read the data a second time, then the casereader
   must be copied, using casereader_clone.  In Quick Cluster we
   iterate over the data multiple times, so whenever we do so, we must
   remember to enclose the iteration with casereader_clone/casereader_destroy.

3. A casereader (like the name suggests) reads cases.  Cases can be thought
   of as arrays.  But we normally "index" into them using a struct variable
   instead of an integer, thus: double x = case_data (ccase, var)->f;


So with that in mind, several changes can be immediately made:

* We'll need a pointer to the casereader to be a member of struct Kmeans.

* In kmeans_create, before we call casereader_read, we must take a copy of
  the casereader using casereader_clone.  We must also destroy the casereader
  after it has been read using casereader_destroy.

* In kmeans_calculate_indexed_and_check_convergence (if that name is too
  cumbersome for you, feel free to change it) the loop changes from:

    for (i = 0; i < kmeans->n; i++)

  to

    for (; (c = casereader_read (cs)) != NULL; case_unref (c))

  (Again, don't forget to call casereader_clone first, and
  casereader_destroy afterwards.)

* Now we have to change the signature of kmeans_euclidian_distance.
  It needs to become:
  
  double
  kmeans_euclidean_distance (struct Kmeans *, const gsl_vector *,
                             const struct ccase *);
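
As a rough illustration of the shape such a distance function might take,
here is a standalone sketch.  It is only a model under stated assumptions:
struct toy_case and struct toy_kmeans are hypothetical stand-ins for
struct ccase and struct Kmeans, the center is a plain array rather than a
gsl_vector, and it returns the squared distance (enough for comparing
which center is nearest).  In the real code the case value would come from
case_data (c, var)->f and the center value from gsl_vector_get.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ccase: one row of numeric values. */
struct toy_case
  {
    const double *values;
  };

/* Hypothetical stand-in for the relevant part of struct Kmeans. */
struct toy_kmeans
  {
    size_t m;   /* Number of variables (columns). */
  };

/* Squared Euclidean distance between CENTER and the case C,
   taken over the KM->m variables. */
static double
toy_squared_distance (const struct toy_kmeans *km, const double *center,
                      const struct toy_case *c)
{
  double dist = 0.0;

  for (size_t j = 0; j < km->m; j++)
    {
      double diff = center[j] - c->values[j];
      dist += diff * diff;
    }
  return dist;
}
```

Note that the function only ever sees one case at a time, so it no longer
needs a row index into kmeans->data.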


If we do this, and the autotest still works, then we'll be halfway to having
eliminated kmeans->data.  I think it'll be easier to see what we need to do
next if you go through the code and replace "kmeans->data->size2" with
"kmeans->m".  Similarly, kmeans->centers->size2 can become kmeans->m and
kmeans->centers->size1 becomes kmeans->ngroups, etc.




Good luck.  Email or irc if you have questions.

J'


-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
