Quick Cluster and Casereaders
From: John Darrington
Subject: Quick Cluster and Casereaders
Date: Sat, 26 Mar 2011 09:19:27 +0000
User-agent: Mutt/1.5.18 (2008-05-17)
Like you said, I think the time has come to see if we can avoid
copying the entire dataset into a gsl_matrix, since this wastes
both memory and cpu cycles.
Rather than attempting to do this all at once, I suggest that initially we
leave the kmeans->data matrix in place, but try to avoid using
it. I think this approach will make the development easier,
although we shan't see any benefit until the exercise is finished.
There are a couple of ground rules about using casereaders:
1. In general, only sequential access to the data is possible.
   Random access, as one can do with an array or gsl_matrix,
   is not possible.
2. A casereader can be used only once. It is not possible to
   re-read a case from a casereader once it has been read.
   If you want to read the data a second time, then the casereader
   must be copied, using casereader_clone. In Quick Cluster we
   iterate over the data multiple times, so whenever we do so, we must
   remember to bracket the iteration with casereader_clone and
   casereader_destroy.
3. A casereader (as the name suggests) reads cases. Cases can be thought
   of as arrays, but we normally "index" into them using a struct variable
   instead of an integer, thus:
       double x = case_data (ccase, var)->f;
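To make rule 2 concrete, here is a rough sketch of the clone/iterate/destroy pattern. It does not use PSPP's headers: the toy_* types and functions are hypothetical stand-ins that mimic only the read-once behaviour, with toy_reader_clone, toy_reader_read, toy_case_unref and toy_reader_destroy playing the roles of casereader_clone, casereader_read, case_unref and casereader_destroy. The helper sum_column is likewise invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins, NOT PSPP's types: a "case" is one row of doubles and a
   reader is a forward-only cursor over a fixed array of rows. */
struct toy_case { const double *values; };

struct toy_reader
  {
    const double *rows;     /* n_rows * n_vars values, row-major. */
    size_t n_rows, n_vars;
    size_t pos;             /* Index of the next row to hand out. */
  };

static struct toy_reader *
toy_reader_create (const double *rows, size_t n_rows, size_t n_vars)
{
  struct toy_reader *r = malloc (sizeof *r);
  assert (r != NULL);
  r->rows = rows;
  r->n_rows = n_rows;
  r->n_vars = n_vars;
  r->pos = 0;
  return r;
}

/* Role of casereader_clone: an independent cursor at the same position. */
static struct toy_reader *
toy_reader_clone (const struct toy_reader *orig)
{
  struct toy_reader *r
    = toy_reader_create (orig->rows, orig->n_rows, orig->n_vars);
  r->pos = orig->pos;
  return r;
}

/* Role of casereader_read: returns the next case, or NULL at the end.
   There is no way to step backwards on the same reader. */
static struct toy_case *
toy_reader_read (struct toy_reader *r)
{
  if (r->pos >= r->n_rows)
    return NULL;
  struct toy_case *c = malloc (sizeof *c);
  assert (c != NULL);
  c->values = r->rows + r->pos * r->n_vars;
  r->pos++;
  return c;
}

static void toy_case_unref (struct toy_case *c) { free (c); }
static void toy_reader_destroy (struct toy_reader *r) { free (r); }

/* One pass over the data: clone, iterate over the clone, destroy the
   clone.  The original reader is not advanced, so a later pass can
   repeat the same steps. */
static double
sum_column (const struct toy_reader *original, size_t var_idx)
{
  struct toy_reader *cs = toy_reader_clone (original);
  struct toy_case *c;
  double sum = 0.0;

  for (; (c = toy_reader_read (cs)) != NULL; toy_case_unref (c))
    sum += c->values[var_idx];

  toy_reader_destroy (cs);
  return sum;
}
```

Because every pass works on its own clone, sum_column can be called any number of times on the same original reader, which is the property Quick Cluster needs for its repeated iterations.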
So with that in mind, several changes can be immediately made:
* We'll need a pointer to the casereader to be a member of struct Kmeans.
* In kmeans_create, before we call casereader_read, we must take a copy of
the casereader using casereader_clone. We must also destroy the casereader
after it has been read using casereader_destroy.
* In kmeans_calculate_indexed_and_check_convergence (if that name is too
cumbersome for you, feel free to change it) the loop
    for (i = 0; i < kmeans->n; i++)
should change to
    for (; (c = casereader_read (cs)) != NULL; case_unref (c))
(Again, don't forget to call casereader_clone first, and
casereader_destroy afterwards.)
* Now we have to change the signature of kmeans_euclidean_distance.
It needs to become:
    double
    kmeans_euclidean_distance (struct Kmeans *, const gsl_vector *,
                               const struct ccase *);
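As a rough illustration of what a distance function taking a case instead of a matrix row might look like, here is a self-contained sketch. Everything prefixed mock_ is a hypothetical stand-in, not PSPP's or GSL's definition: mock_case_data plays the role of case_data, a plain array of doubles stands in for the gsl_vector holding one center, and the vars member is an assumed way of recording which variables are being clustered. The sketch returns the squared distance, which is enough to pick the nearest center; whether the real function should take the square root is not something this sketch decides.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_VARS 8

/* Toy stand-ins, NOT PSPP's definitions. */
union mock_value { double f; };
struct mock_variable { size_t case_index; };
struct mock_case { union mock_value values[MOCK_MAX_VARS]; };

/* Role of case_data (c, var): look a value up by variable rather than
   by a bare integer index. */
static const union mock_value *
mock_case_data (const struct mock_case *c, const struct mock_variable *var)
{
  assert (var->case_index < MOCK_MAX_VARS);
  return &c->values[var->case_index];
}

struct mock_kmeans
  {
    const struct mock_variable **vars;  /* Variables being clustered (assumed member). */
    size_t m;                           /* Number of such variables. */
  };

/* Squared Euclidean distance between one center and one case.
   center[j] corresponds to kmeans->vars[j]. */
static double
mock_squared_distance (const struct mock_kmeans *kmeans,
                       const double *center, const struct mock_case *c)
{
  double dist = 0.0;
  for (size_t j = 0; j < kmeans->m; j++)
    {
      double diff = center[j] - mock_case_data (c, kmeans->vars[j])->f;
      dist += diff * diff;
    }
  return dist;
}
```

Note that the function never needs to know which row of the dataset the case came from, so it works equally well inside a casereader_read loop.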
If we do this, and the autotest still works, then we'll be halfway to having
eliminated kmeans->data. I think it'll be easier to see what we need to do
next if you go through the code and replace "kmeans->data->size2" with
"kmeans->m". Similarly, "kmeans->centers->size2" can become "kmeans->m",
"kmeans->centers->size1" becomes "kmeans->ngroups", etc.
Good luck. Email or irc if you have questions.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.