From: John Darrington
<address@hidden>
To: Mehmet Hakan Satman <address@hidden>
Cc: address@hidden
Sent: Wed, March 23, 2011 12:58:08 PM
Subject: Quick cluster time and space optimisation
On Tue, Mar 22, 2011 at 02:53:45AM -0700, Mehmet Hakan Satman wrote:
Hi friends,
I corrected the source file and prepared the test files as John suggested. I
created the file quick-cluster.at with a test data. I run the "make check"
command and some of the output of this process is like:
436: T-TEST invalid syntax ok
437: T-TEST string variable ok
438: T-TEST string variable, only one value ok
439: T-TEST string variable comparison bug ok
QUICK CLUSTER
440: QUICK CLUSTER ok
Wonderful!
Now that we've got a working autotest we can start optimising the
algorithm without fear of unwittingly breaking it. This is where it
starts to get interesting!
The biggest problem I see at the moment is that it is allocating
potentially huge blocks of memory. PSPP (and spss) is designed to
cope with extremely large amounts of
data. So we should try to
accommodate this in the design of quick-cluster.
So imagine for the moment that somebody wants to perform
quick-cluster 500 variables (m), and 10,000,000 cases (n). PSPP would
run out of memory before it even starts to do the calculation.
There's a number of places where we should eliminate these memory
allocations, but let's deal with them one at a time.
Firstly, I see that the entire dataset is copied not once, but twice:
Once before the call to kmeans_create where you copy the casereader
into the array. And then again inside kmeans_create where you copy
the array into a gsl_matrix. I did some profiling on my system and
this accounted for a significant proportion of the time taken to
perform QUICK CLUSTER.
So, I suggest that we start by changing the signature of
kmeans_create from
struct Kmeans *
kmeans_create (double *data,
int n, int m, int ngroups, int maxiter);
struct Kmeans *
kmeans_create (struct casereader *cr, int m, int ngroups, int maxiter);
This will save at least one iteration through the data and one big
chunk of memory. Later we can improve on this.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See
http://pgp.mit.edu or any PGP keyserver for public key.