pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Quick cluster time and space optimisation


From: Mehmet Hakan Satman
Subject: Re: Quick cluster time and space optimisation
Date: Thu, 24 Mar 2011 07:02:41 -0700 (PDT)

Hi John,

The new entry point of kmeans_create() is

static struct Kmeans *
kmeans_create (struct casereader *cs, const struct variable **variables, int n, int m, int ngroups, int maxiter)

where 'struct casereader *cs' and 'const struct variable **variables' are created in cmd_quick_cluster(). Parsing works are done in cmd_quick_cluster and casereader is read and directly written to gsl_matrix. There is no longer 'double *data' for data handling and twice copying.

I corrected the quick_cluster.c by deleting the redundant variable definitions and commenting the unused methods. I have not seen any compilation errors an 'make check' results are 'ok' for each single test.

I am sending the file.

Best.

 
Mehmet Hakan Satman
http://www.mhsatman.com



From: John Darrington <address@hidden>
To: Mehmet Hakan Satman <address@hidden>
Cc: address@hidden
Sent: Wed, March 23, 2011 5:54:18 PM
Subject: Re: Quick cluster time and space optimisation

That was my idea, yes.

J'

On Wed, Mar 23, 2011 at 07:35:41AM -0700, Mehmet Hakan Satman wrote:
    Hi John,
   
    the method:
    struct Kmeans *
    kmeans_create (struct casereader *cr, int m, int ngroups, int maxiter);
   
    will read the data from the given casereader and write to gsl_matrix directly.
    We bring the
   
    data = "" (numobs * n * sizeof (double));
   
    away. And we will only use gsl_matrix. Right?
   
      Mehmet Hakan Satman
    http://www.mhsatman.com
   
   
   
   
   
    ________________________________
    From: John Darrington <address@hidden>
    To: Mehmet Hakan Satman <address@hidden>
    Cc: address@hidden
    Sent: Wed, March 23, 2011 12:58:08 PM
    Subject: Quick cluster  time and space optimisation
   
    On Tue, Mar 22, 2011 at 02:53:45AM -0700, Mehmet Hakan Satman wrote:
          Hi friends,
       
          I corrected the source file and prepared the test files as John suggested.
    I
   
          created the file quick-cluster.at with a test data. I run the "make check"
          command and some of the output of this process is like:
       
          436: T-TEST invalid syntax                          ok
          437: T-TEST string variable                          ok
          438: T-TEST string variable, only one value          ok
          439: T-TEST string variable comparison bug          ok
       
          QUICK CLUSTER
       
          440: QUICK CLUSTER                                  ok
       
   
    Wonderful!
   
    Now that we've got a working autotest we can start optimising the
    algorithm without fear of unwittingly breaking it.  This is where it
    starts to get interesting!
   
    The biggest problem I see at the moment is that it is allocating
    potentially huge blocks of memory.  PSPP (and spss) is designed to
    cope with extremely large amounts of data.  So we should try to
    accommodate this in the design of quick-cluster.
   
    So imagine for the moment that somebody wants to perform
    quick-cluster 500 variables (m), and 10,000,000 cases (n).  PSPP would
    run out of memory before it even starts to do the calculation.
    There's a number of places where we should eliminate these memory
    allocations, but let's deal with them one at a time.
   
    Firstly, I see that the entire dataset is copied not once, but twice:
    Once before the call to kmeans_create where you copy the casereader
    into the array.  And then again inside kmeans_create where you copy
    the array into a gsl_matrix.  I did some profiling on my system and
    this  accounted for a significant proportion of the time taken to
    perform QUICK CLUSTER.
   
    So,  I suggest that we start by changing the signature of
    kmeans_create from
   
    struct Kmeans *
    kmeans_create (double *data, int n, int m, int ngroups, int maxiter);
   
    struct Kmeans *
    kmeans_create (struct casereader *cr, int m, int ngroups, int maxiter);
   
    This will save at least one iteration through the data and one big
    chunk of memory.  Later we can improve on this.
   
    J'
   
   
   
   
   
    --
    PGP Public key ID: 1024D/2DE827B3
    fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
    See http://pgp.mit.edu or any PGP keyserver for public key.
   
   
         
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.


Attachment: quick-cluster.c
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]