Re: K-Means Clustering

pspp-dev

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: K-Means Clustering

From:	John Darrington
Subject:	Re: K-Means Clustering
Date:	Thu, 10 Mar 2011 11:07:59 +0000
User-agent:	Mutt/1.5.18 (2008-05-17)

Hello Mehmet,

I think the code is well written and is something that we could use in PSPP.

Like you say, perhaps it doesn't lend itself to a generalized clustering 
algorithm,
but I think it would nicely satisfy the needs of the QUICK CLUSTER command.

Would you like to try implementing such a command in PSPP ?   Ask on the 
mailing list
if you need help getting started.

Regards, 


John

On Wed, Mar 09, 2011 at 05:51:44AM -0800, Mehmet Hakan Satman wrote:
     Hi John,
     
     I am sending the .h, the .c and the test.c file as attachments. 
     
     
     My implementation clusters a random data set with number of cases 10000, 
the 
     number of parameters 10 and the number of clusters 10 in average 0.45 
seconds. 
     
     
     Algorithm starts with the randomly selected cluster centers and in each 
single 
     steps these centers are changed. Distances between these centers and the 
     observations must be calculated in every step. I am not sure if it can be 
     implemented a "create-one-and-use-anywhere" distances in this algorithm. 
In 
     hierarchial cluster algorithms such as "single linkage clustering", it is 
     possible to build a distance matrix and use it during the clustering.
     
     Best Regards.
     
     Hakan.
     
     
     
     
     ________________________________
     From: John Darrington <address@hidden>
     To: Mehmet Hakan Satman <address@hidden>
     Cc: address@hidden
     Sent: Wed, March 9, 2011 3:21:59 PM
     Subject: Re: K-Means Clustering
     
     Hi Mehmet!
     
     As I mentioned, the CLUSTER command is soemthing which I think it would be
     great to support.
     
     One issue with clustering is its memory complexity. It requires O(n^2) 
where
     n is the number of cases being clustered.
     Have you tested your algorithm with large numbers of cases?
     
     Maybe Ben has some ideas how an efficient distance matrix can be  
implemented
     in PSPP (maybe sparse-array.c can help?) .
     
     In any case, I'd be interested to see your code, and the results of your 
     comparisons.  Can you post them somewhere?
     
     
     J'
     
     On Tue, Mar 08, 2011 at 05:56:37AM -0800, Mehmet Hakan Satman wrote:
          hi everybody,
         
          I am interested in PSPP and i read about something about the needs 
for 
          developing some functionality.
          I implemented a k-means clustering  library using the GNU scientific 
     library and 
     
          sent an informative e-mail to John. He suggested me to join this 
group and 
     share 
     
          my ideas with the stuff. 
         
         
          I compared the results with SPSS outputs. The analysis of variance 
table is 
     not 
     
          completed but we may add this feature.
         
          I would be glad to integrate something to PSPP and work with you.
         
          What do you think about this?
         
         
               
     -- 
     PGP Public key ID: 1024D/2DE827B3 
     fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
     See http://pgp.mit.edu or any PGP keyserver for public key.
     
     
           
     /*
         kmeans - K-Means Clustering Library for the GNU PSPP (People Should 
Prefer PSPP) project.
         Copyright (C) 2011  Dr.Mehmet Hakan Satman <address@hidden>
     
         This program is free software: you can redistribute it and/or modify
         it under the terms of the GNU General Public License as published by
         the Free Software Foundation, either version 3 of the License, or
         (at your option) any later version.
     
         This program is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
         GNU General Public License for more details.
     
         You should have received a copy of the GNU General Public License
         along with this program.  If not, see <http://www.gnu.org/licenses/>.
     */
     #include <stdio.h>
     #include <math.h>
     #include <gsl/gsl_matrix.h>
     #include <gsl/gsl_statistics.h>
     
     
     /*
     Struct KMeans:
     Holds all of the information for the functions.
     */
     struct Kmeans {
         gsl_matrix *data;           //User Data (Given by the user)
         gsl_matrix *centers;        //Centers for groups
         gsl_vector_int *index;      //Integer values from zero to ngroups. 
Shows group of an observation.
         gsl_vector *v1,*v2,*v3;     //Temporary vector for program. Do not use.
         int ngroups;                //Number of group. (Given by the user)
         int n;                      //Number of observations. (Given by the 
user)
         int m;                      //Number of observations. (Given by the 
user)
         int maxiter;                //Maximum number of iterations (Given by 
the user)
         int lastiter;               //Show at which iteration it found the 
solution.
         double *weights;            //Double values for handling weights for 
program use.
     };
     
     /*
     Creates a Kmeans structure for a given data 'data', number of observations 
'n',
     number of columns 'm', number of groups 'ngroups' (it is usually called 
'k') and
     number of maximum iterations 'maxiter'.
     */
     struct Kmeans* kmeans_create(double* data, int n, int m, int ngroups, int 
maxiter);
     
     
     /*
     Randomly chooses centers of 'ngroup' groups.
     These centers are initial and will be changed iteratively.
     */
     void kmeans_randomize_centers(struct Kmeans *kmeans);
     
     /*
     Prints the given Kmeans structure
     */
     void kmeans_print(struct Kmeans* kmeans);
     
     
     /*
     Calculates the squared euclidean distance between vector v1 and v2
     */
     double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2);
     
     
     /*
     Calculates and returns the number of elements contained in the
     specific group.
     */
     int kmeans_num_elements_group(struct Kmeans *kmeans, int group);
     
     
     /*
     Calculates group centers. Those centers are calculated using iteration
     and they are usually different from the initial centers. Recalculation
     process remains while the last two solutions are not equal.
     */
     void kmeans_recalculate_centers(struct Kmeans *kmeans);
     
     
     /*
     Constructs the index variable. This variable shows the current
     group of the each single observation.
     */
     void kmeans_calculate_indexes(struct Kmeans *kmeans);
     
     
     /*
     Checks if the last two index variables are equal. If they are equal,
     algorithm can not find a better classification anymore and stops.
     */
     int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old);
     
     
     /*
     This is the main method of the algorithm.
     */
     void kmeans_cluster(struct Kmeans *kmeans);
     
     
     

     #include "kmeans.h"
     
     
     struct Kmeans*
     kmeans_create(double* data,
                   int n,
                   int m,
                   int ngroups,
                   int maxiter){
         int i,j;
         struct Kmeans *k = (struct Kmeans*) malloc(sizeof(struct Kmeans));
         k->data=gsl_matrix_alloc(n,m);
         k->centers=gsl_matrix_alloc(ngroups, m);
         k->ngroups=ngroups;
         k->index=gsl_vector_int_alloc(n);
         k->n=n;
         k->m=m;
         k->maxiter=maxiter;
         k->lastiter=0;
         for (i=0;i<n;i++){
             for(j=0;j<m;j++){
                 gsl_matrix_set(k->data, i, j, data[i * m +j]);
             }
         }
         k->weights = (double*)malloc(sizeof(double) * k->index->size);
         k->v1 = gsl_vector_alloc(k->centers->size2);
         k->v2 = gsl_vector_alloc(k->centers->size2);
         k->v3 = gsl_vector_alloc(k->n);
         return(k);
     }
     
     void
     kmeans_randomize_centers(struct Kmeans *kmeans){
         int i,j;
         double min=0,max=0;
         min=gsl_matrix_min(kmeans->data);
         max=gsl_matrix_max(kmeans->data);
         gsl_matrix_minmax(kmeans->data, &min, &max);
         for (i=0;i<kmeans->centers->size1;i++){
             for (j=0;j<kmeans->centers->size2; j++){
                 gsl_matrix_set(kmeans->centers, i, j, min + 
(((double)rand())/RAND_MAX)*(max-min));
             }
         }
     }
     
     
     
     void
     kmeans_print(struct Kmeans* kmeans){
         int i,j;
         printf("Number of groups: %d\n",kmeans->ngroups);
         printf("Centers:\n");
         for (i=0;i<kmeans->centers->size1;i++) {
             for (j=0;j<kmeans->centers->size2;j++){
                 printf("%f ",gsl_matrix_get(kmeans->centers, i,j));
             }
             printf("\n");
         }
     
         printf("Index:\n");
         for (i=0;i<kmeans->n;i++){
             printf("%d ",gsl_vector_int_get(kmeans->index, i));
         }
         printf("\nLast iter: %d\n",kmeans->lastiter);
     }
     
     
     void print_matrix(gsl_matrix *m){
         int i,j;
         for (i=0;i<m->size1;i++){
             for (j=0;j<m->size2;j++){
                 printf("%f ",m->data[i * m->size2 + j]);
             }
             printf("\n");
         }
     }
     
     
     double
     kmeans_euclidean_distance(gsl_vector *v1,
                               gsl_vector *v2){
         double result=0.0;
         double val;
         int i;
         for (i=0;i<v1->size;i++){
             val=v1->data[i] - v2->data[i];
             result+=val*val;
         }
         return(result);
     }
     
     
     
     int
     kmeans_num_elements_group(struct Kmeans *kmeans, int group){
         int total=0;
         int i;
         for (i=0;i<kmeans->n;i++){
             if(gsl_vector_int_get(kmeans->index,i)==group) total++;
         }
         return(total);
     }
     
     
     void
     kmeans_recalculate_centers(struct Kmeans *kmeans){
         int i,j,h;
         int elm;
         double mean;
         gsl_vector *v1=kmeans->v3;
     
         for (i=0;i<kmeans->ngroups;i++){
             elm=kmeans_num_elements_group(kmeans,i);
             for (j=0;j<kmeans->index->size;j++){
                 if(gsl_vector_int_get(kmeans->index,j)==i){
                     kmeans->weights[j]=1.0;
                 }else{
                     kmeans->weights[j]=0.0;
                 }
             }
     
             for (h=0;h<kmeans->m;h++){
                 gsl_matrix_get_col(v1,kmeans->data, h);
                 mean=gsl_stats_wmean(kmeans->weights, 1, v1->data ,1, 
v1->size);
                 gsl_matrix_set(kmeans->centers, i,h, mean);
             }
         }
     }
     
     
     void
     kmeans_calculate_indexes(struct Kmeans *kmeans){
         double dist;
         double mindist;
         int bestindex=0;
         int i,j;
         gsl_vector *v1 = kmeans->v1;
         gsl_vector *v2 = kmeans->v2;
     
         for (i=0;i<kmeans->n; i++){
             mindist=INFINITY;
             gsl_matrix_get_row(v1, kmeans->data, i);
             for (j=0;j<kmeans->ngroups; j++){
                 gsl_matrix_get_row(v2, kmeans->centers,j);
                 dist=kmeans_euclidean_distance(v1,v2);
                 if(dist<mindist){
                     mindist=dist;
                     bestindex=j;
                 }
             }
             gsl_vector_int_set(kmeans->index,i,bestindex);
         }
     }
     
     
     
     int
     kmeans_check_converge(gsl_vector_int *current,
                           gsl_vector_int *old){
         int i;
         int total=0;
         for (i=0;i<current->size;i++) {
             if(current->data[i] == old->data[i]) total++;
             old->data[i]=current->data[i];
         }
         return(current->size-total);
     }
     
     
     
     gsl_matrix*
     kmeans_getGroup(struct Kmeans *kmeans, int index){
             int i;
             int j=0;
             int elm=kmeans_num_elements_group(kmeans,index);
             gsl_matrix *agroup=gsl_matrix_alloc(elm, kmeans->data->size2);
             gsl_vector *v1=gsl_vector_alloc(kmeans->data->size2);
             for(i=0;i<kmeans->data->size1;i++){
                 if(kmeans->index->data[i]==index){
                     gsl_matrix_get_row(v1, kmeans->data, i);
                     gsl_matrix_set_row(agroup, j, v1);
                     j++;
                 }
             }
             return(agroup);
     }
     
     
     
     void
     kmeans_cluster(struct Kmeans *kmeans){
         int i;
         gsl_vector_int *oldindex = gsl_vector_int_alloc(kmeans->index->size);
         kmeans_randomize_centers(kmeans);
     
         for (i=0;i<kmeans->maxiter;i++){
             kmeans->lastiter=i;
             kmeans_calculate_indexes(kmeans);
             kmeans_recalculate_centers(kmeans);
             if(kmeans_check_converge(kmeans->index, oldindex)==0) break;
         }
     
     }
     
     

     #include "kmeans.h"
     
     int
     main()
     {
         srand(12345);
         int i,j;
         double val;
         int n=10000; //Number of cases
         int m=10;    //Number of variables
         int k=10;    //Number of groups
         int maxiter=100;
         double * data = (double*) malloc(n* m * sizeof(double));
         for (i=0;i<n;i++){
             for (j=0;j<m;j++){
                 val= ((float)rand())/(float)RAND_MAX;
                 data[i * m + j]=val*(i+j);
             }
         }
     
         struct Kmeans *km = kmeans_create(data, n, m ,k, maxiter);
         kmeans_cluster(km);
         //kmeans_print(km);
     
         return 0;
     }


-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

signature.asc
Description: Digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

K-Means Clustering, Mehmet Hakan Satman, 2011/03/08
- Re: K-Means Clustering, John Darrington, 2011/03/09
  - Re: K-Means Clustering, Duane Currie, 2011/03/09
  - Re: K-Means Clustering, Ben Pfaff, 2011/03/09
    - Re: K-Means Clustering, John Darrington, 2011/03/10
  - Message not available
    - Re: K-Means Clustering, John Darrington <=
    - Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/10
    - Re: K-Means Clustering, John Darrington, 2011/03/10
    - Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/10
    - Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/11
    - Re: K-Means Clustering, John Darrington, 2011/03/12
- Re: K-Means Clustering, John Darrington, 2011/03/13
- Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/14
  - Re: K-Means Clustering, John Darrington, 2011/03/15
    - Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/15
    - Re: K-Means Clustering, John Darrington, 2011/03/15

Prev by Date: Re: K-Means Clustering
Next by Date: Re: K-Means Clustering
Previous by thread: Re: K-Means Clustering
Next by thread: Re: K-Means Clustering
Index(es):
- Date
- Thread