[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: K-Means Clustering
From: |
John Darrington |
Subject: |
Re: K-Means Clustering |
Date: |
Thu, 10 Mar 2011 11:07:59 +0000 |
User-agent: |
Mutt/1.5.18 (2008-05-17) |
Hello Mehmet,
I think the code is well written and is something that we could use in PSPP.
Like you say, perhaps it doesn't lend itself to a generalized clustering
algorithm,
but I think it would nicely satisfy the needs of the QUICK CLUSTER command.
Would you like to try implementing such a command in PSPP ? Ask on the
mailing list
if you need help getting started.
Regards,
John
On Wed, Mar 09, 2011 at 05:51:44AM -0800, Mehmet Hakan Satman wrote:
Hi John,
I am sending the .h, the .c and the test.c file as attachments.
My implementation clusters a random data set with number of cases 10000,
the
number of parameters 10 and the number of clusters 10 in average 0.45
seconds.
Algorithm starts with the randomly selected cluster centers and in each
single
steps these centers are changed. Distances between these centers and the
observations must be calculated in every step. I am not sure if it can be
implemented a "create-one-and-use-anywhere" distances in this algorithm.
In
hierarchial cluster algorithms such as "single linkage clustering", it is
possible to build a distance matrix and use it during the clustering.
Best Regards.
Hakan.
________________________________
From: John Darrington <address@hidden>
To: Mehmet Hakan Satman <address@hidden>
Cc: address@hidden
Sent: Wed, March 9, 2011 3:21:59 PM
Subject: Re: K-Means Clustering
Hi Mehmet!
As I mentioned, the CLUSTER command is soemthing which I think it would be
great to support.
One issue with clustering is its memory complexity. It requires O(n^2)
where
n is the number of cases being clustered.
Have you tested your algorithm with large numbers of cases?
Maybe Ben has some ideas how an efficient distance matrix can be
implemented
in PSPP (maybe sparse-array.c can help?) .
In any case, I'd be interested to see your code, and the results of your
comparisons. Can you post them somewhere?
J'
On Tue, Mar 08, 2011 at 05:56:37AM -0800, Mehmet Hakan Satman wrote:
hi everybody,
I am interested in PSPP and i read about something about the needs
for
developing some functionality.
I implemented a k-means clustering library using the GNU scientific
library and
sent an informative e-mail to John. He suggested me to join this
group and
share
my ideas with the stuff.
I compared the results with SPSS outputs. The analysis of variance
table is
not
completed but we may add this feature.
I would be glad to integrate something to PSPP and work with you.
What do you think about this?
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
/*
kmeans - K-Means Clustering Library for the GNU PSPP (People Should
Prefer PSPP) project.
Copyright (C) 2011 Dr.Mehmet Hakan Satman <address@hidden>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_statistics.h>
/*
Struct KMeans:
Holds all of the information for the functions.
*/
struct Kmeans {
gsl_matrix *data; //User Data (Given by the user)
gsl_matrix *centers; //Centers for groups
gsl_vector_int *index; //Integer values from zero to ngroups.
Shows group of an observation.
gsl_vector *v1,*v2,*v3; //Temporary vector for program. Do not use.
int ngroups; //Number of group. (Given by the user)
int n; //Number of observations. (Given by the
user)
int m; //Number of observations. (Given by the
user)
int maxiter; //Maximum number of iterations (Given by
the user)
int lastiter; //Show at which iteration it found the
solution.
double *weights; //Double values for handling weights for
program use.
};
/*
Creates a Kmeans structure for a given data 'data', number of observations
'n',
number of columns 'm', number of groups 'ngroups' (it is usually called
'k') and
number of maximum iterations 'maxiter'.
*/
struct Kmeans* kmeans_create(double* data, int n, int m, int ngroups, int
maxiter);
/*
Randomly chooses centers of 'ngroup' groups.
These centers are initial and will be changed iteratively.
*/
void kmeans_randomize_centers(struct Kmeans *kmeans);
/*
Prints the given Kmeans structure
*/
void kmeans_print(struct Kmeans* kmeans);
/*
Calculates the squared euclidean distance between vector v1 and v2
*/
double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2);
/*
Calculates and returns the number of elements contained in the
specific group.
*/
int kmeans_num_elements_group(struct Kmeans *kmeans, int group);
/*
Calculates group centers. Those centers are calculated using iteration
and they are usually different from the initial centers. Recalculation
process remains while the last two solutions are not equal.
*/
void kmeans_recalculate_centers(struct Kmeans *kmeans);
/*
Constructs the index variable. This variable shows the current
group of the each single observation.
*/
void kmeans_calculate_indexes(struct Kmeans *kmeans);
/*
Checks if the last two index variables are equal. If they are equal,
algorithm can not find a better classification anymore and stops.
*/
int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old);
/*
This is the main method of the algorithm.
*/
void kmeans_cluster(struct Kmeans *kmeans);
#include "kmeans.h"
struct Kmeans*
kmeans_create(double* data,
int n,
int m,
int ngroups,
int maxiter){
int i,j;
struct Kmeans *k = (struct Kmeans*) malloc(sizeof(struct Kmeans));
k->data=gsl_matrix_alloc(n,m);
k->centers=gsl_matrix_alloc(ngroups, m);
k->ngroups=ngroups;
k->index=gsl_vector_int_alloc(n);
k->n=n;
k->m=m;
k->maxiter=maxiter;
k->lastiter=0;
for (i=0;i<n;i++){
for(j=0;j<m;j++){
gsl_matrix_set(k->data, i, j, data[i * m +j]);
}
}
k->weights = (double*)malloc(sizeof(double) * k->index->size);
k->v1 = gsl_vector_alloc(k->centers->size2);
k->v2 = gsl_vector_alloc(k->centers->size2);
k->v3 = gsl_vector_alloc(k->n);
return(k);
}
void
kmeans_randomize_centers(struct Kmeans *kmeans){
int i,j;
double min=0,max=0;
min=gsl_matrix_min(kmeans->data);
max=gsl_matrix_max(kmeans->data);
gsl_matrix_minmax(kmeans->data, &min, &max);
for (i=0;i<kmeans->centers->size1;i++){
for (j=0;j<kmeans->centers->size2; j++){
gsl_matrix_set(kmeans->centers, i, j, min +
(((double)rand())/RAND_MAX)*(max-min));
}
}
}
void
kmeans_print(struct Kmeans* kmeans){
int i,j;
printf("Number of groups: %d\n",kmeans->ngroups);
printf("Centers:\n");
for (i=0;i<kmeans->centers->size1;i++) {
for (j=0;j<kmeans->centers->size2;j++){
printf("%f ",gsl_matrix_get(kmeans->centers, i,j));
}
printf("\n");
}
printf("Index:\n");
for (i=0;i<kmeans->n;i++){
printf("%d ",gsl_vector_int_get(kmeans->index, i));
}
printf("\nLast iter: %d\n",kmeans->lastiter);
}
void print_matrix(gsl_matrix *m){
int i,j;
for (i=0;i<m->size1;i++){
for (j=0;j<m->size2;j++){
printf("%f ",m->data[i * m->size2 + j]);
}
printf("\n");
}
}
double
kmeans_euclidean_distance(gsl_vector *v1,
gsl_vector *v2){
double result=0.0;
double val;
int i;
for (i=0;i<v1->size;i++){
val=v1->data[i] - v2->data[i];
result+=val*val;
}
return(result);
}
int
kmeans_num_elements_group(struct Kmeans *kmeans, int group){
int total=0;
int i;
for (i=0;i<kmeans->n;i++){
if(gsl_vector_int_get(kmeans->index,i)==group) total++;
}
return(total);
}
void
kmeans_recalculate_centers(struct Kmeans *kmeans){
int i,j,h;
int elm;
double mean;
gsl_vector *v1=kmeans->v3;
for (i=0;i<kmeans->ngroups;i++){
elm=kmeans_num_elements_group(kmeans,i);
for (j=0;j<kmeans->index->size;j++){
if(gsl_vector_int_get(kmeans->index,j)==i){
kmeans->weights[j]=1.0;
}else{
kmeans->weights[j]=0.0;
}
}
for (h=0;h<kmeans->m;h++){
gsl_matrix_get_col(v1,kmeans->data, h);
mean=gsl_stats_wmean(kmeans->weights, 1, v1->data ,1,
v1->size);
gsl_matrix_set(kmeans->centers, i,h, mean);
}
}
}
void
kmeans_calculate_indexes(struct Kmeans *kmeans){
double dist;
double mindist;
int bestindex=0;
int i,j;
gsl_vector *v1 = kmeans->v1;
gsl_vector *v2 = kmeans->v2;
for (i=0;i<kmeans->n; i++){
mindist=INFINITY;
gsl_matrix_get_row(v1, kmeans->data, i);
for (j=0;j<kmeans->ngroups; j++){
gsl_matrix_get_row(v2, kmeans->centers,j);
dist=kmeans_euclidean_distance(v1,v2);
if(dist<mindist){
mindist=dist;
bestindex=j;
}
}
gsl_vector_int_set(kmeans->index,i,bestindex);
}
}
int
kmeans_check_converge(gsl_vector_int *current,
gsl_vector_int *old){
int i;
int total=0;
for (i=0;i<current->size;i++) {
if(current->data[i] == old->data[i]) total++;
old->data[i]=current->data[i];
}
return(current->size-total);
}
gsl_matrix*
kmeans_getGroup(struct Kmeans *kmeans, int index){
int i;
int j=0;
int elm=kmeans_num_elements_group(kmeans,index);
gsl_matrix *agroup=gsl_matrix_alloc(elm, kmeans->data->size2);
gsl_vector *v1=gsl_vector_alloc(kmeans->data->size2);
for(i=0;i<kmeans->data->size1;i++){
if(kmeans->index->data[i]==index){
gsl_matrix_get_row(v1, kmeans->data, i);
gsl_matrix_set_row(agroup, j, v1);
j++;
}
}
return(agroup);
}
void
kmeans_cluster(struct Kmeans *kmeans){
int i;
gsl_vector_int *oldindex = gsl_vector_int_alloc(kmeans->index->size);
kmeans_randomize_centers(kmeans);
for (i=0;i<kmeans->maxiter;i++){
kmeans->lastiter=i;
kmeans_calculate_indexes(kmeans);
kmeans_recalculate_centers(kmeans);
if(kmeans_check_converge(kmeans->index, oldindex)==0) break;
}
}
#include "kmeans.h"
int
main()
{
srand(12345);
int i,j;
double val;
int n=10000; //Number of cases
int m=10; //Number of variables
int k=10; //Number of groups
int maxiter=100;
double * data = (double*) malloc(n* m * sizeof(double));
for (i=0;i<n;i++){
for (j=0;j<m;j++){
val= ((float)rand())/(float)RAND_MAX;
data[i * m + j]=val*(i+j);
}
}
struct Kmeans *km = kmeans_create(data, n, m ,k, maxiter);
kmeans_cluster(km);
//kmeans_print(km);
return 0;
}
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
signature.asc
Description: Digital signature
- K-Means Clustering, Mehmet Hakan Satman, 2011/03/08
- Re: K-Means Clustering, John Darrington, 2011/03/09
- Re: K-Means Clustering, Duane Currie, 2011/03/09
- Re: K-Means Clustering, Ben Pfaff, 2011/03/09
- Message not available
- Re: K-Means Clustering,
John Darrington <=
- Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/10
- Re: K-Means Clustering, John Darrington, 2011/03/10
- Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/10
- Re: K-Means Clustering, Mehmet Hakan Satman, 2011/03/11
- Re: K-Means Clustering, John Darrington, 2011/03/12