Re: K-means cluster center order

From: John Darrington
Subject: Re: K-means cluster center order
Date: Sun, 31 May 2015 03:43:31 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Sat, May 30, 2015 at 05:38:26PM -0500, Alan Mead wrote:
     I've uploaded a patch (against quick-cluster.c in 0.8.4)  that adds
     support for the /PRINT=CLUSTER subcommand for k-means clustering to show
     the cluster membership for each case:
     But this patch has a remaining bug.  The cluster centers are saved in
     some indirect fashion that I cannot understand. 
     In the patch, I report the cluster number returned by
     kmeans_get_nearest_group() but these cluster numbers are systematically
     different from the reported cluster numbers.  That is, the centers are
     stored internally in arbitrary order (as they are discovered, I'd guess)
     and for purposes of reporting, they are numbered.  I cannot replicate
     that output numbering.
     For example, in the attached output, the centers were (10,10),
     (-10,-10), and (-10,10) and 20 cases were generated for each cluster. 
     The CLUSTER command reports 1 = (-10.23, -10.01), 2 = (-10.19, 10.18)
     and 3=(10.27, 9.82) so the first 20 cases should be members of cluster
     3, the next 20 from cluster 1 and the last 20 from cluster 2.  But using
     the results from kmeans_get_nearest_group(), the clusters are reported
     as 1, then 3, then 2.
     I don't understand how I can fix this.  I think I need to use
     kmeans->group_order which is a "gsl_permutation" but this is beyond my
     familiarity with C and GSL.

A gsl_permutation is nothing more than an array of ints.  See the GSL
documentation.

So perhaps you need a line like:

  clust = kmeans->group_order->data[clust];

or, better:

  clust = gsl_permutation_get (kmeans->group_order, clust);

which does the same thing and is cleaner.
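
To make the "array of ints" point concrete, here is a minimal sketch that
does not depend on GSL.  The struct perm type and the perm_get and
translate_cluster names are hypothetical stand-ins, not PSPP or GSL
identifiers; the only thing assumed about a gsl_permutation is that it
carries a length and an array of indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a permutation object: a length plus an
   array of indices.  Not the real GSL type. */
struct perm
{
  size_t size;
  size_t *data;
};

/* Bounds-checked lookup.  Returns the same value as p->data[i], but
   refuses an out-of-range index instead of reading past the array. */
size_t
perm_get (const struct perm *p, size_t i)
{
  assert (i < p->size);
  return p->data[i];
}

/* Translate a cluster index through the ordering permutation. */
size_t
translate_cluster (const struct perm *order, size_t clust)
{
  return perm_get (order, clust);
}
```

Going through an accessor rather than the raw data member keeps the
range check in one place, which is the main reason to prefer the second
form.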

     It's also possible that kmeans_order_groups() (which is called at the
     beginning of quick_cluster_show_results()) is not working properly.

It looks as if kmeans_order_groups simply initialises this gsl_permutation
such that the cluster reporting order follows the value of the center of the
first variable.  That is to say,  it ensures that in the output table:

Final Cluster Centers
# |      Cluster      #
# +------+------+-----#
# |   1  |   2  |  3  #
# +------+------+-----#
#x|-10.23|-10.19|10.27#
#y|-10.01| 10.18| 9.81#

the values in the x row are monotonically increasing.
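
As a rough illustration of that ordering, the sketch below builds a sort
index over the first variable of each center and then inverts it.  It is
plain C rather than the PSPP code: order_by_first_variable and
invert_order are hypothetical names, and the direction of the mapping is
an assumption.  If group_order holds a sort index (position in the
report -> internal cluster), then translating an internal cluster number
into a reported one needs the inverse permutation.  I have not checked
which of the two quick-cluster.c stores, so treat this as something to
verify.

```c
#include <assert.h>
#include <stddef.h>

/* Fill order[] so that order[r] is the internal index of the center
   with the (r+1)-th smallest first-variable value.  x[c] is the
   first-variable value of internal center c.  Insertion sort, which
   is fine for a handful of clusters. */
void
order_by_first_variable (const double *x, size_t k, size_t *order)
{
  for (size_t i = 0; i < k; i++)
    {
      size_t j = i;
      while (j > 0 && x[order[j - 1]] > x[i])
        {
          order[j] = order[j - 1];
          j--;
        }
      order[j] = i;
    }
}

/* Invert the sort index: rank[c] is the 0-based position of internal
   center c in the report, so the reported number is rank[c] + 1. */
void
invert_order (const size_t *order, size_t k, size_t *rank)
{
  for (size_t r = 0; r < k; r++)
    {
      assert (order[r] < k);
      rank[order[r]] = r;
    }
}
```

With the x values from Alan's example taken in the internal order his
output implies (10.27, -10.19, -10.23), both order and rank come out as
{2, 1, 0}.  That permutation is its own inverse, so either direction
yields the expected 3, 1, 2 for his data; other data could tell the two
apart.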

I hope this helps.


PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See any PGP keyserver for public key.
