K-means cluster center order

From: Alan Mead
Subject: K-means cluster center order
Date: Sat, 30 May 2015 17:38:26 -0500
I've uploaded a patch (against quick-cluster.c in 0.8.4) that adds
support for the /PRINT=CLUSTER subcommand for k-means clustering to show
the cluster membership for each case:

But this patch has a remaining bug.  The cluster centers are saved in
some indirect fashion that I cannot understand.

In the patch, I report the cluster number returned by
kmeans_get_nearest_group() but these cluster numbers are systematically
different from the reported cluster numbers.  That is, the centers are
stored internally in arbitrary order (as they are discovered, I'd guess)
and for purposes of reporting, they are numbered.  I cannot replicate
that output numbering.

For example, in the attached output, the centers were (10,10),
(-10,-10), and (-10,10) and 20 cases were generated for each cluster.
The CLUSTER command reports 1 = (-10.23, -10.01), 2 = (-10.19, 10.18)
and 3 = (10.27, 9.82) so the first 20 cases should be members of cluster
3, the next 20 of cluster 1 and the last 20 of cluster 2.  But using
the results from kmeans_get_nearest_group(), the clusters are reported
as 1, then 3, then 2.

I don't understand how I can fix this.  I think I need to use
kmeans->group_order which is a "gsl_permutation" but this is beyond my
familiarity with C and GSL.
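
A minimal sketch of the kind of mapping that seems to be needed, using a
plain array in place of the gsl_permutation so it compiles without GSL.
The assumption here (not confirmed from the source) is that entry r of
group_order holds the internal index of the center that is reported as
cluster r+1.  The names invert_order() and reported_cluster() are
hypothetical and do not exist in quick-cluster.c.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: order[r] is the internal index of the center that is
   reported as cluster r + 1.  This stands in for the contents of
   kmeans->group_order; the real layout may differ. */

/* Build the inverse mapping: inverse[internal index] = reporting position. */
void
invert_order (const size_t *order, size_t *inverse, size_t n)
{
  for (size_t r = 0; r < n; r++)
    {
      assert (order[r] < n);
      inverse[order[r]] = r;
    }
}

/* Translate an internal group index (such as one returned by a
   nearest-group lookup) into the 1-based cluster number used in the
   report. */
size_t
reported_cluster (const size_t *inverse, size_t n, size_t internal)
{
  assert (internal < n);
  return inverse[internal] + 1;
}
```

With a real gsl_permutation, gsl_permutation_get (p, r) reads entry r and
gsl_permutation_inverse (inv, p) fills inv with the inverse, so the loop
above would not be needed.  If group_order turns out to map the other way
(internal index to reporting position), it could be used directly without
inverting.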

It's also possible that kmeans_order_groups() (which is called at the
beginning of quick_cluster_show_results()) is not working properly.

Any advice?



