Re: K-means cluster center order
From: John Darrington
Subject: Re: K-means cluster center order
Date: Sun, 31 May 2015 03:43:31 +0200
User-agent: Mutt/1.5.21 (2010-09-15)
On Sat, May 30, 2015 at 05:38:26PM -0500, Alan Mead wrote:
> I've uploaded a patch (against quick-cluster.c in 0.8.4) that adds
> support for the /PRINT=CLUSTER subcommand for k-means clustering to show
> the cluster membership for each case:
> https://savannah.gnu.org/bugs/index.php?41019
>
> But this patch has a remaining bug. The cluster centers are saved in
> some indirect fashion that I cannot understand.
>
> In the patch, I report the cluster number returned by
> kmeans_get_nearest_group() but these cluster numbers are systematically
> different from the reported cluster numbers. That is, the centers are
> stored internally in arbitrary order (as they are discovered, I'd guess)
> and for purposes of reporting, they are numbered. I cannot replicate
> that output numbering.
>
> For example, in the attached output, the centers were (10,10),
> (-10,-10), and (-10,10) and 20 cases were generated for each cluster.
> The CLUSTER command reports 1 = (-10.23, -10.01), 2 = (-10.19, 10.18)
> and 3 = (10.27, 9.82), so the first 20 cases should be members of cluster
> 3, the next 20 of cluster 1 and the last 20 of cluster 2. But using
> the results from kmeans_get_nearest_group(), the clusters are reported
> as 1, then 3, then 2.
>
> I don't understand how I can fix this. I think I need to use
> kmeans->group_order which is a "gsl_permutation" but this is beyond my
> familiarity with C and GSL.

A gsl_permutation is little more than an array of indices: a struct
holding a size and a size_t array named data. See the GSL documentation
here:
https://www.gnu.org/software/gsl/manual/html_node/Permutations.html
So perhaps you need a line like:
    clust = kmeans->group_order->data[clust];
or
    clust = gsl_permutation_get (kmeans->group_order, clust);
which does the same thing and is cleaner.
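
One thing worth checking is the direction of the lookup. Assuming group_order stores, for each output column, the internal index of the cluster shown there (an assumption, not something confirmed in this thread), indexing it with an internal cluster number would answer the wrong question, and the inverse lookup would be needed instead. The sketch below uses a plain size_t array as a stand-in for the permutation's data; the function names are made up for illustration and are not part of PSPP or GSL.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the data array of a gsl_permutation.
   Assumption: order[column] holds the internal index of the cluster
   shown in that output column (both 0-based). */

/* Forward lookup: which internal cluster is shown in a given column? */
size_t internal_for_column(const size_t *order, size_t n, size_t column)
{
  assert(column < n);
  return order[column];
}

/* Inverse lookup: in which column is a given internal cluster shown?
   Returns n if the internal index does not occur in order. */
size_t column_for_internal(const size_t *order, size_t n, size_t internal)
{
  for (size_t column = 0; column < n; column++)
    if (order[column] == internal)
      return column;
  return n;
}
```

With order = {2, 0, 1}, internal cluster 0 is shown in column 1, so its reported (1-based) number would be 2, whereas the forward lookup order[0] gives 2, i.e. reported number 3. GSL also provides gsl_permutation_inverse() for computing the whole inverse permutation in one call.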
It's also possible that kmeans_order_groups() (which is called at the
beginning of quick_cluster_show_results()) is not working properly.
It looks as if kmeans_order_groups simply initialises this gsl_permutation
so that the clusters are reported in ascending order of their center's value
on the first variable. That is to say, it ensures that in the output table:
Final Cluster Centers
#=#===================#
# |      Cluster      #
# +------+------+-----#
# |   1  |   2  |  3  #
# +------+------+-----#
#x|-10.23|-10.19|10.27#
#y|-10.01| 10.18| 9.81#
#=#======#======#=====#
the values in the x row are monotonically increasing.
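
As a rough illustration of that kind of ordering, here is a minimal sketch in plain C that sorts cluster indices by the value of the first variable. It is not the PSPP code: order_by_first_variable and struct keyed_index are hypothetical names, and a plain size_t array stands in for the gsl_permutation. With GSL itself, gsl_sort_vector_index() computes the same kind of index permutation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pairs an internal cluster index with its center's first-variable value. */
struct keyed_index
{
  double key;
  size_t index;
};

/* qsort comparator: ascending by key. */
static int compare_keys(const void *a, const void *b)
{
  const struct keyed_index *ka = a;
  const struct keyed_index *kb = b;
  return (ka->key > kb->key) - (ka->key < kb->key);
}

/* Fills order[0..n-1] so that first_var[order[0]] <= first_var[order[1]]
   <= ... Returns 0 on success, -1 if memory could not be allocated. */
int order_by_first_variable(const double *first_var, size_t n, size_t *order)
{
  assert(n == 0 || (first_var != NULL && order != NULL));
  if (n == 0)
    return 0;

  struct keyed_index *tmp = malloc(n * sizeof *tmp);
  if (tmp == NULL)
    return -1;

  for (size_t i = 0; i < n; i++)
    {
      tmp[i].key = first_var[i];
      tmp[i].index = i;
    }
  qsort(tmp, n, sizeof *tmp, compare_keys);
  for (size_t i = 0; i < n; i++)
    order[i] = tmp[i].index;

  free(tmp);
  return 0;
}
```

If the centers happened to be stored internally with x values {10.27, -10.19, -10.23} (a hypothetical storage order), this would yield order = {2, 1, 0}, so the x row would be printed as -10.23, -10.19, 10.27, as in the table.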
I hope this helps.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.