
# Calculating the percentage of variance measure for k-means?
On the [Wikipedia page](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set), an elbow method is described for determining the number of clusters in k-means. [The built-in method of scipy](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans) provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.

> More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.

Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

```
import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
                      [ 0,  1],
                      [ 0, -1],
                      [ 1,  0],
                      [-1,  0],
                      [ 9,  9],
                      [ 9, 10],
                      [ 9,  8],
                      [10,  9],
                      [10,  8]])

kmeans(points, 2)
(array([[9, 8],
        [0, 0]]), 0.9414213562373096)
```

I am specifically looking at computing the 0.94... measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?

In short, my questions (all related) are the following:

- Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot? (See the sketch after this list for the setup I mean.)
- How would the methodology change if a different distance function such as cosine similarity is used?
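To make the first bullet concrete, here is a minimal sketch of the inputs I have to work with; the variable names are only illustrative, and the `'euclidean'` metric string is the knob the second bullet would change:

```
import numpy
from scipy.spatial.distance import cdist

points = numpy.array([[0, 0], [0, 1], [0, -1], [1, 0], [-1, 0],
                      [9, 9], [9, 10], [9, 8], [10, 9], [10, 8]])
centroids = numpy.array([[9, 8], [0, 0]])  # from the kmeans run above

# distance matrix: D[i, j] is the distance from point i to centroid j;
# swapping 'euclidean' for e.g. 'cosine' is the change the second
# bullet asks about
D = cdist(points, centroids, 'euclidean')

# mapping of which point belongs to which cluster
labels = numpy.argmin(D, axis=1)

# each point's distance to its own centroid -- the raw material for
# whatever elbow measure turns out to be appropriate
nearest = D[numpy.arange(len(points)), labels]
```

Summing `nearest` (or its square) over all points is the kind of single number I imagine plotting against k, but I am not sure it is the right one.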
**EDIT 2: Distortion**

```
from scipy.spatial.distance import cdist

centroids = numpy.array([[9, 8], [0, 0]])  # from the kmeans run above
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
```

The output for the first set of points is accurate. However, when I try a different set:

```
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
        [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
```

I guess the last value does not match because `kmeans` seems to be dividing the value by the total number of points in the dataset.
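Checking that guess against the scipy docs, which describe the distortion as the mean (not the sum) of the nearest-centroid distances, seems to confirm it; a minimal sketch:

```
import numpy
from scipy.spatial.distance import cdist

pp = numpy.array([[1, 2], [2, 1], [2, 2], [1, 3],
                  [6, 7], [6, 5], [7, 8], [8, 8]])
centroids = numpy.array([[6, 7], [1, 2]])

D = cdist(pp, centroids, 'euclidean')
# dividing the summed nearest-centroid distances by the number of
# points reproduces the 1.1330618877807475 that kmeans returned
print(numpy.min(D, axis=1).sum() / len(pp))
```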
**EDIT 1: Percent Variance**

My code so far (should be added to Denis's k-means implementation):

```
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""

cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)

    indices = [i for i, x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
```

And the following is the output for k=2:

```
Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437
```

On my real dataset (this does not look right to me!):

```
Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
```
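For comparison, here is a sketch of the textbook percentage-of-variance-explained computation (between-cluster sum of squares over the total sum of squares), reusing the `xtoc`/`centres` names from Denis's implementation; the function name is mine. It compares squared coordinate deviations with squared coordinate deviations, so it is bounded by 100, which makes me suspect the variance-of-distances ratio above is why my percentages blow past 100:

```
import numpy as np

def percent_variance_explained(points, xtoc, centres):
    # total sum of squares around the global mean
    points = np.asarray(points, dtype=float)
    total_ss = ((points - points.mean(axis=0)) ** 2).sum()
    # within-cluster sum of squares around each cluster's centroid
    # (assumes xtoc is an integer array of cluster indices)
    within_ss = sum(((points[xtoc == c] - centres[c]) ** 2).sum()
                    for c in np.unique(xtoc))
    # between-cluster share of the total, in percent
    return 100.0 * (total_ss - within_ss) / total_ss
```

With this definition the percentage should climb toward 100 as k grows, and the elbow would be where the gains flatten out.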
 
