
# Calculating the percentage of variance measure for k-means?
On the [Wikipedia page](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set), an elbow method is described for determining the number of clusters in k-means. [The built-in method of scipy](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans) provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.

> More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.

Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

```
import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
                      [ 0,  1],
                      [ 0, -1],
                      [ 1,  0],
                      [-1,  0],
                      [ 9,  9],
                      [ 9, 10],
                      [ 9,  8],
                      [10,  9],
                      [10,  8]])

kmeans(points, 2)
(array([[9, 8],
        [0, 0]]), 0.9414213562373096)
```

I am specifically looking at computing the 0.94... measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?

In short, my questions (all related) are the following:

- Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot? (See the sketch after this list for the setup I mean.)
- How would the methodology change if a different distance function such as cosine similarity is used?
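To make the first bullet concrete, here is a minimal sketch of the inputs I have to work with; the variable names are only illustrative, and the `'euclidean'` metric string is the knob the second bullet would change:

```
import numpy
from scipy.spatial.distance import cdist

points = numpy.array([[0, 0], [0, 1], [0, -1], [1, 0], [-1, 0],
                      [9, 9], [9, 10], [9, 8], [10, 9], [10, 8]])
centroids = numpy.array([[9, 8], [0, 0]])  # from the kmeans run above

# distance matrix: D[i, j] is the distance from point i to centroid j;
# swapping 'euclidean' for e.g. 'cosine' is the change the second
# bullet asks about
D = cdist(points, centroids, 'euclidean')

# mapping of which point belongs to which cluster
labels = numpy.argmin(D, axis=1)

# each point's distance to its own centroid -- the raw material for
# whatever elbow measure turns out to be appropriate
nearest = D[numpy.arange(len(points)), labels]
```

Summing `nearest` (or its square) over all points is the kind of single number I imagine plotting against k, but I am not sure it is the right one.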
**EDIT 2: Distortion**

```
from scipy.spatial.distance import cdist

centroids = numpy.array([[9, 8], [0, 0]])  # from the kmeans run above
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
```

The output for the first set of points is accurate. However, when I try a different set:

```
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
        [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
```

I guess the last value does not match because `kmeans` seems to be dividing the value by the total number of points in the dataset.
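Checking that guess against the scipy docs, which describe the distortion as the mean (not the sum) of the nearest-centroid distances, seems to confirm it; a minimal sketch:

```
import numpy
from scipy.spatial.distance import cdist

pp = numpy.array([[1, 2], [2, 1], [2, 2], [1, 3],
                  [6, 7], [6, 5], [7, 8], [8, 8]])
centroids = numpy.array([[6, 7], [1, 2]])

D = cdist(pp, centroids, 'euclidean')
# dividing the summed nearest-centroid distances by the number of
# points reproduces the 1.1330618877807475 that kmeans returned
print(numpy.min(D, axis=1).sum() / len(pp))
```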
**EDIT 1: Percent Variance**

My code so far (should be added to Denis's k-means implementation):

```
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""

cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)

    indices = [i for i, x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
```

And the following is the output for k=2:

```
Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437
```

On my real dataset (this does not look right to me!):

```
Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
```
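For comparison, here is a sketch of the textbook percentage-of-variance-explained computation (between-cluster sum of squares over the total sum of squares), reusing the `xtoc`/`centres` names from Denis's implementation; the function name is mine. It compares squared coordinate deviations with squared coordinate deviations, so it is bounded by 100, which makes me suspect the variance-of-distances ratio above is why my percentages blow past 100:

```
import numpy as np

def percent_variance_explained(points, xtoc, centres):
    # total sum of squares around the global mean
    points = np.asarray(points, dtype=float)
    total_ss = ((points - points.mean(axis=0)) ** 2).sum()
    # within-cluster sum of squares around each cluster's centroid
    # (assumes xtoc is an integer array of cluster indices)
    within_ss = sum(((points[xtoc == c] - centres[c]) ** 2).sum()
                    for c in np.unique(xtoc))
    # between-cluster share of the total, in percent
    return 100.0 * (total_ss - within_ss) / total_ss
```

With this definition the percentage should climb toward 100 as k grows, and the elbow would be where the gains flatten out.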
 
