<p>The distortion, as far as <a href="http://en.wikipedia.org/wiki/K-means_clustering" rel="noreferrer">K-means</a> is concerned, is used as a stopping criterion (if the change between two iterations is less than some threshold, we assume convergence).</p>

<p>If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB using the <a href="http://www.mathworks.com/help/stats/pdist2.html" rel="noreferrer"><code>pdist2</code></a> function, but it should be straightforward to rewrite in Python/NumPy/SciPy):</p>

<pre class="lang-matlab prettyprint-override"><code>% data
X = [0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

% centroids
C = [9 8 ; 0 0];

% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');

% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);
</code></pre>

<p>The result:</p>

<pre class="lang-matlab prettyprint-override"><code>% total distortion
&gt;&gt; sum(distortions)
ans =
    9.4142135623731
</code></pre>
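<p>A minimal NumPy/SciPy sketch of the same computation (using <code>scipy.spatial.distance.cdist</code> in place of <code>pdist2</code>) might look like this:</p>

<pre class="lang-py prettyprint-override"><code>import numpy as np
from scipy.spatial.distance import cdist

# same toy data and centroids as the MATLAB snippet above
X = np.array([[0,1],[0,-1],[1,0],[-1,0],[9,9],[9,10],[9,8],[10,9],[10,8]])
C = np.array([[9,8],[0,0]])

# euclidean distance from each point to each cluster centroid
D = cdist(X, C, 'euclidean')

# closest centroid to each point, and the corresponding distance
idx = D.argmin(axis=1)
distortions = D.min(axis=1)

print(distortions.sum())   # total distortion, ~9.4142
</code></pre>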
<hr>
<h2>EDIT#1:</h2>

<p>I had some time to play around with this. Here is an example of K-means clustering applied to the <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set" rel="noreferrer">Fisher Iris dataset</a> (4 features, 150 instances). We iterate over <code>k=1..9</code>, plot the elbow curve, pick <code>K=3</code> as the number of clusters, and show a scatter plot of the result.</p>

<p>Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html" rel="noreferrer"><code>scipy.cluster.vq.kmeans</code></a> function returns this measure by default (computed with Euclidean as a distance measure). You can also use the <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html" rel="noreferrer"><code>scipy.spatial.distance.cdist</code></a> function to calculate the distances with the distance function of your choice (provided you obtained the cluster centroids using the same distance measure: <a href="https://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means">@Denis</a> has a solution for that), then compute the distortion from that.</p>

<pre class="lang-py prettyprint-override"><code>import numpy as np
from scipy.cluster.vq import kmeans, vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

##### cluster data into K=1..9 clusters #####
K = range(1,10)

# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM]   # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares

# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

##### plots #####
kIdx = 2

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12,
        markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2], X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()

plt.show()
</code></pre>

<p><img src="https://i.stack.imgur.com/BzwBY.png" alt="elbow_curve"> <img src="https://i.stack.imgur.com/aOivI.png" alt="scatter_plot"></p>
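<p>Picking the elbow by eye can be subjective. One simple heuristic (just a sketch, not the only way to do it) is to choose the <code>k</code> whose point lies farthest from the straight line joining the first and last points of the curve:</p>

<pre class="lang-py prettyprint-override"><code>import numpy as np

def pick_elbow(ks, scores):
    """Return the k whose (k, score) point is farthest from the line
    joining the endpoints of the elbow curve."""
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # normalize both axes so neither scale dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (scores - scores.min()) / (scores.max() - scores.min())
    # perpendicular distance of each point from the endpoint-to-endpoint line
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    line = (p2 - p1) / np.linalg.norm(p2 - p1)
    vecs = np.column_stack([x, y]) - p1
    proj = np.outer(vecs.dot(line), line)
    dists = np.linalg.norm(vecs - proj, axis=1)
    return int(ks[np.argmax(dists)])

# e.g. with the K and avgWithinSS lists computed above:
# print(pick_elbow(K, avgWithinSS))
</code></pre>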
<hr>
<h2>EDIT#2:</h2>

<p>In response to the comments, I give below another complete example using the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html" rel="noreferrer">NIST hand-written digits dataset</a>: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: <a href="http://en.wikipedia.org/wiki/Principal_component_analysis" rel="noreferrer">Principal Component Analysis</a> is applied to reduce the dimensionality from 64 down to 2:</p>

<pre class="lang-py prettyprint-override"><code>import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist, pdist
from sklearn import datasets
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from matplotlib import cm

##### data #####

# load digits dataset
data = datasets.load_digits()
t = data['target']

# perform PCA dimensionality reduction
# (this originally used RandomizedPCA, which newer scikit-learn versions
#  replace with PCA(svd_solver='randomized'))
pca = PCA(n_components=2, svd_solver='randomized').fit(data['data'])
X = pca.transform(data['data'])

##### cluster data into K=1..20 clusters #####

K_MAX = 20
KK = range(1,K_MAX+1)

KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]

tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
betweenss = totss - tot_withinss          # The between-cluster sum of squares

##### plots #####

kIdx = 9        # K=10
clr = cm.nipy_spectral( np.linspace(0,1,10) ).tolist()  # cm.spectral in older matplotlib
mrk = 'os^p&lt;dvh8&gt;+x.'

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12,
        markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )

# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0], X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0], X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])

plt.show()
</code></pre>

<p><img src="https://i.stack.imgur.com/2HTnP.png" alt="elbow_curve"> <img src="https://i.stack.imgur.com/qjFQh.png" alt="digits_centroids"> <img src="https://i.stack.imgur.com/ZEgvY.png" alt="PCA_compare"></p>

<p>You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.</p>

<p>Note: An implementation of <a href="http://scikit-learn.org/stable/modules/clustering.html#k-means" rel="noreferrer">K-means</a> is included in <a href="http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html" rel="noreferrer"><code>scikit-learn</code></a> (as well as many other clustering algorithms and various <a href="http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation" rel="noreferrer">clustering metrics</a>). <a href="http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html" rel="noreferrer">Here</a> is another similar example.</p>
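<p>With <code>scikit-learn</code>, a bare-bones version of the same elbow curve might look like the sketch below, using the <code>inertia_</code> attribute (the within-cluster sum of squares) of a fitted <code>KMeans</code> model:</p>

<pre class="lang-py prettyprint-override"><code>import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X = load_digits().data   # 1797 x 64

# fit KMeans for a range of k and collect the within-cluster sum of squares
ks = range(1, 21)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, 'b*-')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow for sklearn.cluster.KMeans')
plt.show()
</code></pre>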