Here's a small kmeans that uses any of the 20-odd distances in [scipy.spatial.distance](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html#scipy.spatial.distance.cdist), or a user function. Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric? (The code is Python 2: `print` statements, `xrange`.)

```python
#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs )[0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs )[:,0]  # the whole column of |X| distances to y
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs )[0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
        euclidean2 (~ withinss) is more sensitive to outliers,
        cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q).mean() if y is not None \
        else (np.abs(x) ** q).mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1, Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )  # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py
```

Some notes added 26 Mar 2012:

1) For cosine distance, first normalize all the data vectors to |X| = 1; then

```
cosinedistance( X, Y ) = 1 - X . Y = |X - Y|^2 / 2
```

is fast: on unit vectors, cosine distance is just half the squared Euclidean distance. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors with, say, 1 % of N entries nonzero, X . Y should take time O(2 % of N) and space O(N); but I don't know which programs do that.
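A minimal sketch of that identity (the `normalize_rows` helper is hypothetical, not part of the answer's code): after normalizing rows once up front, `1 - X . Y` and `|X - Y|^2 / 2` are the same number.

```python
import numpy as np

def normalize_rows(X):
    # Hypothetical helper: scale each row of X to unit L2 norm.
    norms = np.sqrt((X ** 2).sum(axis=1))
    return X / norms[:, np.newaxis]

X = np.random.exponential(size=(5, 10))
Xn = normalize_rows(X)
x, y = Xn[0], Xn[1]

# On unit vectors: cosine distance = 1 - x.y = |x - y|^2 / 2
print(1 - x.dot(y))
print(((x - y) ** 2).sum() / 2)  # prints the same number
```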
2) [Scikit-learn clustering](http://scikit-learn.org/stable/modules/clustering.html) gives an excellent overview of k-means, mini-batch k-means ... with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you're expecting roughly equal-sized clusters, but they come out `[44 37 9 5 5] %` ... (sound of head-scratching).
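For the size check in 3), `np.bincount` on the labels that `kmeans` returns gives the sizes directly. A small sketch with made-up labels:

```python
import numpy as np

# xtoc: one cluster label per row of X, as returned by kmeans() / kmeanssample()
xtoc = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])  # toy labels for illustration
sizes = np.bincount(xtoc)
print("cluster sizes %:", (100.0 * sizes / sizes.sum()).round().astype(int))
# e.g. [30 20 50] -- compare against what you expected
```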
Comments:

1. (+1) First of all, thank you for sharing your implementation. I just wanted to confirm that the algorithm works great for my dataset of 900 vectors in a 700-dimensional space. I was just wondering if it is also possible to evaluate the quality of the clusters generated. Can any of the values in your code be reused to compute the cluster quality, to aid in selecting the optimal number of clusters?
2. Legend, you're welcome. (I've updated the code to print the cluster 50 % / 90 % radii.) "Cluster quality" is a largish topic: how many clusters do you have, and do you have training samples with known clusters, e.g. from experts? On the number of clusters, see [how-do-i-determine-k-when-using-k-means-clustering](http://stackoverflow.com/questions/1793532/how-do-i-determine-k-when-using-k-means-clustering) on SO.
3. Thank you once again. Actually, I do not have training samples, but I am trying to verify the clusters manually after classification (trying to play the role of the domain expert as well). I am performing document-level classification after applying SVD to the original documents and reducing their dimension. The results seem good, but I wasn't sure how to validate them. While exploring various cluster-validity metrics, I came across Dunn's Index, the Elbow method, etc.; I wasn't really sure which one to use, so I thought I would start off with the Elbow method.
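For the Elbow method mentioned in the last comment, a hedged sketch (it assumes the answer's `kmeanssample` is already in scope, and the exponential data is a stand-in for your own): run it for a range of k, record the average point-to-centre distance it returns, and look for the k where the curve stops dropping sharply.

```python
import numpy as np

# Elbow sketch: average |X - nearest centre| vs. number of clusters k,
# using kmeanssample() from the answer above.
X = np.random.exponential(size=(1000, 10))  # stand-in for your data
for k in range(2, 11):
    centres, xtoc, distances = kmeanssample(X, k, nsample=200,
                                            delta=.001, maxiter=10,
                                            metric="cityblock", verbose=0)
    print("k=%2d  av distance %.4g" % (k, distances.mean()))
# The "elbow" is where adding more clusters stops reducing the average much.
```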