<p>If you explicitly want to use k-means, you could study the article describing <a href="http://staff.utia.cas.cz/nagy/skola/Projekty/Classification/Xmeans.pdf" rel="nofollow noreferrer">x-means</a>. When using an implementation of x-means, the only difference compared to k-means is that rather than specifying a single k, you specify a range for k. The "best" choice in that range, with respect to some measure, will be part of the output from x-means. You can also look into the <a href="http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf" rel="nofollow noreferrer">Mean Shift</a> clustering algorithm.</p>
<p>If it is computationally feasible with your given data (possibly using sampling as yura suggests), you could do clustering with various values of k and evaluate the quality of the resulting clusters using some of the standard cluster validity measures. Some of the classic measures are described here: <a href="http://machaon.karanagai.com/validation_algorithms.html" rel="nofollow noreferrer">measures</a>.</p>
<p>@doug It is not correct that k-means++ determines an optimal k for the number of clusters before cluster assignments start. k-means++ differs from k-means only in how the initial k centroids are chosen: instead of picking all of them at random, it picks one initial centroid at random and then chooses further centers one at a time until k have been chosen. After the initial completely random choice, each data point is chosen as a new centroid with a probability determined by a potential function that depends on the data point's distance to the already chosen centers. The standard reference for k-means++ is <a href="http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf" rel="nofollow noreferrer">k-means++: The Advantages of Careful Seeding</a> by Arthur and Vassilvitskii.</p>
<p>Also, I don't think that in general choosing k to be the number of principal components will improve your clustering. Imagine data points in three-dimensional space all lying in a plane passing through the origin. You will then get 2 principal components, but the "natural" clustering of the points could have any number of clusters.</p>
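<p>To make the k-means++ seeding step described above concrete, here is a minimal pure-Python sketch. It assumes points are tuples of numbers and that the selection probability is proportional to the squared distance to the nearest already chosen center (the D² weighting from the Arthur and Vassilvitskii paper). The function names <code>squared_distance</code> and <code>kmeanspp_seeds</code> are made up for this illustration and do not come from any library. Note that k is still an input: the seeding only picks starting centroids, it does not choose k.</p>

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centroids from `points` using D^2-weighted sampling.

    Illustrative sketch only; `k` must be supplied by the caller.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = rng or random.Random()

    # First centroid: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample index i with probability d2[i] / total.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update each point's distance to its nearest center.
        d2 = [min(old, squared_distance(p, points[idx]))
              for p, old in zip(points, d2)]
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
    print(kmeanspp_seeds(data, 3, random.Random(0)))
```

<p>The returned centroids would then be handed to an ordinary k-means loop. Points far from every chosen center are more likely to be picked next, which is why the seeds tend to spread across the data.</p>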
 
