<p>The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.</p> <blockquote> <p><em>How would I determine if my [clustering] algorithm works correctly?</em></p> </blockquote> <p>k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"</p> <p>Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just <em>this ratio</em>: </p> <p><strong><em>inter-centroidal separation</em></strong> / <strong><em>intra-cluster variance</em></strong></p> <p><em>As the value of this ratio increases, the quality of your clustering result increases.</em> </p> <p>This is intuitive. The first of these metrics is just how far apart each cluster is from the others (measured between cluster centers). </p> <p>But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation even though one is clearly better, because its clusters are "tighter" (i.e., have smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. It is just the mean variance, calculated per cluster. </p> <p>In sum, the <em>ratio of inter-centroidal separation to intra-cluster variance</em> is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or for comparing results from the same algorithm run under different parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k). 
</p> <p>The desired result is tight (small) clusters, each one far away from the others.</p> <p>The calculation is simple:</p> <p>For <em>inter-centroidal separation</em>:</p> <ul> <li><p>calculate the pair-wise distances between cluster centers; then</p></li> <li><p>calculate the median of those distances. <br/><br/></p></li> </ul> <p>For <em>intra-cluster variance</em>:</p> <ul> <li><p>for each cluster, calculate the distance of every data point in that cluster from its cluster center; next</p></li> <li><p>(for each cluster) calculate the variance of the sequence of distances from the step above; then</p></li> <li><p>average these variance values.</p></li> </ul> <hr> <p>That's my answer to the first question. Here's the second question:</p> <blockquote> <p><em>Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?</em></p> </blockquote> <p>First, the easier question--is Euclidean distance a valid metric as the number of dimensions/features increases? </p> <p>Euclidean distance is perfectly scalable--it works for two dimensions or two thousand. For any pair of data points:</p> <ul> <li><p>subtract their feature vectors element-wise,</p></li> <li><p>square each item in the resulting vector,</p></li> <li><p>sum those squares,</p></li> <li><p>take the square root of that scalar.</p></li> </ul> <p>Nowhere in this sequence of calculations is scale implicated.</p> <p>But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it include discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.</p> <p>In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.</p> <p><strong>To identify an appropriate similarity metric given your data:</strong></p> <p><img src="https://i.stack.imgur.com/mLMvx.png" alt="flow diagram for selecting a similarity metric"></p>
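<p>The two metrics above are simple enough to compute directly. Here is a minimal pure-Python sketch (the function names are my own, not from any library) that follows the steps exactly as listed: Euclidean distance as element-wise difference, square, sum, square root; inter-centroidal separation as the median of pairwise center distances; intra-cluster variance as the average, across clusters, of the variance of point-to-center distances.</p>

```python
import math
from statistics import mean, median, pvariance

def euclidean(a, b):
    # subtract element-wise, square, sum, take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # per-coordinate mean of all points in one cluster
    return [mean(coord) for coord in zip(*points)]

def clustering_quality(clusters):
    """clusters: a list of clusters, each a list of feature vectors.
    Returns inter-centroidal separation / intra-cluster variance;
    higher means tighter clusters that sit farther apart."""
    centers = [centroid(c) for c in clusters]
    # inter-centroidal separation: median of pairwise center distances
    pairwise = [euclidean(centers[i], centers[j])
                for i in range(len(centers))
                for j in range(i + 1, len(centers))]
    separation = median(pairwise)
    # intra-cluster variance: variance of point-to-center distances,
    # computed per cluster, then averaged across clusters
    variances = [pvariance([euclidean(p, ctr) for p in c])
                 for c, ctr in zip(clusters, centers)]
    return separation / mean(variances)
```

<p>Given two partitions of comparable data, the one with tight, well-separated clusters will score higher, so the function can be used as-is to compare runs with different values of k or different distance settings.</p>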