Python - Clustering with K-means. Some columns with zero variance
<p>I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these using heatmaps like <img src="https://imgur.com/EYZuYfB.png" alt="this">. Each array is pretty sparse, with only about 1-7 of the 20 values at each of the 99 positions being nonzero.</p>
<p>I would like to cluster these samples by how similar their frequency profiles are (minimum Euclidean distance or something like that). I have flattened each 99x20 array into a 1980x1 array and aggregated them into a 200x1980 observation array.</p>
<p>Before finding the clusters, I tried whitening the data using <code>scipy.cluster.vq.whiten</code>. <code>whiten</code> normalizes each column by its standard deviation, but because of the way I've flattened my data arrays, I have some (8) columns with all zero frequencies, so their standard deviation is zero. The whitened array therefore has infinite values, and the centroid finding fails (or gives ~200 centroids).</p>
<p>My question is, how should I go about resolving this? So far, I've tried:</p>
<ul>
<li>Not whitening the data. This causes k-means to give different centroids every time it's run (somewhat expected), despite increasing the <code>iter</code> keyword considerably.</li>
<li>Transposing the arrays before I flatten them. The zero-variance columns just shift.</li>
</ul>
<p>Is it OK to just delete these zero-variance columns? Would this bias the clustering in any way?</p>
<p>EDIT: I have also tried using my own whiten function, which just does</p>
<pre><code>import numpy as np

for i in range(arr.shape[1]):
    std = arr[:, i].std()
    if std &lt; 1e-8:  # skip (near-)constant columns to avoid dividing by zero
        continue
    arr[:, i] /= std
</code></pre>
<p>This seems to work, but I'm not sure whether it biases the clustering in any way.</p>
<p>Thanks</p>
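<p>For reference, here is a minimal sketch of the "delete the zero-variance columns" option. It is not run on the real data: <code>obs</code> is a synthetic stand-in, and the threshold <code>1e-8</code>, the cluster count <code>5</code>, and the use of <code>kmeans2</code> with a fixed <code>seed</code> are all assumptions made for illustration. It checks that constant columns contribute nothing to pairwise Euclidean distances, so removing them leaves those distances unchanged, and that whitening the remaining columns yields only finite values.</p>

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 200 observations x 1980
# features, sparse and non-negative. The real frequency arrays are not used here.
obs = rng.random((200, 1980)) * (rng.random((200, 1980)) < 0.2)
obs[:, :8] = 0.0  # force 8 all-zero (zero-variance) columns, as in the question

# Keep only columns whose standard deviation is not (numerically) zero.
keep = obs.std(axis=0) > 1e-8
reduced = obs[:, keep]

# Constant columns add nothing to pairwise Euclidean distances,
# so the distances before and after dropping them should match.
same_distances = np.allclose(pdist(obs), pdist(reduced))

# Whitening the reduced array involves no division by zero.
whitened = whiten(reduced)
all_finite = bool(np.isfinite(whitened).all())

# A fixed seed makes the random initialisation repeatable between runs.
# k=5 is an arbitrary choice for this sketch.
centroids, labels = kmeans2(whitened, 5, minit="++", seed=0)
```

<p>Fixing the seed only makes a single run reproducible. It does not show that the resulting clusters are stable, so comparing results across several seeds would still be worthwhile.</p>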
 
