<p>There are several tricks to make <em>k</em>-means work for text:</p>
<ol>
<li>Drop the terms that occur in only a few documents (that is, terms with low document frequency, df). These artificially blow up the distances in vector space.</li>
<li>Normalize vectors. That helps a bit, since it gets rid of length differences between documents. It also makes document vectors have similar variances, as pointed out by @Anony-Mousse.</li>
<li>Perform dimensionality reduction using <a href="http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf" rel="noreferrer">LSA</a>, aka truncated <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition" rel="noreferrer">SVD</a>, before doing the actual clustering. That helps a lot. (Be sure to normalize the LSA results as well.)</li>
</ol>
<p>Short explanation of why normalization works: suppose you have three documents {d₁, d₂, d₃}, and the tiny vocabulary {cat, dog, tax}. The document-term matrix (raw counts or tf-idf, doesn't matter) looks like</p>
<pre><code>   | cat | dog | tax
d₁ | 100 | 100 |   0
d₂ |  10 |  10 |   0
d₃ |   0 |   0 | 100
</code></pre>
<p>Now we're going to do 2-means. We can reasonably expect to find a pets cluster {d₁, d₂} and a finance singleton cluster {d₃}. However, the Euclidean distances between the pairs are</p>
<pre><code>D(d₁, d₂) = 127.28
D(d₁, d₃) = 173.21
D(d₂, d₃) = 101.00
</code></pre>
<p>so a distance-based method like <em>k</em>-means will tend to group d₂ with d₃, because d₂ is closer to d₃ than to d₁. By normalizing the vectors, you effectively map both d₁ and d₂ to the same vector [0.71, 0.71, 0], so D(d₁, d₂) = 0 and they will always be in the same cluster.</p>
<p>(<em>k</em>-means applied to normalized vectors is sometimes called "spherical" <em>k</em>-means because unit vectors lie on a hypersphere centered at the origin.)</p>
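The distance argument above can be checked with a few lines of plain Python. This is only a minimal sketch: the dictionary `docs` and the helper names `euclidean`, `l2_normalize` and `pairwise` are made up for illustration and are not part of any library. It computes the pairwise Euclidean distances for the three toy documents, once on the raw counts and once after scaling each vector to unit length.

```python
import math
from itertools import combinations

# Toy document vectors over the vocabulary (cat, dog, tax), as in the example above.
docs = {
    "d1": [100.0, 100.0, 0.0],
    "d2": [10.0, 10.0, 0.0],
    "d3": [0.0, 0.0, 100.0],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l2_normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def pairwise(vectors):
    """Distances for every unordered pair of named vectors."""
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


raw = pairwise(docs)
unit = pairwise({name: l2_normalize(v) for name, v in docs.items()})

# Print raw and normalized distances side by side for each pair.
for pair in raw:
    print(pair, round(raw[pair], 2), round(unit[pair], 2))
```

On the raw counts, d2 ends up closer to d3 than to d1; after normalization, d1 and d2 coincide (up to floating-point rounding) and both sit at the same distance from d3.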