If you add a document to a collection of documents, then unless that document is exactly the same as the existing collection, the distribution of words or terms in the collection is going to change to accommodate the newly added words. The question arises: is that really what you want to do with the third document?

[Kullback-Leibler divergence](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) is a measure of divergence between two distributions. What are your two distributions?

If your distribution is the probability of a certain word being selected at random from a document, then the space over which you have probability values is the collection of words which make up your documents. For your first two documents (I assume this is your entire collection), you can build a word space of 7 terms. Treating the documents as bags of words, the probability of each word being selected at random is:

```
            doc 1   doc 2   doc 3   (lem)
answers     0.2     0.2     0.0     0.2
are         0.2     0.2     0.0     0.2
cross       0.2     0.0     0.33    0.2
good        0.2     0.0     0.33    0.2
nice        0.0     0.2     0.0     0.0
simply      0.0     0.2     0.33    0.2
validated   0.2     0.2     0.0     0.0
```

[This is calculated as the term frequency divided by the document length. Notice that the new document has word forms that aren't the same as the words in doc 1 and doc 2. The (lem) column shows the probabilities you would get if you stemmed or lemmatized the pairs (are/is) and (answer/answers) to the same term.]

With the third document in the picture, a typical thing you might want to do with Kullback-Leibler divergence is compare a new document or collection of documents against already-known documents or collections.

Computing the Kullback-Leibler divergence `D(P||Q)` produces a value signifying how well the true distribution `P` is captured by using the substitute distribution `Q`. So `Q1` could be the distribution of words in doc 1, and `Q2` the distribution of words in doc 2. Computing the KL divergence with `P` as the distribution of words in the new document (doc 3), you get a measure of how divergent the new document is from doc 1 and how divergent it is from doc 2. Using this information, you can say how similar the new document is to your known documents/collections.
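For reference, the discrete form of the divergence over the shared vocabulary $X$ is

$$D(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)},$$

which is only finite when `Q` gives non-zero probability to every word that `P` actually uses.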
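As a minimal sketch of how this could be computed, here is some Python. The documents, the `epsilon` smoothing value, and the helper names are my own assumptions, not something from your question; the smoothing is needed because, per the formula above, the raw `D(P||Q)` is infinite whenever the new document uses a word the reference document never does:

```python
import math
from collections import Counter

def word_distribution(doc, vocab, epsilon=1e-9):
    """Term frequency divided by document length, with a small epsilon
    added to every term so that no probability is exactly zero (plain
    KL divergence blows up when Q assigns zero mass to a word that P
    actually uses)."""
    words = doc.split()
    counts = Counter(words)
    dist = {w: counts[w] / len(words) + epsilon for w in vocab}
    norm = sum(dist.values())          # renormalize after smoothing
    return {w: p / norm for w, p in dist.items()}

def kl_divergence(p, q):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical lemmatized documents chosen to match the table above
doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
doc3 = "answers are cross good simply"       # the new document

vocab = set((doc1 + " " + doc2 + " " + doc3).split())

P  = word_distribution(doc3, vocab)   # true distribution: the new doc
Q1 = word_distribution(doc1, vocab)   # substitute: doc 1
Q2 = word_distribution(doc2, vocab)   # substitute: doc 2

print(f"D(P||Q1) = {kl_divergence(P, Q1):.4f}")
print(f"D(P||Q2) = {kl_divergence(P, Q2):.4f}")
```

Whichever of the two values is smaller identifies the known document whose word distribution is the better substitute for the new document's, i.e. the document it is most similar to in this sense.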