If you add a document to a collection of documents, the distribution of words or terms in the collection is going to change to accommodate the newly added words, unless that document has exactly the same distribution as the collection. The question arises: is that really what you want to do with the third document?

[Kullback-Leibler divergence](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) is a measure of divergence between two distributions. What are your two distributions?

If your distribution is the probability of a certain word being selected at random in a document, then the space over which you have probability values is the collection of words which make up your documents. For your first two documents (I assume this is your entire collection), you can build a word space of 7 terms. The probabilities of a word being selected at random from the documents, treated as bags of words, are:

```
            doc 1   doc 2   doc 3   (lem)
answers      0.2     0.2     0.0     0.2
are          0.2     0.2     0.0     0.2
cross        0.2     0.0     0.33    0.2
good         0.2     0.0     0.33    0.2
nice         0.0     0.2     0.0     0.0
simply       0.0     0.2     0.33    0.2
validated    0.2     0.2     0.0     0.0
```

[This is calculated as the term frequency divided by the document length. Notice that the new document has word forms that aren't the same as the words in doc 1 and doc 2. The (lem) column shows the probabilities for doc 3 if you stemmed or lemmatized the pairs (are/is) and (answers/answer) to the same terms.]

Introducing the third document into the scenario: a typical activity you might want to do with Kullback-Leibler divergence is compare a new document or collection of documents with already-known documents or collections of documents.

Computing the Kullback-Leibler divergence `D(P||Q) = sum_i P(i) * log(P(i)/Q(i))` produces a value signifying how well the true distribution `P` is captured by using the substitute distribution `Q`. So `Q1` could be the distribution of words in doc 1, and `Q2` could be the distribution of words in doc 2. Computing the KL divergence with `P` being the distribution of words in the new document (doc 3), you get a measure of how divergent the new document is from doc 1 and how divergent it is from doc 2. Using this information, you can say how similar the new document is to your known documents/collections.
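As a rough sketch of that computation, the table above can be hard-coded and run through `D(P||Q)`. The epsilon smoothing is my addition, not something the answer specifies; it is needed because the divergence is infinite whenever `P` gives a word positive probability and `Q` gives it zero (as happens with "simply" in doc 1):

```python
import math

# Rows of the probability table above, in vocabulary order:
# answers, are, cross, good, nice, simply, validated.
q1 = [0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.2]  # doc 1
q2 = [0.2, 0.2, 0.0, 0.0, 0.2, 0.2, 0.2]  # doc 2
p  = [0.2, 0.2, 0.2, 0.2, 0.0, 0.2, 0.0]  # doc 3, lemmatized (lem)

def kl_divergence(p, q, eps=1e-9):
    """D(P||Q) = sum over i of P(i) * log(P(i) / Q(i)), in nats.

    eps is a naive smoothing constant (my assumption): each Q(i)
    gets a tiny floor and the vector is renormalized so it still
    sums to 1, which keeps the divergence finite.
    """
    smoothed_q = [(qi + eps) / (1.0 + eps * len(q)) for qi in q]
    return sum(pi * math.log(pi / qi)
               for pi, qi in zip(p, smoothed_q) if pi > 0)

print("D(doc3 || doc1) =", kl_divergence(p, q1))
print("D(doc3 || doc2) =", kl_divergence(p, q2))
```

Smaller values mean the known document approximates doc 3 better. With a floor this crude, the unmatched words dominate both scores, so in practice you would smooth against the whole collection (for example Jelinek-Mercer or Dirichlet smoothing) rather than use a bare epsilon.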
Comments:

1. Thank you so much, Atreys, for your detailed and clear answer. I have a question here: what is the difference between TF and probability? It always confuses me. Can we use simple TF to calculate the probability distribution of a document? I've heard there are ways to normalize a document for a probability distribution; do you have any idea what that normalization is?
2. Term frequency is the frequency of terms in a document. If the term "dog" appears three times in a document, then the term frequency is 3. If the document has 8000 terms in it, the probability of the term being chosen at random from the document is 3/8000. For IR, a more useful calculation is TF-IDF, which weights term frequency by inverse document frequency: if the term "dog" appears only 8 times across your whole corpus, that ratio of 3/8 is potentially highly significant when you have a hundred or so documents. The probability distribution I showed was the probability of a term being chosen from a document if you were choosing words by just going to a random indexed word in the document and looking at it. Dividing the TF by the document length is what I did; that's the normalization used in probability vectors: divide the vector by the sum of its components, so everything adds up to 1 (see the sketch after these comments). If you don't have a book on information retrieval, I found [Introduction to Information Retrieval](http://nlp.stanford.edu/IR-book/information-retrieval-book.html) to be very accessible as an easy-to-follow intro to the field.
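A minimal sketch of that normalization, assuming a simple whitespace tokenizer (the example sentence is mine, not from the thread):

```python
from collections import Counter

def to_probability_vector(doc):
    """Turn raw term frequencies into a probability distribution:
    divide each count by the total number of terms, so the
    components sum to 1."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    return {term: tf / total for term, tf in counts.items()}

print(to_probability_vector("the dog saw the cat"))
# {'the': 0.4, 'dog': 0.2, 'saw': 0.2, 'cat': 0.2}
```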