<p>There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.</p>
<p>What you're looking for is called Information Retrieval.</p>
<p>It usually uses the Bag of Words model.</p>
<p>Say you have two documents:</p>
<pre><code>DOCUMENT A
Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
</code></pre>
<p>and this one:</p>
<pre><code>DOCUMENT B
Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
</code></pre>
<p>and you have a query, or something you want to find other relevant documents for:</p>
<pre><code>QUERY aka DOCUMENT C
precious wonderful life
</code></pre>
<p>So how do you calculate which of the two documents is the most "relevant"? Here's how:</p>
<ol>
<li>tokenize each document (break it into words, removing all non-letters)</li>
<li>lowercase everything</li>
<li>remove stopwords (and, the, etc.)</li>
<li>consider stemming (removing the suffix; see the Porter or Snowball stemming algorithms)</li>
<li>consider using n-grams</li>
</ol>
<p>You can count the word frequencies to get the "keywords".</p>
<p>Then you make one column for each word, and calculate the word's importance to the document relative to its importance across all the documents. This is called the TF-IDF metric.</p>
<p>Now you have this:</p>
<pre><code>Doc   precious   worf   life ...
A     0.5        0.0    0.2
B     0.0        0.9    0.0
C     0.7        0.0    0.9
</code></pre>
<p>Then you calculate the similarity between the documents, using the Cosine Similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.</p>
<p>Now, you seem to want to find the most similar paragraphs, so just call each paragraph a document, or consider using Sliding Windows over the document instead.</p>
<p>You can see my video here. It uses a graphical Java tool, but explains the concepts:</p>
<p><a href="http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html" rel="nofollow">http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html</a></p>
<p>Here is a decent IR book:</p>
<p><a href="http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf" rel="nofollow">http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf</a></p>
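<p>If you would rather see the idea in code than in a graphical tool, here is a minimal Python sketch of the steps above: tokenize, lowercase, drop stopwords, weight with TF-IDF, then rank by cosine similarity. The stopword list, the function names (<code>tokenize</code>, <code>tfidf_vectors</code>, <code>cosine</code>) and the particular TF-IDF variant (relative term frequency times <code>log(N / df)</code>) are my own assumptions for illustration, and stemming and n-grams are left out, so treat it as a toy rather than a reference implementation.</p>

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "it", "was", "what", "all", "that", "but", "will", "a", "of"}


def tokenize(text):
    """Lowercase the text, keep only runs of letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: weight} TF-IDF vector.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common TF-IDF variant; others use different smoothing.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        vectors[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "A": "Seize the time, Meribor. Live now; make now always the most "
         "precious time. Now will never come again",
    "B": "Worf, it was what it was glorious and wonderful and all that, "
         "but it doesn't mean anything",
    "C": "precious wonderful life",  # the query
}

vectors = tfidf_vectors(docs)

# Compare the query (C) against each candidate document.
scores = {name: cosine(vectors["C"], vectors[name]) for name in ("A", "B")}
best = max(scores, key=scores.get)
print(scores, "most relevant:", best)
```

<p>Note that without stemming, "life" in the query never matches "Live" in DOCUMENT A, which is exactly the kind of miss that step 4 is meant to address. To compare paragraphs instead of whole documents, pass one entry per paragraph in <code>docs</code>.</p>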
 
