<p>A few ideas you could investigate further:</p> <ol> <li><p>Index the documents and then search the index for similar documents. Open source indexing and search systems such as <a href="http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/" rel="nofollow noreferrer">Solr</a>, <a href="http://sphinxsearch.com/about/" rel="nofollow noreferrer">Sphinx</a> or <a href="http://framework.zend.com/manual/en/zend.search.lucene.overview.html" rel="nofollow noreferrer">Zend Search Lucene</a> could come in handy.</p></li> <li><p>Use the <a href="http://matpalm.com/resemblance/simhash/" rel="nofollow noreferrer">simhash algorithm</a> or <a href="http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html" rel="nofollow noreferrer">shingling</a>. Briefly, simhash computes similar hash values for similar documents, so you can store the hash against each document and compare the stored hashes to check how similar various documents are.</p></li> </ol> <p>Other algorithms that may give you some ideas:</p> <ol> <li><p><a href="http://en.wikipedia.org/wiki/Levenshtein_distance" rel="nofollow noreferrer">Levenshtein distance</a></p></li> <li><p><a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" rel="nofollow noreferrer">Bayesian filtering</a> (<a href="https://stackoverflow.com/search?q=bayesian%20filtering">SO questions on Bayesian filtering</a>). The first link points to the Wikipedia article on Bayesian spam filtering, but the algorithm can be adapted to what you are trying to do.</p></li> </ol>
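<p>To make the simhash idea concrete, here is a minimal sketch in Python. It is not taken from the linked articles: the function names (<code>shingles</code>, <code>simhash</code>, <code>hamming_distance</code>), the 64-bit fingerprint width, the 3-word shingle size and the choice of BLAKE2b as the per-shingle hash are all assumptions made for illustration. The idea is that each shingle votes on every bit of the fingerprint, so documents sharing most of their shingles should end up with fingerprints that differ in only a few bits.</p>

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; matches the 8-byte digest used below


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, k=3):
    """Compute a 64-bit simhash fingerprint from the text's word shingles."""
    counts = [0] * BITS
    for shingle in shingles(text, k):
        # Hash each shingle to 64 bits (BLAKE2b is an arbitrary choice here).
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "The quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "The quick brown fox jumps over the lazy dog near the river"
    doc_c = "Stock markets closed higher today after a volatile trading session"
    fa, fb, fc = simhash(doc_a), simhash(doc_b), simhash(doc_c)
    print("a vs b:", hamming_distance(fa, fb))
    print("a vs c:", hamming_distance(fa, fc))
```

<p>You would store the fingerprint next to each document and treat two documents as near-duplicates when the Hamming distance falls below a threshold you tune on your own data. Very short texts produce few shingles, so their fingerprints are noisy.</p>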