
    <ol> <li><p>First, normalize the text: convert it to all lowercase (or uppercase) characters, replace every non-letter with a space, collapse runs of whitespace into a single space, and strip leading and trailing whitespace. For speed, I would perform all of these operations in one pass over the text. Next, take the <code>MD5</code> hash (or something faster) of the resulting string and look it up (as two 64-bit integers) in a database table. If the hash exists, the text is an <em>exact</em> duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based on either time or memory usage.</p></li> <li><p>To find near duplicates, the normalized string needs to be converted into candidate signatures (hashes of substrings); see the <code>SpotSigs</code> paper and the <a href="http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html" rel="noreferrer">blog post</a> by Greg Linden. Suppose the routine <code>Sigs()</code> does that for a given string: given the normalized string <code>x</code>, <code>Sigs(x)</code> returns a small set (1-5) of 64-bit integers. You could use something like the <code>SpotSigs</code> algorithm to select the substrings in the text for the signatures, but writing your own selection method could perform better if you know something about your data. You may also want to look at the <code>simhash</code> algorithm (the code is <a href="http://code.google.com/p/simhash/" rel="noreferrer">here</a>).</p></li> <li><p>Given <code>Sigs()</code>, the problem of efficiently finding the near duplicates is commonly called the <a href="http://bit.ly/JPTZ2I" rel="noreferrer">set similarity joins</a> problem. The <code>SpotSigs</code> paper outlines some heuristics to trim the number of sets a new set needs to be compared to, as does the <code>simhash</code> method.</p></li> </ol>
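    <p>The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names <code>normalize</code>, <code>md5_as_two_u64</code>, <code>ExactDuplicateFilter</code>, <code>sigs</code> and <code>jaccard</code> are hypothetical, the in-memory dict stands in for the database table, "letters" is assumed to mean ASCII a-z, and <code>sigs</code> is a simple stand-in (smallest hashes of word 3-grams), not the <code>SpotSigs</code> or <code>simhash</code> selection method.</p>

```python
import hashlib
import re
import struct

# Assumption: "letters" means ASCII a-z once the text is lowercased.
_NON_LETTERS = re.compile(r"[^a-z]+")


def normalize(text: str) -> str:
    """Lowercase, replace each run of non-letters with one space, trim the ends."""
    return _NON_LETTERS.sub(" ", text.lower()).strip()


def md5_as_two_u64(normalized: str) -> tuple[int, int]:
    """Return the 128-bit MD5 digest as two unsigned 64-bit integers."""
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    return struct.unpack(">QQ", digest)


class ExactDuplicateFilter:
    """In-memory stand-in for the table of seen hashes (an assumption)."""

    def __init__(self, max_entries: int = 1_000_000) -> None:
        self._seen: dict[tuple[int, int], None] = {}
        self._max_entries = max_entries

    def is_duplicate(self, text: str) -> bool:
        key = md5_as_two_u64(normalize(text))
        if key in self._seen:
            return True
        self._seen[key] = None
        if len(self._seen) > self._max_entries:
            # Age off the oldest hash; dicts preserve insertion order.
            del self._seen[next(iter(self._seen))]
        return False


def sigs(normalized: str, k: int = 3, count: int = 4) -> set[int]:
    """Hypothetical Sigs(): the `count` smallest 64-bit hashes of word k-grams."""
    words = normalized.split()
    grams = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    hashes = sorted(
        struct.unpack(">Q", hashlib.md5(g.encode("utf-8")).digest()[:8])[0]
        for g in grams
    )
    return set(hashes[:count])


def jaccard(a: set[int], b: set[int]) -> float:
    """Set similarity used to compare two signature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

    <p>With this sketch, <code>ExactDuplicateFilter().is_duplicate("Hello, World!")</code> returns <code>False</code> on the first call, and a later call on the same filter with <code>"hello   world"</code> returns <code>True</code>, because both normalize to <code>hello world</code>. Texts that pass the exact check would then be compared by their signature sets, for example with <code>jaccard(sigs(a), sigs(b))</code> against a threshold chosen for your data.</p>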
 
