<p>A Hamming distance is only defined between two strings of equal length, and it compares them position by position, so order matters.</p>
<p>Since your documents almost certainly differ in length, and if word order does not matter to you, cosine similarity is a better fit (depending on your needs, better solutions may exist). :)</p>
<p>Here is a cosine similarity function for two arrays of words:</p>
<pre><code>function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // Index each token as an array key so membership tests can use isset().
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y; // tokens present in both arrays
        $b += $x;      // distinct tokens in A
        $c += $y;      // distinct tokens in B
    }

    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
</code></pre>
<p>It is fast (using <code>isset()</code> instead of <code>in_array()</code> makes a big difference on large arrays).</p>
<p>As you can see, the result does not take into account the "magnitude" of each word: a word counts once, no matter how often it occurs.</p>
<p>I use it to detect multi-posted messages made of "almost" copy-pasted text. It works well. :)</p>
<p><strong>The best link about string similarity metrics</strong>: <a href="http://www.dcs.shef.ac.uk/~sam/stringmetrics.html" rel="noreferrer">http://www.dcs.shef.ac.uk/~sam/stringmetrics.html</a></p>
<p>For further interesting reading:</p>
<p><a href="http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html" rel="noreferrer">http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html</a>
<a href="http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298" rel="noreferrer">http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298</a></p>
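<p>If the "magnitude" of each word does matter for your use case, one possible variant is to compare term-frequency vectors instead of simple presence flags. The sketch below is not part of the function above; <code>weightedCosineSimilarity</code> is a hypothetical name, and it assumes the tokens are strings (or integers), since it relies on <code>array_count_values()</code>.</p>

```php
<?php
// Hypothetical helper, not from the original answer: cosine similarity over
// term-frequency vectors, so repeated words carry more weight.
// Assumes every token is a string or an integer.
function weightedCosineSimilarity(array $tokensA, array $tokensB): float
{
    // Count how often each token occurs in each document.
    $freqA = array_count_values($tokensA);
    $freqB = array_count_values($tokensB);

    // Dot product: only tokens present in both documents contribute.
    $dot = 0;
    foreach ($freqA as $token => $count) {
        if (isset($freqB[$token])) {
            $dot += $count * $freqB[$token];
        }
    }

    // Squared Euclidean norms of the two frequency vectors.
    $normA = 0;
    foreach ($freqA as $count) {
        $normA += $count * $count;
    }
    $normB = 0;
    foreach ($freqB as $count) {
        $normB += $count * $count;
    }

    if ($normA === 0 || $normB === 0) {
        return 0.0; // at least one document is empty
    }

    return $dot / sqrt($normA * $normB);
}
```

<p>For example, <code>weightedCosineSimilarity(['spam', 'spam', 'spam', 'ham'], ['spam', 'ham', 'ham', 'ham'])</code> gives 6 / sqrt(10 * 10) = 0.6, whereas the presence-only function above scores the same input as 1 because both arrays contain the same two distinct words.</p>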
 
