<p>Hamming distance is defined only between two strings of equal length, and it takes character order into account.</p>
<p>As your documents are almost certainly of different lengths, and if word positions do not matter, cosine similarity is a better fit (note that, depending on your needs, better solutions exist). :)</p>
<p>Here is a cosine similarity function for two arrays of words:</p>
<pre><code>function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    // Every distinct token that appears in either document.
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // Use the tokens as array keys so membership can be tested with isset().
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0; // token present in A?
        $y = isset($uniqueTokensB[$token]) ? 1 : 0; // token present in B?
        $a += $x * $y; // dot product: tokens shared by A and B
        $b += $x;      // number of distinct tokens in A
        $c += $y;      // number of distinct tokens in B
    }

    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
</code></pre>
<p>It is fast: testing array keys with <code>isset()</code> instead of searching values with <code>in_array()</code> makes a huge difference on large arrays.</p>
<p>As you can see, the result does not take into account the "magnitude" of each word: a token counts only as present or absent, however many times it occurs.</p>
<p>I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)</p>
<p><strong>The best link about string similarity metrics</strong>: <a href="http://www.dcs.shef.ac.uk/~sam/stringmetrics.html" rel="noreferrer">http://www.dcs.shef.ac.uk/~sam/stringmetrics.html</a></p>
<p>Further interesting reading:</p>
<p><a href="http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html" rel="noreferrer">http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html</a> <a href="http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298" rel="noreferrer">http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298</a></p>
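<p>If the number of occurrences of each word should count, one possible variant is sketched below. It is not part of the function above: the name <code>weightedCosineSimilarity</code> and the use of <code>str_word_count()</code> as a tokenizer are assumptions made for illustration, and a real tokenizer may need to handle punctuation and non-ASCII text differently.</p>

```php
<?php
// Hypothetical variant (assumption, not the original function):
// cosine similarity over term-frequency vectors instead of 0/1 vectors.
function weightedCosineSimilarity(array $tokensA, array $tokensB): float
{
    // Map each token to the number of times it occurs.
    $countsA = array_count_values($tokensA);
    $countsB = array_count_values($tokensB);

    // Dot product: only tokens present in both documents contribute.
    $dot = 0;
    foreach ($countsA as $token => $count) {
        if (isset($countsB[$token])) {
            $dot += $count * $countsB[$token];
        }
    }

    // Squared Euclidean norms of both frequency vectors.
    $normA = 0;
    foreach ($countsA as $count) {
        $normA += $count * $count;
    }
    $normB = 0;
    foreach ($countsB as $count) {
        $normB += $count * $count;
    }

    // An empty document has no direction, so define its similarity as 0.
    return ($normA > 0 && $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;
}

// Usage: a naive tokenizer (assumption) that lowercases and splits into words.
$a = str_word_count(strtolower('the cat sat on the mat'), 1);
$b = str_word_count(strtolower('the cat lay on the mat'), 1);
echo weightedCosineSimilarity($a, $b), "\n"; // prints 0.875
```

<p>Here "the" occurs twice in each sentence, so it contributes 4 to the dot product rather than 1, which is the difference from the presence-only version.</p>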