<h3>N-grams</h3>
<p>Use <a href="http://en.wikipedia.org/wiki/N-gram#n-grams_for_approximate_matching" rel="noreferrer">N-grams</a>, which support <strong>multiple-character transpositions across the whole text</strong>.</p>
<p>The general idea is that you split the two strings in question into all possible 2–3 character substrings (n-grams) and treat the number of n-grams the two strings share as their similarity metric. This can then be normalized by dividing the shared count by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.</p>
<p>For the example sentences:</p>
<pre><code>A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu
</code></pre>
<p>A and B share <strong>18</strong> <em>2-grams</em>,</p>
<p>A and C share only <strong>8</strong> <em>2-grams</em>,</p>
<p>out of <strong>20</strong> total possible.</p>
<p>This has been discussed in more detail in the <a href="http://www1.cs.columbia.edu/~pirot/publications/deb-dec2001.pdf" rel="noreferrer">Gravano et al. paper</a>.</p>
<h3>tf-idf and cosine similarity</h3>
<p>A less trivial alternative, grounded in information theory, is to use <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" rel="noreferrer">term frequency–inverse document frequency (tf-idf)</a> to weigh the tokens, construct sentence vectors, and then use <a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noreferrer">cosine similarity</a> as the similarity metric.</p>
<p>The algorithm is:</p>
<ol>
<li>Calculate 2-character token frequencies (tf) per sentence.</li>
<li>Calculate inverse sentence frequencies (idf): the logarithm of the number of sentences in the corpus (here 3) divided by the number of sentences in which a particular token appears. In this case <em>th</em> is in all sentences, so it has zero information content (log(3/3) = 0).
<img src="https://i.stack.imgur.com/Y6qe7.png" alt="idf formula"> </li> <li>Produce the tf-idf matrix by multiplying the corresponding cells of the tf and idf tables. <img src="https://i.stack.imgur.com/yfEzMm.png" alt="tfidf"></li> <li>Finally, calculate the cosine similarity matrix for all sentence pairs, where A and B are the weights from the tf-idf table for the corresponding tokens. The range is from 0 (not similar) to 1 (equal).<br> <img src="https://i.stack.imgur.com/QiVH6m.png" alt="cosine similarity"><br> <img src="https://i.stack.imgur.com/EL8iMm.png" alt="similarity matrix"></li> </ol> <h3>Levenshtein modifications and Metaphone</h3> <p>Regarding the other answers: the <a href="http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance" rel="noreferrer">Damerau–Levenshtein</a> modification supports only the transposition of <strong>two adjacent</strong> characters. <a href="http://en.wikipedia.org/wiki/Metaphone" rel="noreferrer">Metaphone</a> was designed to match words that sound the same, not for similarity matching.</p>
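<p>The n-gram approach above can be sketched in a few lines of Python. One assumption here: the 2-grams are taken over the whole sentence (spaces included) with a <code>_</code> padding character at both ends, which is one common convention and happens to reproduce the 18/20 and 8/20 counts from the example.</p>

```python
from collections import Counter

def ngrams(text, n=2, pad="_"):
    """Multiset of character n-grams, with start/end padding."""
    padded = pad + text + pad
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the longer string."""
    ca, cb = ngrams(a, n), ngrams(b, n)
    shared = sum((ca & cb).values())  # multiset intersection
    return shared / max(sum(ca.values()), sum(cb.values()))

A = "The quick brown fox"
B = "brown quick The fox"
C = "The quiet swine flu"
print(ngram_similarity(A, B))  # 18 shared out of 20 -> 0.9
print(ngram_similarity(A, C))  # 8 shared out of 20 -> 0.4
```

Note that <code>Counter</code>'s <code>&amp;</code> operator takes the minimum count per n-gram, so repeated n-grams are only matched as many times as they occur in both strings.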
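<p>The four tf-idf steps can likewise be sketched in plain Python. Two assumptions that are not spelled out above: sentences are lower-cased before tokenization (so that <em>th</em> really does appear in all three sentences), and no padding is used.</p>

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Steps 1-3: per-sentence 2-gram term frequencies, idf = log(N / df),
    and their element-wise product as sparse vectors (dicts)."""
    grams = [Counter(s[i:i + 2] for i in range(len(s) - 1))
             for s in (x.lower() for x in sentences)]
    n = len(sentences)
    idf = {t: math.log(n / sum(1 for g in grams if t in g))
           for t in set().union(*grams)}
    return [{t: g[t] * idf[t] for t in g} for g in grams]

def cosine(u, v):
    """Step 4: cosine similarity between two sparse tf-idf vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(["The quick brown fox",
                      "brown quick The fox",
                      "The quiet swine flu"])
# The word-shuffled sentence B scores far closer to A than C does:
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Tokens like <em>th</em> that occur in every sentence get idf 0 and therefore contribute nothing to the dot product, which is exactly the "zero information content" point made in step 2.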