Note that there are some explanatory texts on larger screens.

plurals
  1. POFinding groups of similar strings in a large set of strings
    text
    copied!<p>I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.</p> <p>As an example let's say the input list is on the left below, and the output groups are on the right.</p> <pre><code>Input Output ----------------- ----------------- Jane Doe Mr Philip Roberts Mr Philip Roberts Phil Roberts Foo McBar Philip Roberts David Jones Phil Roberts Foo McBar Davey Jones =&gt; John Smith David Jones Philip Roberts Dave Jones Dave Jones Davey Jones Jonny Smith Jane Doe John Smith Jonny Smith </code></pre> <p>Does anybody know of any ways to solve this reasonably efficiently? </p> <p>The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.</p> <p>An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists</p> <p>Does anybody have any thoughts/pointers?</p> <hr> <p>UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload