Fast fuzzy/approximate search in dictionary of strings in Ruby
I have a dictionary of 50K to 100K strings (each can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary within some "edit" distance tolerance (Levenshtein, for example). I am fine with pre-computing any type of data structure before doing the search.

**My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would be fine with just getting a boolean that says whether a given string is in the dictionary or not, if there were a significantly faster algorithm to do so.**

For this, I first tried to compute all the Levenshtein distances and take the minimum, and it was obviously horribly slow. So I tried to implement a Levenshtein trie based on this article: http://stevehanov.ca/blog/index.php?id=114

See my gist here for reproducing the benchmark: https://gist.github.com/nicolasmeunier/7493947

Here are a few benchmarks I got on my machine:

**Edit distance of 0 (perfect match)**

```
Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801>
```

**Edit distance of 2, it becomes a LOT slower**

```
Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>
```

And it goes downhill from there, becoming extremely slow for edit distances larger than 2 (1+ second on average per tested string).

I would like to know how/if I could speed this up significantly. If there are existing solutions already implemented in Ruby/gems, I also don't want to reinvent the wheel...

EDIT 1: In my case, I expect most of the strings I am matching against the dictionary NOT to be in there. So if there is any algorithm that can quickly discard a string, that could really help.

Thanks,
Nicolas
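For reference, here is a minimal Ruby sketch of the approach described in the Steve Hanov article (the `Trie`/`TrieNode` class and method names are illustrative, not the actual code from the gist): the query's Levenshtein matrix is built one row per trie node, and a whole branch is abandoned as soon as no cell in its row can still lead to a match within the tolerance.

```ruby
# Trie-driven Levenshtein search: each trie node computes one row of the
# dynamic-programming matrix for the query, and a subtree is pruned as soon
# as every cell in that row exceeds max_cost.
class TrieNode
  attr_accessor :word
  attr_reader :children

  def initialize
    @word = nil
    @children = {}
  end
end

class Trie
  def initialize(words = [])
    @root = TrieNode.new
    words.each { |w| insert(w) }
  end

  def insert(word)
    node = word.each_char.inject(@root) { |n, ch| n.children[ch] ||= TrieNode.new }
    node.word = word
  end

  # Returns [word, distance] pairs for every dictionary word within
  # max_cost edits of the query.
  def search(query, max_cost)
    first_row = (0..query.length).to_a
    results = []
    @root.children.each do |letter, child|
      search_recursive(child, letter, query, first_row, results, max_cost)
    end
    results
  end

  private

  def search_recursive(node, letter, query, previous_row, results, max_cost)
    current_row = [previous_row[0] + 1]

    # Fill one row of the Levenshtein matrix for this trie node.
    (1..query.length).each do |col|
      insert_cost  = current_row[col - 1] + 1
      delete_cost  = previous_row[col] + 1
      replace_cost = previous_row[col - 1] + (query[col - 1] == letter ? 0 : 1)
      current_row << [insert_cost, delete_cost, replace_cost].min
    end

    # A dictionary word ends at this node and is within tolerance.
    results << [node.word, current_row.last] if node.word && current_row.last <= max_cost

    # Prune the whole subtree if no cell can still lead to a match.
    return if current_row.min > max_cost

    node.children.each do |next_letter, child|
      search_recursive(child, next_letter, query, current_row, results, max_cost)
    end
  end
end

# Example:
#   trie = Trie.new(%w[kitten sitting mitten])
#   trie.search("fitten", 1)  # => [["kitten", 1], ["mitten", 1]]
```

The `current_row.min > max_cost` check is the only thing that prunes the trie, and it fires less and less often as the tolerance grows, which is consistent with the slowdown from edit distance 0 to 2 in the benchmarks above.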
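On the EDIT 1 point, one common cheap reject test (shown here only as a generic illustration with made-up helper names, not something taken from the gist) uses the fact that the Levenshtein distance between two strings is at least the difference of their lengths, so a query only needs to be compared against words whose length is within the tolerance:

```ruby
# |len(a) - len(b)| is a lower bound on levenshtein(a, b), so only nearby
# length buckets can possibly contain a match within max_cost edits.
def build_length_buckets(words)
  words.group_by(&:length)
end

def candidates_within(buckets, query, max_cost)
  range = (query.length - max_cost)..(query.length + max_cost)
  range.flat_map { |len| buckets[len] || [] }
end
```

The buckets are built once; each query then touches only the 2 * max_cost + 1 buckets whose lengths are within the tolerance, before any distance computation runs.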