StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>What we're calling mismatches is otherwise known as <a href="http://en.wikipedia.org/wiki/Hamming_distance" rel="nofollow" title="Hamming Distance">Hamming Distance</a> which is just a count of how many characters do not match between the strings (allowing only for substitution - not insertion or deletion). </p> <p>Therefore, Mladen's code to count that can be used in the find_substring function to determine if the string is within the number of allowed mismatches. </p> <p>Then, if it is, you can return it (or add it to a list of matches if you want to track them all). After that check, you do the test to set high or low depending on if it is greater than or less than the comparison. </p> <p>Here is how I changed the code:</p> <pre><code>def find_substring(the_substring, n_mismatches) #uses typical binary search high = @suffix_array.length - 1 low = 0 while(low <= high) mid = (high + low) / 2 this_suffix = @suffix_array[mid][:suffix] compare_len = the_substring.length-1 comparison = this_suffix[0..compare_len] if n_mismatches == 0 within_n_mismatches = comparison == the_substring else within_n_mismatches = hamming_distance(the_substring, comparison) <= n_mismatches end return @suffix_array[mid][:position] if within_n_mismatches if comparison > the_substring high = mid - 1 else low = mid + 1 end end return nil end def hamming_distance(a, b) # from Mladen Jablanović's answer at http://stackoverflow.com/questions/5322428/finding-a-substring-while-allowing-for-mismatches-with-ruby a.chars.zip(b.chars).count{|ca, cb| ca != cb} end </code></pre> <p>It will add some processing time - linear with respect to the size of the substring, but I think that may come out as not that much given the size of the rest of the data. I haven't really thought through that part as much as I might, but maybe you want to test it versus another approach: making the changes to your input string and searching it multiple times.</p> <p>For instance, if you are working with DNA if your substring was "GAC" you would search for that, plus "AAC" and "CAC" and "TAC" (and then the combinations for the nucleotides in the 2nd and 3rd positions. The number of possibilities should keep it small enough to fit in memory. </p> <p>The opposite - storing all the mismatch possibilities in the suffix array - would not be true. Since it's probably already large, multiplying it by itself a few times is going to make it too large to fit in memory pretty quickly.</p> <p>I've used that approach before - not quite with a suffix array, but just storing the mismatches in general. </p> <p>In addition to the code above, I also modified it a bit to add a function for getting all matches. I posted that to <a href="https://github.com/codeodor/miscellany/blob/master/ruby_suffix_array/suffix_array.rb" rel="nofollow">one of my repositories at github</a>.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload