Note that there are some explanatory texts on larger screens.

plurals
  1. POFast algorithm for searching for substrings in a string
    primarykey
    data
    text
    <p>I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string. </p> <p>What I would like to do is:</p> <p>Given an input string - <strong>INSTR</strong>:</p> <blockquote> <p>"BCDEFGH"</p> </blockquote> <p>And a set of candidate strings - <strong>CAND</strong>:</p> <blockquote> <p>"AB", "CDE", "FG", "H", "IJ"</p> </blockquote> <p><em>Find any <strong>CAND</strong> strings that match as substrings within <strong>INSTR</em></strong></p> <p>In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")</p> <p>There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.</p> <p>I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.</p> <p>Additionally, all the strings in CAND and INSTR will all be relatively small (&lt; 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.</p> <hr> <p><strong>Update</strong> I should have mentioned, the set of CAND strings is invariant across all values of INSTR. </p> <p><strong>Update</strong> I only need to know that there was a match - and i don't need to know what the match was.</p> <p><strong>Final Update</strong> I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used <a href="http://hkn.eecs.berkeley.edu/~dyoo/java/index.html" rel="noreferrer">this</a></p> <p>In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were: </p> <p><strong>AhoCorsick</strong>: 1</p> <p><strong>RabinKarp</strong>: 1.8</p> <p><strong>Naive Search</strong> (check each pattern &amp; use string.contains): 50</p> <hr> <p>*Some resources describing the algos mentioned in the answers below:</p> <p><a href="http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html" rel="noreferrer">http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html</a></p> <p><a href="http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf" rel="noreferrer">http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf</a></p> <p><a href="http://www-igm.univ-mlv.fr/~lecroq/string/index.html" rel="noreferrer">http://www-igm.univ-mlv.fr/~lecroq/string/index.html</a>*</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload