Note that there are some explanatory texts on larger screens.

plurals
  1. POEfficient strings containing each other
    primarykey
    data
    text
    <p>I have two sets of strings (<code>A</code> and <code>B</code>), and I want to know all pairs of strings <code>a in A</code> and <code>b in B</code> where <code>a</code> is a substring of <code>b</code>.</p> <p>The first step of coding this was the following:</p> <pre><code>for a in A: for b in B: if a in b: print (a,b) </code></pre> <p>However, I wanted to know-- is there a more efficient way to do this with regular expressions (e.g. instead of checking <code>if a in b:</code>, check if the regexp <code>'.*' + a + '.*':</code> matches 'b'. I thought that maybe using something like this would let me cache the <a href="http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm" rel="nofollow">Knuth-Morris-Pratt</a> failure function for all <code>a</code>. Also, using a list comprehension for the inner <code>for b in B:</code> loop will likely give a pretty big speedup (and a nested list comprehension may be even better).</p> <p>I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of <code>A</code> and <code>B</code> sets, and I don't want it to run all week).</p> <p><strong>Do you know any tricks or have any generic advice to do this more quickly?</strong> Thanks a lot for any insight you can share!</p> <hr> <p><strong>Edit:</strong></p> <p>Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:</p> <pre><code> import collections prefix_table = collections.defaultdict(set) for k, b in enumerate(B): for i in xrange(len(prot_seq)-10): j = i+10+1 prefix_table[b[i:j]].add(k) for a in A: if len(a) &gt;= 10: for k in prefix_table[a[:10]]: # check if a is in b # (missing_edges is necessary, but not sufficient) if a in B[k]: print (a,b) else: for k in xrange(len(prots_and_seqs)): # a is too small to use the table; check if # a is in any b if a in B[k]: print (a, b) </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload