Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see <a href="http://en.wikipedia.org/wiki/Suffix_tree" rel="nofollow noreferrer">http://en.wikipedia.org/wiki/Suffix_tree</a>) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.</p> <p>So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).</p> <p>The code below uses this suffix tree implementation: <a href="http://www.daimi.au.dk/~mailund/suffix_tree.html" rel="nofollow noreferrer">http://www.daimi.au.dk/~mailund/suffix_tree.html</a>. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.</p> <pre><code>from suffix_tree import GeneralisedSuffixTree baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' } def revc(seq): return "".join([baseComplement[base] for base in seq[::-1]]) data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA" # revc TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT # 012345678901234567890123456789012 # 1 2 3 minlength = 6 n = len(data) tree = GeneralisedSuffixTree([data, revc(data)]) for shared in tree.sharedSubstrings(minlength): #print shared _, start, stop = shared[0] seq = data[start:stop] _, rstart, rstop = shared[1] rseq = data[n-rstop:n-rstart] print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart) </code></pre> <p>This produces output</p> <pre class="lang-none prettyprint-override"><code>Match: AGGTCA at [23:29] and TGACCT at [10:16] Match: TGACCT at [10:16] and AGGTCA at [23:29] Match: CTGCAG at [19:25] and CTGCAG at [19:25] </code></pre> <p>It finds each match twice, once from each end. And there's a reverse palindrome in there, too!</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload