Note that there are some explanatory texts on larger screens.

plurals
  1. POLongest repeated (k times) substring
    text
    copied!<p>I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.</p> <p>This is for the <a href="http://rosalind.info/problems/lrep/" rel="nofollow noreferrer">Rosalind project problem LREP</a>. I'm trying to find the longest k-peated substring in a string and I've been <strong>provided</strong> the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with <code>&gt;=k</code> descendants, and finally find the deepest of those nodes. Theory-wise I'm set.</p> <p>I've gotten a lot of help from the following resources (oops, I can only post 2):</p> <ul> <li><a href="https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string">Find longest repetitive sequence in a string</a></li> <li><a href="http://en.literateprograms.org/Depth-first_search_%28Python%29" rel="nofollow noreferrer">Depth-first search (Python)</a></li> </ul> <p>I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.</p> <p>Here's some sample data (first line: <code>sequence</code>, second line: <code>k</code>, suffix table format: <code>parent child location length</code>):</p> <pre><code>CATACATAC$ 2 1 2 1 1 1 7 2 1 1 14 3 3 1 17 10 1 2 3 2 4 2 6 10 1 3 4 6 5 3 5 10 1 7 8 3 3 7 11 5 1 8 9 6 5 8 10 10 1 11 12 6 5 11 13 10 1 14 15 6 5 14 16 10 1 </code></pre> <p>The output from this should be <code>CATAC</code>.</p> <p>With the following code (modified from <a href="http://en.literateprograms.org/Depth-first_search_%28Python%29" rel="nofollow noreferrer">LiteratePrograms</a>) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.</p> <pre><code>#authors listed at #http://en.literateprograms.org/Depth-first_search_(Python)?action=history&amp;offset=20081013235803 class Vertex: def __init__(self, data): self.data = data self.successors = [] def depthFirstSearch(start, isGoal, result): if start in result: return False result.append(start) if isGoal(start): return True for v in start.successors: if depthFirstSearch(v, isGoal, result): return True # No path was found result.pop() return False def lrep(seq,reps,tree): n = 2 * len(seq) - 1 v = [Vertex(i) for i in xrange(n)] edges = [(int(x[0]),int(x[1])) for x in tree] for a, b in edges: v[a].successors.append(v[b]) paths = {} for x in v: result = [] paths[x.data] = [] if depthFirstSearch(v[1], (lambda v: v.data == x.data), result): path = [u.data for u in result] paths[x.data] = path </code></pre> <p>What I'd like to do is pre-process the tree to find nodes which satisfy the <code>descendants &gt;= k</code> requirement prior to finding the depth. I haven't even gotten to how I'm going to calculate depth yet. Though I imagine I'll have some dictionary to keeps track of the depths of each node in the path then sums.</p> <p>So, my first-most-important question is: <em>"How do I preprocess the tree with descendant leaves?"</em></p> <p>My second-less-important question is: <em>"After that, how can I quickly compute depth?"</em></p> <p>P.S. I should state that this <strong>isn't</strong> homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload