StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POUkkonen's suffix tree algorithm in plain English
text
Body
copied!<p>I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is <em><a href="http://marknelson.us/1996/08/01/suffix-trees/" rel="noreferrer">Fast String Searching With Suffix Trees</a></em>, but he glosses over various points and some aspects of the algorithm remain unclear.</p> <p>A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.</p> <p>For reference, here's Ukkonen's paper on the algorithm: <a href="http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf" rel="noreferrer">http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf</a></p> <p>My basic understanding, so far:</p> <ul> <li>I need to iterate through each prefix P of a given string T</li> <li>I need to iterate through each suffix S in prefix P and add that to tree</li> <li>To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.</li> </ul> <p>The basic algorithm appears to be O(n<sup>2</sup>), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think <em>that</em> is what I'm having trouble understanding.</p> <p>I'm also having trouble understanding:</p> <ul> <li>exactly when and how the "active point" is assigned, used and changed</li> <li>what is going on with the canonization aspect of the algorithm</li> <li>Why the implementations I've seen need to "fix" bounding variables that they are using</li> </ul> <hr> <p>Here is the completed <strong>C#</strong> source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:</p> <blockquote> <p><strong><a href="https://gist.github.com/2373868" rel="noreferrer">https://gist.github.com/2373868</a></strong></p> </blockquote> <hr> <p><strong>Update 2017-11-04</strong></p> <p>After many years I've found a new use for suffix trees, and have implemented the algorithm in <strong>JavaScript</strong>. Gist is below. It should be bug-free. Dump it into a js file, <code>npm install chalk</code> from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.</p> <blockquote> <p><strong><a href="https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6" rel="noreferrer">https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6</a></strong></p> </blockquote>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload