Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.</p> <p><strong>First, a few preliminary statements.</strong></p> <ol> <li><p>What we are building, is <em>basically</em> like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth</p></li> <li><p><strong>But</strong>: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers <code>[from,to]</code>. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).</p></li> </ol> <h2>Basic principle</h2> <p>I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:</p> <pre><code>abc </code></pre> <p>The algorithm <strong>works in steps, from left to right</strong>. There is <strong>one step for every character of the string</strong>. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).</p> <p>So, we start from the <em>left</em>, and first insert only the single character <code>a</code> by creating an edge from the root node (on the left) to a leaf, and labeling it as <code>[0,#]</code>, which means the edge represents the substring starting at position 0 and ending at <em>the current end</em>. I use the symbol <code>#</code> to mean <em>the current end</em>, which is at position 1 (right after <code>a</code>).</p> <p>So we have an initial tree, which looks like this:</p> <p><img src="https://i.stack.imgur.com/aOwIL.png" alt=""></p> <p>And what it means is this:</p> <p><img src="https://i.stack.imgur.com/SZH4k.png" alt=""></p> <p>Now we progress to position 2 (right after <code>b</code>). <strong>Our goal at each step</strong> is to insert <strong>all suffixes up to the current position</strong>. We do this by</p> <ul> <li>expanding the existing <code>a</code>-edge to <code>ab</code></li> <li>inserting one new edge for <code>b</code></li> </ul> <p>In our representation this looks like</p> <p><img src="https://i.stack.imgur.com/onmqt.png" alt="enter image description here"></p> <p>And what it means is:</p> <p><img src="https://i.stack.imgur.com/tchAx.png" alt=""></p> <p><strong>We observe</strong> two things:</p> <ul> <li>The edge representation for <code>ab</code> is <strong>the same</strong> as it used to be in the initial tree: <code>[0,#]</code>. Its meaning has automatically changed because we updated the current position <code>#</code> from 1 to 2.</li> <li>Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.</li> </ul> <p>Next we increment the position again and update the tree by appending a <code>c</code> to every existing edge and inserting one new edge for the new suffix <code>c</code>.</p> <p>In our representation this looks like</p> <p><img src="https://i.stack.imgur.com/wCEdI.png" alt=""></p> <p>And what it means is:</p> <p><img src="https://i.stack.imgur.com/UpUFw.png" alt=""></p> <p><strong>We observe:</strong></p> <ul> <li>The tree is the correct suffix tree <em>up to the current position</em> after each step</li> <li>There are as many steps as there are characters in the text</li> <li>The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing <code>#</code>, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.</li> </ul> <h2>First extension: Simple repetitions</h2> <p>Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:</p> <pre><code>abcabxabcd </code></pre> <p>It starts with <code>abc</code> as in the previous example, then <code>ab</code> is repeated and followed by <code>x</code>, and then <code>abc</code> is repeated followed by <code>d</code>.</p> <p><strong>Steps 1 through 3:</strong> After the first 3 steps we have the tree from the previous example:</p> <p><img src="https://i.stack.imgur.com/AclCh.png" alt=""></p> <p><strong>Step 4:</strong> We move <code>#</code> to position 4. This implicitly updates all existing edges to this:</p> <p><img src="https://i.stack.imgur.com/xhVMY.png" alt=""></p> <p>and we need to insert the final suffix of the current step, <code>a</code>, at the root.</p> <p>Before we do this, we introduce <strong>two more variables</strong> (in addition to <code>#</code>), which of course have been there all the time but we haven't used them so far:</p> <ul> <li>The <strong>active point</strong>, which is a triple <code>(active_node,active_edge,active_length)</code></li> <li>The <code>remainder</code>, which is an integer indicating how many new suffixes we need to insert</li> </ul> <p>The exact meaning of these two will become clear soon, but for now let's just say:</p> <ul> <li>In the simple <code>abc</code> example, the active point was always <code>(root,'\0x',0)</code>, i.e. <code>active_node</code> was the root node, <code>active_edge</code> was specified as the null character <code>'\0x'</code>, and <code>active_length</code> was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.</li> <li>The <code>remainder</code> was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).</li> </ul> <p>Now this is going to change. When we insert the current final character <code>a</code> at the root, we notice that there is already an outgoing edge starting with <code>a</code>, specifically: <code>abca</code>. Here is what we do in such a case:</p> <ul> <li>We <strong>do not</strong> insert a fresh edge <code>[4,#]</code> at the root node. Instead we simply notice that the suffix <code>a</code> is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.</li> <li>We <strong>set the active point</strong> to <code>(root,'a',1)</code>. That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with <code>a</code>, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character <code>a</code>. That suffices because there can be <em>only one</em> edge starting with any particular character (confirm that this is true after reading through the entire description).</li> <li>We also increment <code>remainder</code>, so at the beginning of the next step it will be 2.</li> </ul> <p><strong>Observation:</strong> When the final <strong>suffix we need to insert is found to exist in the tree already</strong>, the tree itself is <strong>not changed</strong> at all (we only update the active point and <code>remainder</code>). The tree is then not an accurate representation of the suffix tree <em>up to the current position</em> any more, but it <strong>contains</strong> all suffixes (because the final suffix <code>a</code> is contained <em>implicitly</em>). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was <strong>no work</strong> done in this step.</p> <p><strong>Step 5:</strong> We update the current position <code>#</code> to 5. This automatically updates the tree to this:</p> <p><img src="https://i.stack.imgur.com/XL6bg.png" alt=""></p> <p>And <strong>because <code>remainder</code> is 2</strong>, we need to insert two final suffixes of the current position: <code>ab</code> and <code>b</code>. This is basically because:</p> <ul> <li>The <code>a</code> suffix from the previous step has never been properly inserted. So it has <em>remained</em>, and since we have progressed one step, it has now grown from <code>a</code> to <code>ab</code>.</li> <li>And we need to insert the new final edge <code>b</code>.</li> </ul> <p>In practice this means that we go to the active point (which points to behind the <code>a</code> on what is now the <code>abcab</code> edge), and insert the current final character <code>b</code>. <strong>But:</strong> Again, it turns out that <code>b</code> is also already present on that same edge.</p> <p>So, again, we do not change the tree. We simply:</p> <ul> <li>Update the active point to <code>(root,'a',2)</code> (same node and edge as before, but now we point to behind the <code>b</code>)</li> <li>Increment the <code>remainder</code> to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.</li> </ul> <p>To be clear: We had to insert <code>ab</code> and <code>b</code> in the current step, but because <code>ab</code> was already found, we updated the active point and did not even attempt to insert <code>b</code>. Why? Because if <code>ab</code> is in the tree, <strong>every suffix</strong> of it (including <code>b</code>) must be in the tree, too. Perhaps only <em>implicitly</em>, but it must be there, because of the way we have built the tree so far.</p> <p>We proceed to <strong>step 6</strong> by incrementing <code>#</code>. The tree is automatically updated to:</p> <p><img src="https://i.stack.imgur.com/bLLT9.png" alt=""></p> <p>Because <strong><code>remainder</code> is 3</strong>, we have to insert <code>abx</code>, <code>bx</code> and <code>x</code>. The active point tells us where <code>ab</code> ends, so we only need to jump there and insert the <code>x</code>. Indeed, <code>x</code> is not there yet, so we split the <code>abcabx</code> edge and insert an internal node:</p> <p><img src="https://i.stack.imgur.com/6HYtR.png" alt=""></p> <p>The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.</p> <p>So we have dealt with <code>abx</code> and decrement <code>remainder</code> to 2. Now we need to insert the next remaining suffix, <code>bx</code>. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called <strong>Rule 1</strong> below, and it applies whenever the <code>active_node</code> is root (we will learn rule 3 for other cases further below). Here is rule 1:</p> <blockquote> <p>After an insertion from root,</p> <ul> <li><code>active_node</code> remains root</li> <li><code>active_edge</code> is set to the first character of the new suffix we need to insert, i.e. <code>b</code></li> <li><code>active_length</code> is reduced by 1</li> </ul> </blockquote> <p>Hence, the new active-point triple <code>(root,'b',1)</code> indicates that the next insert has to be made at the <code>bcabx</code> edge, behind 1 character, i.e. behind <code>b</code>. We can identify the insertion point in O(1) time and check whether <code>x</code> is already present or not. If it was present, we would end the current step and leave everything the way it is. But <code>x</code> is not present, so we insert it by splitting the edge:</p> <p><img src="https://i.stack.imgur.com/YVvbJ.png" alt=""></p> <p>Again, this took O(1) time and we update <code>remainder</code> to 1 and the active point to <code>(root,'x',0)</code> as rule 1 states.</p> <p>But there is one more thing we need to do. We'll call this <strong>Rule 2:</strong></p> <blockquote> <p>If we split an edge and insert a new node, and if that is <em>not the first node</em> created during the current step, we connect the previously inserted node and the new node through a special pointer, a <strong>suffix link</strong>. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:</p> </blockquote> <p><img src="https://i.stack.imgur.com/zL9yl.png" alt=""></p> <p>We still need to insert the final suffix of the current step, <code>x</code>. Since the <code>active_length</code> component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with <code>x</code>, we insert a new edge:</p> <p><img src="https://i.stack.imgur.com/992gV.png" alt=""></p> <p>As we can see, in the current step all remaining inserts were made.</p> <p>We proceed to <strong>step 7</strong> by setting <code>#</code>=7, which automatically appends the next character, <code>a</code>, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to <code>(root,'a',1)</code>.</p> <p>In <strong>step 8</strong>, <code>#</code>=8, we append <code>b</code>, and as seen before, this only means we update the active point to <code>(root,'a',2)</code> and increment <code>remainder</code> without doing anything else, because <code>b</code> is already present. <strong>However,</strong> we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to <code>(node1,'\0x',0)</code>. Here, I use <code>node1</code> to refer to the internal node the <code>ab</code> edge ends at.</p> <p>Then, in <strong>step <code>#</code>=9</strong>, we need to insert 'c' and this will help us to understand the final trick:</p> <h2>Second extension: Using suffix links</h2> <p>As always, the <code>#</code> update appends <code>c</code> automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to <code>(node1,'c',1)</code>, increment <code>remainder</code> and do nothing else.</p> <p>Now in <strong>step <code>#</code>=10</strong>, <code>remainder is 4</code>, and so we first need to insert <code>abcd</code> (which remains from 3 steps ago) by inserting <code>d</code> at the active point.</p> <p>Attempting to insert <code>d</code> at the active point causes an edge split in O(1) time:</p> <p><img src="https://i.stack.imgur.com/Rkdzd.png" alt=""></p> <p>The <code>active_node</code>, from which the split was initiated, is marked in red above. Here is the final rule, <strong>Rule 3:</strong></p> <blockquote> <p>After splitting an edge from an <code>active_node</code> that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the <code>active_node</code> to the node it points to. If there is no suffix link, we set the <code>active_node</code> to the root. <code>active_edge</code> and <code>active_length</code> remain unchanged.</p> </blockquote> <p>So the active point is now <code>(node2,'c',1)</code>, and <code>node2</code> is marked in red below:</p> <p><img src="https://i.stack.imgur.com/0IS5C.png" alt=""></p> <p>Since the insertion of <code>abcd</code> is complete, we decrement <code>remainder</code> to 3 and consider the next remaining suffix of the current step, <code>bcd</code>. Rule 3 has set the active point to just the right node and edge so inserting <code>bcd</code> can be done by simply inserting its final character <code>d</code> at the active point.</p> <p>Doing this causes another edge split, and <strong>because of rule 2</strong>, we must create a suffix link from the previously inserted node to the new one:</p> <p><img src="https://i.stack.imgur.com/DNVQO.png" alt=""></p> <p><strong>We observe:</strong> Suffix links enable us to reset the active point so we can make the next <em>remaining insert</em> at O(1) effort. Look at the graph above to confirm that indeed node at label <code>ab</code> is linked to the node at <code>b</code> (its suffix), and the node at <code>abc</code> is linked to <code>bc</code>.</p> <p>The current step is not finished yet. <code>remainder</code> is now 2, and we need to follow rule 3 to reset the active point again. Since the current <code>active_node</code> (red above) has no suffix link, we reset to root. The active point is now <code>(root,'c',1)</code>.</p> <p>Hence the next insert occurs at the one outgoing edge of the root node whose label starts with <code>c</code>: <code>cabxabcd</code>, behind the first character, i.e. behind <code>c</code>. This causes another split:</p> <p><img src="https://i.stack.imgur.com/wZ7Bj.png" alt=""></p> <p>And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:</p> <p><img src="https://i.stack.imgur.com/urgol.png" alt=""></p> <p>(I am using <a href="http://www.graphviz.org/" rel="noreferrer">Graphviz Dot</a> for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)</p> <p>With this, <code>remainder</code> can be set to 1 and since the <code>active_node</code> is root, we use rule 1 to update the active point to <code>(root,'d',0)</code>. This means the final insert of the current step is to insert a single <code>d</code> at root:</p> <p><img src="https://i.stack.imgur.com/TPxLe.png" alt=""></p> <p>That was the final step and we are done. There are number of <strong>final observations</strong>, though:</p> <ul> <li><p>In each step we move <code>#</code> forward by 1 position. This automatically updates all leaf nodes in O(1) time.</p></li> <li><p>But it does not deal with a) any suffixes <em>remaining</em> from previous steps, and b) with the one final character of the current step.</p></li> <li><p><code>remainder</code> tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position <code>#</code>. We consider one after the other and make the insert. <strong>Important:</strong> Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are <em>contained implicitly</em> (otherwise the active point would not be where it is).</p></li> <li><p>After each such insert, we decrement <code>remainder</code> and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.</p></li> <li><p>If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if <code>remainder</code>>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all <em>implicit</em> in the current tree. The fact that <code>remainder</code>>0 makes sure we deal with the remaining suffixes later.</p></li> <li><p>What if at the end of the algorithm <code>remainder</code>>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign <code>$</code> is used as a symbol for that. <strong>Why does that matter?</strong> --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they <em>end at a leaf</em>. Otherwise we would get a lot of spurious matches, because there are <em>many</em> strings <em>implicitly</em> contained in the tree that are not actual suffixes of the main string. Forcing <code>remainder</code> to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. <strong>However,</strong> if we want to use the tree to search for <em>general substrings</em>, not only <em>suffixes</em> of the main string, this final step is indeed not required, as suggested by the OP's comment below.</p></li> <li><p>So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make <code>remainder</code> inserts, each taking O(1) time. Since <code>remainder</code> indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).</p></li> <li><p>However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its <code>active_length</code> component does not work well with the new <code>active_node</code>. For example, consider a situation like this:</p></li> </ul> <p><img src="https://i.stack.imgur.com/7t0dg.png" alt=""></p> <p>(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)</p> <p>Now let the active point be <code>(red,'d',3)</code>, so it points to the place behind the <code>f</code> on the <code>defg</code> edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is <code>(green,'d',3)</code>. However, the <code>d</code>-edge going out of the green node is <code>de</code>, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to <code>(blue,'f',1)</code>.</p> <p>In a particularly bad case, the <code>active_length</code> could be as large as <code>remainder</code>, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n<sup>2</sup>) complexity, because in each step <code>remainder</code> is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?</p> <p>No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and <code>active_length</code> will be reduced. As we follow down the chain of suffix links we make the remaining inserts, <code>active_length</code> can only decrease, and the number of active-point adjustments we can make on the way can't be larger than <code>active_length</code> at any given time. Since <code>active_length</code> can never be larger than <code>remainder</code>, and <code>remainder</code> is O(n) not only in every single step, but the total sum of increments ever made to <code>remainder</code> over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload