Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <h3>The Big Picture</h3> <p>We will first look at this regex from the overall big picture algorithm, and then take a closer look at the specific implementation details later. The regex is an almost direct translation of the following Java code:</p> <pre><code>static boolean isPalindrome(String s) { if (s.isEmpty()) { return true; } String g2 = null; for (char ch : s.toCharArray()) { String g1 = String.valueOf(ch); // "add" if (g2 != null &amp;&amp; s.endsWith(g1 + g2)) { g2 = g1 + g2; } else if (s.endsWith(g1)) { g2 = g1; } else { break; } } return s.equals(g2); // "chk" } </code></pre> <p>This is obviously not the most straightforward/efficient Java code to check for palindromes, but it works, and most fascinatingly, it's almost directly translatable to regex with a near one-to-one mapping. Here's the regex again, reproduced here for convenience, annotated to highlight the striking resemblance:</p> <pre><code>// isEmpty _for-loop_ // ↓ / \ "(?x) | (?:(.) add)+ chk" // \_/ ↑ // g1 loop body ___g2___ // / \ .replace("add", assertEntirety(".*? (\\1 \\2?)")) .replace("chk", assertEntirety("\\2")); // s.equals(g2) </code></pre> <p><strong><em>Attachment</em></strong>: <a href="http://ideone.com/53JYq" rel="nofollow noreferrer">annotated and expanded version of the source code on ideone.com</a></p> <p>(Feel free to ignore the details of <code>assertEntirety</code> for now: just think of it as a black box regex mechanism that can make an assertion on <em>the entire string</em> regardless of where we currently are.)</p> <p>So the basic algorithm is that we try to build a suffix, subject to a palindromic constraint, as we scan the string left to right. We then check if we're able to build the complete string in this manner. If we can, then the string is a palindrome. Also, as a special case, the empty string is trivially a palindrome.</p> <p>Once the big picture algorithm is understood, we can examine how the regex pattern implements it.</p> <hr> <h3>What's with all the <code>String.replace</code>?</h3> <p>Regex patterns in Java are ultimately nothing but strings, meaning they can be derived through string manipulations the way any string can. Yes, we can even use regex to generate a regex pattern -- a sort of meta-regexing approach, if you will.</p> <p>Consider this example of initializing an <code>int</code> constant (which ultimately contains nothing but a number):</p> <pre><code>final int X = 604800; final int Y = 60 * 60 * 24 * 7; // now X == Y </code></pre> <p>The number assigned to <code>X</code> is a literal integer: we can <em>clearly see</em> what the number is. This is not the case with <code>Y</code> which uses an expression instead, and yet this formula seems to convey an <em>idea</em> of what this number represents. Even without proper naming of these constants, we nonetheless get the idea that <code>Y</code> probably represents the number of seconds in a week, even if we may not immediately know what the numeric value is. On the other hand, with <code>X</code> we know precisely that number, but we get less of an idea of what it represents.</p> <p>The use of string replacements in the snippet is an analogous situation but for strings regex patterns. Instead of explicitly writing the pattern as one literal string, sometimes <em>systematic and logical derivation</em> (the "formula") of that value from simpler parts can be much more meaningful. This is especially true with regex, where it often matters more that we understand what the pattern does than being able to see what it looks like as a string literal (which isn't much of a looker anyway, what with all the extra backslashes).</p> <p>A portion of the snippet is reproduced here again for convenience:</p> <pre><code>// the "formula" final String PALINDROME = "(?x) | (?:(.) add)+ chk" .replace("add", assertEntirety(".*? (\\1 \\2?)")) .replace("chk", assertEntirety("\\2")); // the "value" System.out.println(PALINDROME); // ____add_____ chk_ // _______/ \____ _______/ \_____ // (?x) | (?:(.) (?&lt;=(?=^.*? (\1 \2?)$).*))+ (?&lt;=(?=^\2$).*) // | \_/ \______/ | // | 1 2 | // |_______________________________| </code></pre> <p>Without a doubt the "formula" is a lot more readable than the eventual string "value" in this case.</p> <p>There are certainly much more sophisticated ways to programmatically generate a regex pattern, and it's certainly possible to write in such a way that obfuscates instead of accentuates its meaning, but mindful usage of even simple string replacements can still do wonder (as hopefully shown in this example).</p> <p><strong><em>Lesson</em></strong>: Consider programmatic generation of regex patterns.</p> <hr> <h3>How does <code>add</code> work?</h3> <p>The <code>(?:(.) add)+</code> construct, where <code>add</code> is an assertion that does some sort of "counting", has already been thoroughly discussed in the previous two parts. Two features are worth noting, though:</p> <ul> <li>The <code>(.)</code> captures into group 1, allowing backreference later on</li> <li>The assertion is <code>assertEntirety</code> instead of just looking ahead from our current position <ul> <li>We'll discuss this in more detail later; just think of it as a way to assert on the entire string</li> </ul></li> </ul> <p>The pattern applied to <code>assertEntirety</code> in <code>add</code> is the following:</p> <pre><code># prefix _suffix_ # ↓ / \ .*? ( \1 \2? ) # \________/ i.e. a reluctant "whatever" prefix (as short as possible) # group 2 followed by a suffix captured into group 2 </code></pre> <p>Note that group 2 is self-referencing with an optional specifier, a technique already discussed in part 2 of the series. Needless to say group 2 is our "counter" in this pattern: it's a suffix that we will try to grow leftward on every iteration of the "loop". As we iterate on each <code>(.)</code> left to right, we try to prepend that same character (using backreference to <code>\1</code>) to our suffix.</p> <p>Recall again the Java code translation of the above pattern, reproduced here for convenience:</p> <pre><code>if (g2 != null &amp;&amp; s.endsWith(g1 + g2)) { // \2? is greedy, we try this first g2 = g1 + g2; } else if (s.endsWith(g1)) { // since \2? is optional, we may also try this g2 = g1; } else { // if there's no matching suffix, we "break" out of the "loop" break; } </code></pre> <p>The fact that <code>\2?</code> is optional means a few things:</p> <ul> <li>It provides a "base case" for the self-reference (the main reason we do this!)</li> <li>Since <code>\2?</code> is part of the suffix pattern (and thus appears later in the overall pattern), the prefix part must be reluctant, hence <code>.*?</code> instead of <code>.*</code>. This allows <code>\2?</code> to exercise its greediness.</li> <li>The "counter" may also "reset" and give the "wrong" result <ul> <li>In part 2 we showed how backtracking <code>?</code> may result in the same kind of problematic resetting <ul> <li>We solved the problem by using possessive quantifier <code>?+</code>, but this is not applicable here</li> </ul></li> </ul></li> </ul> <p>The third point is elaborated further in the next section.</p> <p><strong><em>Lesson</em></strong>: Carefully analyze the interactions between greedy/reluctant repetitions in parts of a pattern.</p> <h3>Related questions</h3> <ul> <li><a href="https://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532">Difference between <code>.*?</code> and <code>.*</code> for regex</a></li> <li><a href="https://stackoverflow.com/questions/2565924/regular-expression-whos-greedier">Regular expression: who's greedier?</a></li> </ul> <hr> <h3>Why do we need a <code>chk</code> phase?</h3> <p>As alluded in the previous section, the optional and backtrackable <code>\2?</code> means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:</p> <pre><code> x y x y z y x ↑ # Initial state, \2 is "uninitialized" _ (x)y x y z y x ↑ # \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized") # but it could match \1 so it captured x ___ x(y)x y z y x ↑ # \1 captured y, \2 matched \1\2 and grew to capture yx _ x y(x)y z y x ↑ # \1 captured x, \2 couldn't match \1\2, # but it could match \1 so it shrunk to capture x (!!!) ___ x y x(y)z y x ↑ # \1 captured y, \2 matched \1\2 and grew to capture yx _____ x y x y(z)y x ↑ # \1 captured z, \2 matched \1\2 and grew to capture zyx _______ x y x y z(y)x ↑ # \1 captured y, \2 matched \1\2 and grew to capture yzyx _________ x y x y z y(x) ↑ # \1 captured x, \2 matched \1\2 and grew to capture xyzyx </code></pre> <p>We can modify our pattern (and the corresponding Java code) to omit the <code>chk</code> phase, and see that indeed this is what happens:</p> <pre><code> // modified pattern without a chk phase; yields false positives! final String PALINDROME_BROKEN = "(?x) | (?:(.) add)+" .replace("add", assertEntirety(".*? (\\1 \\2?)")); String s = "xyxyzyx"; // NOT a palindrome!!! Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s); if (m.matches()) { System.out.println(m.group(2)); // prints "xyzyx" } </code></pre> <p>As explained, <code>"xyxyzyx"</code>, which is <em>NOT</em> a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The <code>chk</code> phase (which is an <code>assertEntirety</code> of the pattern <code>\2</code>) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome.</p> <p><strong><em>Lesson</em></strong>: Carefully analyze any possibly unintended side-effects of optional self-reference matching.</p> <hr> <h3>The Main Course: <code>assertEntirety</code></h3> <p>While it's neat that we can write a Java regex pattern to detect palindromes, everything here except <code>assertEntirety</code> has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".</p> <p>The <code>assertEntirety</code> construct is based on the following meta-pattern of nested lookarounds:</p> <blockquote> <p><code>(?&lt;=(?=^pattern$).*)</code></p> <blockquote> <p>" <em>I can see a place somewhere behind me where I can look ahead and see</em> <code>^pattern$</code> "</p> </blockquote> </blockquote> <p>The name "lookaround" implies the relativity to our current position: we're looking <em>around</em> us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.</p> <p>Abstracting this meta-pattern into <code>assertEntirety</code> is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into <code>assertEntirety</code>, which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.</p> <p><strong><em>Lesson</em></strong>: Consider abstracting meta-patterns to hide complexity and convey semantics.</p> <hr> <h3>Appendix: On infinite-length lookbehind in Java</h3> <p>Observant readers would notice that <code>assertEntirety</code> contains a <code>.*</code> in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"<sup>(*wink*)</sup> could also consider it a "hidden feature".</p> <p>It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.</p> <h3>Related questions</h3> <ul> <li><a href="https://stackoverflow.com/questions/1536915/regex-look-behind-without-obvious-maximum-length-in-java">Regex look-behind without obvious maximum length in Java</a></li> </ul>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload