<h1>Better regex syntax ideas</h1>
<p>I need some help to complete my idea about regexes.</p>
<h2>Introduction</h2>
<p>There was a <a href="https://stackoverflow.com/questions/1579023/design-of-an-alternative-fluent-interface-for-regular-expressions">question about better syntax</a> for regexes on SE, but I don't think I'd use the fluent syntax. It's surely nice for newbies, but in the case of a complicated regex, you replace a line of gibberish with a whole page of slightly better gibberish. I like the <a href="http://martinfowler.com/bliki/ComposedRegex.html" rel="nofollow noreferrer">approach by Martin Fowler</a>, where a regex gets composed of smaller pieces. His solution is readable, but hand-made; he proposes a smart way to build a complicated regex rather than a class supporting it.</p>
<p>I'm trying to turn it into a class, using something like the following (see his example first):</p>
<pre><code>final MyPattern pattern = MyPattern.builder()
    .caseInsensitive()
    .define("numberOfPoints", "\\d+")
    .define("numberOfNights", "\\d+")
    .define("hotelName", ".*")
    .define(' ', "\\s+")
    .build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");
MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
System.out.println(m.group("numberOfPoints")); // prints 400
</code></pre>
<p>where the fluent syntax is used for combining regexes, extended as follows:</p>
<ul>
  <li>define named patterns and use them by enclosing them in backticks
    <ul>
      <li><code>`name`</code> creates a named group
        <ul>
          <li>mnemonic: the shell captures the result of the command enclosed in backticks</li>
        </ul></li>
      <li><code>`:name`</code> creates a non-capturing group
        <ul>
          <li>mnemonic: similar to <code>(?:</code>...<code>)</code></li>
        </ul></li>
      <li><code>`-name`</code> creates a backreference
        <ul>
          <li>mnemonic: the dash connects it to the previous occurrence</li>
        </ul></li>
    </ul></li>
  <li>redefine individual characters and use them everywhere unless quoted
    <ul>
      <li>here only some characters (e.g., <code>~ @#%</code>) are allowed
        <ul>
          <li>redefining <code>+</code> or <code>(</code> would be extremely confusing, so it's not allowed</li>
          <li>redefining space to mean any spacing is very natural in the example above</li>
          <li>redefining a character could make the pattern more compact, which is good unless overused</li>
          <li>e.g., using something like <code>define('#', "\\\\")</code> for matching backslashes could make the pattern much more readable</li>
        </ul></li>
    </ul></li>
  <li>redefine some quoted sequences like <code>\s</code> or <code>\w</code>
    <ul>
      <li>the standard definitions are <a href="https://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261">not Unicode-conformant</a></li>
      <li>sometimes you might have your own idea of what a word or a space is</li>
    </ul></li>
</ul>
<p>The named patterns serve as a sort of local variables, helping to decompose a complicated expression into small and easy-to-understand pieces.
A properly named pattern often makes a comment unnecessary.</p>
<h2>Questions</h2>
<p>The above shouldn't be hard to implement (I've already done most of it) and could be really useful, I hope. <em>Do you think so?</em></p>
<p>However, I'm not sure how it should behave inside brackets; sometimes it's meaningful to use the definitions and sometimes not, e.g., in</p>
<pre><code>.define(' ', "\\s") // a blank character
.define('~', "/\\**[^*]+\\*/") // an inline comment (simplified)
.define("something", "[ ~\\d]")
</code></pre>
<p>expanding the space to <code>\s</code> makes sense, but expanding the tilde doesn't. <em>Maybe there should be a separate syntax for defining one's own character classes?</em></p>
<p><em>Can you think of some examples where the named patterns are very useful or not useful at all?</em> I need some borderline cases and some ideas for improvement.</p>
<h1>Reaction to tchrist's answer</h1>
<h2>Comments on his objections</h2>
<ol>
  <li>Lack of multiline pattern strings.
    <ul>
      <li>There are no multiline strings in Java, which I'd like to change, but cannot.</li>
    </ul></li>
  <li>Freedom from insanely onerous and error-prone double-backslashing...
    <ul>
      <li>This is again something I can't change; I can only offer a workaround, see below.</li>
    </ul></li>
  <li>Lack of compile-time exceptions on invalid regex literals, and lack of compile-time caching of correctly compiled regex literals.
    <ul>
      <li>As regexes are just a part of the standard library and not of the language itself, there's nothing that can be done here.</li>
    </ul></li>
  <li>No debugging or profiling facilities.
    <ul>
      <li>I can do nothing here.</li>
    </ul></li>
  <li>Lack of compliance with UTS#18.
    <ul>
      <li>This can easily be solved by redefining the corresponding patterns as I proposed. It's not perfect, since in the debugger you'll see the blown-up replacements.</li>
    </ul></li>
</ol>
<p>It looks like you don't like Java. I'd be happy to see some syntax improvements there, but there's nothing I can do about it.
I'm looking for something that works with current Java.</p>
<h2>RFC 5322</h2>
<p>Your example can easily be written using my syntax:</p>
<pre><code>final MyPattern pattern = MyPattern.builder()
    .define(" ", "") // ignore spaces
    .useForBackslash('#') // (1): see (2)
    .define("address", "`mailbox` | `group`")
    .define("WSP", "[\u0020\u0009]")
    .define("DQUOTE", "\"")
    .define("CRLF", "\r\n")
    .define("DIGIT", "[0-9]")
    .define("ALPHA", "[A-Za-z]")
    .define("NO_WS_CTL", "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
    ...
    .define("domain_literal", "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS`?") // (2): see (1)
    ...
    .define("group", "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
    .define("angle_addr", "`CFWS`? &lt; `addr_spec` &gt; `CFWS`?")
    .define("name_addr", "`display_name`? `angle_addr`")
    .define("mailbox", "`name_addr` | `addr_spec`")
    .define("address", "`mailbox` | `group`")
    .build("`address`");
</code></pre>
<h2>Disadvantages</h2>
<p>While rewriting your example I encountered the following issues:</p>
<ul>
  <li>As there are no <code>\xdd</code> escape sequences, <code>\udddd</code> must be used</li>
  <li>Using another character instead of the backslash is a bit strange</li>
  <li>As I prefer to write it bottom-up, I had to reverse the order of your lines</li>
  <li>Without much idea of what it does, I expect I've made some errors</li>
</ul>
<p>On the bright side:</p>
<ul>
  <li>Ignoring spaces is no problem</li>
  <li>Comments are no problem</li>
  <li>The readability is good</li>
</ul>
<p>And most important: <strong><em>It's plain Java and uses the existing regex engine as is.</em></strong></p>
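<p>For illustration, here is a minimal sketch of how the backtick expansion from the introduction might be layered on top of <code>java.util.regex</code>. The class <code>BacktickPattern</code> and its methods are hypothetical names chosen for this sketch, not the <code>MyPattern</code> API shown above. It assumes names consisting of letters and digits only (Java's named groups accept nothing else, so names such as <code>angle_addr</code> would need mapping), and it leaves out character redefinition, <code>useForBackslash</code>, and cycle detection.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the MyPattern class from the post.
final class BacktickPattern {
    // Finds `name`, `:name` or `-name`. Names are letters and digits only,
    // because java.util.regex group names allow nothing else.
    private static final Pattern REF =
            Pattern.compile("`([:-]?)([A-Za-z][A-Za-z0-9]*)`");

    private final Map<String, String> definitions = new HashMap<>();

    public BacktickPattern define(String name, String regex) {
        definitions.put(name, regex);
        return this;
    }

    // Expands every backtick reference in the template into plain regex syntax.
    public String expand(String template) {
        StringBuilder out = new StringBuilder();
        Matcher m = REF.matcher(template);
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start());
            String kind = m.group(1);
            String name = m.group(2);
            if (kind.equals("-")) {
                // `-name` becomes a backreference to the named group.
                out.append("\\k<").append(name).append('>');
            } else {
                String body = definitions.get(name);
                if (body == null) {
                    throw new IllegalArgumentException("undefined pattern: " + name);
                }
                // `:name` becomes a non-capturing group, `name` a named group.
                // Definitions may refer to other definitions; cycles are not detected.
                String open = kind.equals(":") ? "(?:" : "(?<" + name + ">";
                out.append(open).append(expand(body)).append(')');
            }
            last = m.end();
        }
        return out.append(template, last, template.length()).toString();
    }

    public Pattern build(String template, int flags) {
        return Pattern.compile(expand(template), flags);
    }

    public static void main(String[] args) {
        Pattern p = new BacktickPattern()
                .define("numberOfPoints", "\\d+")
                .define("numberOfNights", "\\d+")
                .define("hotelName", ".*")
                .build("score\\s+`numberOfPoints`\\s+for\\s+`numberOfNights`"
                        + "\\s+nights?\\s+at\\s+`hotelName`", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
        if (m.matches()) {
            System.out.println(m.group("numberOfPoints")); // prints 400
        }
    }
}
```

<p>Because the sketch emits ordinary named groups, using the capturing form <code>`name`</code> twice for the same name makes <code>Pattern.compile</code> reject the result; the second occurrence has to be written as <code>`:name`</code> or <code>`-name`</code>.</p>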