StackOverflow2013

Using C++11 regex to capture the contents of a context-free-grammar file
<h1>Preface</h1>
<p>I'm trying to write my own context-free-grammar specification to associate with the rules of my lexer/parser. It is meant to be similar to <a href="http://www.antlr.org/" rel="nofollow">ANTLR</a>'s, where upper-case identifiers are lexer rules and lower-case identifiers are parser rules. Lexer rules should accept any combination of string literals and/or regular expressions, and parser rules should accept any combination of lexer/regex rules and/or other parser identifiers. Each rule is in the format <strong><em>&lt;identifier&gt;:&lt;expression&gt;;</em></strong><br/></p>
<p>Here's an example of the grammar:</p>
<pre><code>integer   : DIGIT+;            //parser rule containing at least one lexer rule
twodigits : DIGIT DIGIT;       //parser rule containing two consecutive lexer rules
DIGIT     : [0-9];             //lexer rule containing regex
string    : '"' CHAR* '"';     //parser rule containing zero or more
                               // lexer rules, wrapped in two string literals
CHAR      : (LCHAR|UCHAR);     //lexer rule containing two lexer rules which
                               // will later evaluate to one of two tokens
LCHAR     : [a-z];             //lexer rule containing regex
UCHAR     : [A-Z];             //lexer rule containing regex
SPACE     : ' ';               //lexer rule containing string literal
</code></pre>
<p><br/></p>
<hr>
<h1>Problem</h1>
<p>The trouble I'm having is matching the expression strings, since their contents vary.<br/>
I originally wrote:<br>
<code>([a-zA-Z0-9_]*)(?:\s*)(?:\:)(?:\s*)((?:\'?).*(?:\'?)(?:\;))</code><br/>
as the match rule, which does okay for a single string literal expression surrounded by single quotes, but I need to expand this to allow for multiple non-greedy string literals, and combined statements separated by any amount of whitespace.
I am not concerned with matching potential regexes within a matched expression, or even capturing separate parts of the expression, as this is handled later on by a separate regex operation, so really I just need to <em>validate</em> identifiers and expressions...</p>
<p><strong><em>All in all</em></strong>, I need the regex_search operation to look through the grammar's contents, using the following syntax for matches:</p>
<ul>
<li><strong>A valid identifier</strong>, starting with one or more lowercase or uppercase letters, optionally followed by any number of alphanumeric characters (which can optionally contain any number of underscore characters in between, as long as the identifier does not start or end with one).</li>
<li><strong>Any number of</strong> whitespace characters, tabs, newlines etc, without capturing it.</li>
<li><strong>A colon</strong> without capturing it.</li>
<li><strong>Any number of</strong> whitespace characters, tabs, newlines etc, without capturing it.</li>
<li><strong>At least one of</strong>: (in any order) any number of string literals (enclosed in single quotes, without capturing the quotes), any number of lexer/parser identifiers, any number of regexes (enclosed in square brackets).
The result of this match rule should capture the entire expression as a single string, which will later go through a post-processing stage.</li>
<li><strong>Any number of</strong> whitespace characters, tabs, newlines etc, without capturing it.</li>
<li><strong>A semicolon</strong> optionally followed by any uncaptured whitespace.</li>
<li><strong>Optionally, any</strong> number of uncaptured spaces followed by a single captured line comment.</li>
<li><strong>Any number of</strong> whitespace characters, tabs, newlines etc, without capturing it.</li>
</ul>
<hr>
<h1>Question</h1>
<p>Is it possible to place this into a single regex_search operation?<br/>
I've messed around in <a href="http://www.ultrapico.com/Expresso.htm" rel="nofollow">Expresso</a> and just can't seem to get it right...</p>
<hr>
<h1>Update</h1>
<p>So far, I've been able to come up with the following:</p>
<pre><code>#/////////////////////
# Identifier
#/////////////////////
(
    (?:[a-zA-Z]+)               # At least one lower/uppercase letter
    (?:
        (?:[a-zA-Z0-9_]*)       # Zero or more alphanumeric/underscore characters,
        (?:\w+)                 # explicitly followed by one or more alphanumeric
    )?                          # characters
)

#/////////////////////
# Separator
#/////////////////////
(?:\s*)                         # Any amount of uncaptured whitespace
(?:\:)                          # An uncaptured colon
(?:\s*)                         # Any amount of uncaptured whitespace

#///////////////////////
# Expression
#///////////////////////
(
    # String Literals:
    (?:\'?)                     # An optional single quote,
    (?:                         # which is meant to start and end a string
        (?:[^'\\] | \\.)*       # literal, but causes several problems for
    )                           # me (see comments below, after this code block)
    (?:\'?)

    # Other expressions
    # ????????????
)

#/////////////////////
# Line End
#/////////////////////
(?:\s*)                         # Any amount of uncaptured whitespace
(?:\;)                          # An uncaptured semicolon
(?:\s*)                         # Any amount of uncaptured whitespace
</code></pre>
<p>As you can see, I have <em>identifiers</em>, <em>separators</em> and <em>line-ends</em> working perfectly.
But expressions are where I'm totally stuck!<br/><br/> <strong>How can I tell the regex library that I want <em>EITHER</em> a non-greedy string literal, <em>OR</em> any set of characters before the Line End, <em>AND</em> any number of them in any order?</strong><br/><br/> Even if I only allowed a single string literal, how would I say <em>"The closing single quote is NOT optional if the first one exists"</em>?</p>
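<p>To make the goal concrete, here is a rough C++11 sketch of how I imagine the final search being driven. It is only a starting point under stated assumptions, not a verified solution: the names <code>GrammarRule</code> and <code>parse_rules</code> are made up for illustration, and the pattern treats an expression as a run of complete quoted literals, complete bracketed classes, or any other single character that is not a quote, an opening bracket, or a semicolon.</p>

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for one matched rule (not part of any library).
struct GrammarRule {
    std::string identifier;  // e.g. "DIGIT"
    std::string expression;  // raw expression text, e.g. "[0-9]"
    std::string comment;     // trailing line comment, empty if absent
};

// Hypothetical helper: scans the grammar text and returns every rule found.
std::vector<GrammarRule> parse_rules(const std::string& grammar) {
    // Identifier: starts with a letter; underscores only between alphanumerics.
    const std::string ident = R"re([a-zA-Z][a-zA-Z0-9]*(?:_+[a-zA-Z0-9]+)*)re";

    // One expression item: a complete '...' literal, a complete [...] class,
    // or any single character that cannot open a literal/class or end the rule.
    const std::string item =
        R"re('(?:[^'\\]|\\.)*'|\[(?:[^\]\\]|\\.)*\]|[^'\[;])re";

    // Group 1: identifier, group 2: expression (lazy, so whitespace before
    // the semicolon is not captured), group 3: optional line comment.
    const std::regex rule(
        "(" + ident + ")" + R"re(\s*:\s*)re" +
        "((?:" + item + ")+?)" +
        R"re(\s*;[ \t]*(//[^\n]*)?)re");

    std::vector<GrammarRule> rules;
    for (std::sregex_iterator it(grammar.begin(), grammar.end(), rule), end;
         it != end; ++it) {
        // An unmatched optional group yields an empty string from str().
        rules.push_back({(*it)[1].str(), (*it)[2].str(), (*it)[3].str()});
    }
    return rules;
}
```

<p>Calling <code>parse_rules</code> on the example grammar above should yield one entry per rule, with the expression text left untouched for later post-processing. Because each item either consumes a whole quoted literal or refuses to consume a quote at all, an opening quote without a closing one makes the rule fail to match instead of being silently accepted. One known gap: it does not skip standalone comment lines, so a comment containing something like <code>name: text;</code> would be picked up as a rule.</p>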
