<p>Nested or recursive structures are typically beyond the power of regexes to parse, and you usually need a more powerful parser. The problem is that the next token you need to find changes depending on previous tokens, which is not something a regular expression can handle (the language is no longer regular).</p>
<p>However, for such a simple language you don't need a full-fledged parser generator with a formal grammar; you can easily write a simple parser by hand. Only one bit of state matters: the last opened tag. If you have a regular expression that matches text, a new open tag, or the close tag corresponding to the currently open tag, you can handle this task. The rules are:</p>
<ol>
<li>If you match text, save the text and continue matching.</li>
<li>If you match an open tag, save the open tag, and continue matching until you find an open tag or a corresponding close tag.</li>
<li>If you match a close tag, stop looking for the currently open tag and continue matching for the last-unclosed tag, text, or another open tag.</li>
</ol>
<p>Step two is recursive: whenever you find a new open tag, you create a new matching context that looks for the corresponding close tag.</p>
<p>This isn't required, but a parser will generally produce a simple tree structure to represent the parsed text; this is known as an abstract syntax tree. It's usually better to produce a syntax tree first, before you produce what the syntax represents. This gives you the flexibility to manipulate the tree or to produce different outputs (e.g., you could output something other than XML).</p>
<p>Here is a solution that combines both these ideas and parses your text. (It also recognizes <code>{{</code> or <code>}}</code> as escape sequences meaning a single literal <code>{</code> or <code>}</code>.)</p>
<p>First the parser:</p>
<pre class="lang-php prettyprint-override"><code>// Named MarkupParseError because PHP 7+ already defines a built-in ParseError class.
class MarkupParseError extends RuntimeException {}

function str_to_ast($s, $offset=0, $ast=array(), $opentag=null) {
    if ($opentag) {
        $qot = preg_quote($opentag, '%');
        $re_text_suppl = '[^{'.$qot.']|{{|'.$qot.'[^}]';
        $re_closetag = '|(?&lt;closetag&gt;'.$qot.'\})';
    } else {
        $re_text_suppl = '[^{]|{{';
        $re_closetag = '';
    }
    $re_next = '%
        (?:\{(?P&lt;opentag&gt;[^{\s]))  # match an open tag
            # which is "{" followed by anything other than whitespace or another "{"
        '.$re_closetag.'  # if we have an open tag, match the corresponding close tag, e.g. "-}"
        |(?P&lt;text&gt;(?:'.$re_text_suppl.')+)  # match text
            # we allow non-matching close tags to act as text (no escape required)
            # you can change this to produce a parse error instead
        %ux';
    while ($offset &lt; strlen($s)) {
        if (preg_match($re_next, $s, $m, PREG_OFFSET_CAPTURE, $offset)) {
            list($totalmatch, $offset) = $m[0];
            $offset += strlen($totalmatch);
            unset($totalmatch);
            if (isset($m['opentag']) &amp;&amp; $m['opentag'][1] !== -1) {
                // Open tag: parse its contents recursively
                list($newopen, $_) = $m['opentag'];
                list($subast, $offset) = str_to_ast($s, $offset, array(), $newopen);
                $ast[] = array($newopen, $subast);
            } else if (isset($m['text']) &amp;&amp; $m['text'][1] !== -1) {
                // Plain text: store it as a text node
                list($text, $_) = $m['text'];
                $ast[] = array(null, $text);
            } else if ($opentag &amp;&amp; isset($m['closetag']) &amp;&amp; $m['closetag'][1] !== -1) {
                // Close tag for the currently open tag: return to the caller
                return array($ast, $offset);
            } else {
                throw new MarkupParseError("Bug in parser!");
            }
        } else {
            throw new MarkupParseError("Could not parse past offset: $offset");
        }
    }
    return array($ast, $offset);
}

function parse($s) {
    list($ast, $offset) = str_to_ast($s);
    return $ast;
}
</code></pre>
<p>This will produce an abstract syntax tree which is a list of "nodes", where each node is an array of the form <code>array(null, $string)</code> for text or <code>array('-', array(...))</code> (i.e. the type code and another list of nodes) for stuff inside tags.</p>
<p>Once you have this tree you can do anything you want with it. For example, we can traverse it recursively to produce a DOM tree:</p>
<pre class="lang-php prettyprint-override"><code>function ast_to_dom($ast, ?DOMNode $n = null) {
    if ($n === null) {
        $dd = new DOMDocument('1.0', 'utf-8');
        $dd-&gt;xmlStandalone = true;
        $n = $dd-&gt;createDocumentFragment();
    } else {
        $dd = $n-&gt;ownerDocument;
    }
    // Map of type codes to element names
    $typemap = array(
        '*' =&gt; 'strong',
        '/' =&gt; 'em',
        '-' =&gt; 's',
        '&gt;' =&gt; 'small',
        '|' =&gt; 'code',
    );

    foreach ($ast as $astnode) {
        list($type, $data) = $astnode;
        if ($type===null) {
            $n-&gt;appendChild($dd-&gt;createTextNode($data));
        } else {
            $n-&gt;appendChild(ast_to_dom($data, $dd-&gt;createElement($typemap[$type])));
        }
    }
    return $n;
}

function ast_to_doc($ast) {
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc-&gt;xmlStandalone = true;
    $root = $doc-&gt;createElement('body');
    $doc-&gt;appendChild($root);
    ast_to_dom($ast, $root);
    return $doc;
}
</code></pre>
<p>Here is some test code with a more difficult test case:</p>
<pre><code>$sample = "tëstïng 汉字/漢字 {{ testing -} {*strông {/ëmphäsïs {-strïkë *}also strike-}/} also {|côdë|} strong *} {*wôw*} 1, 2, 3";
$ast = parse($sample);
echo ast_to_doc($ast)-&gt;saveXML();
</code></pre>
<p>This will print the following:</p>
<pre class="lang-xml prettyprint-override"><code>&lt;?xml version="1.0" encoding="utf-8" standalone="yes"?&gt;
&lt;body&gt;tëstïng 汉字/漢字 {{ testing -} &lt;strong&gt;strông &lt;em&gt;ëmphäsïs &lt;s&gt;strïkë *}also strike&lt;/s&gt;&lt;/em&gt; also &lt;code&gt;côdë&lt;/code&gt; strong &lt;/strong&gt; &lt;strong&gt;wôw&lt;/strong&gt; 1, 2, 3&lt;/body&gt;
</code></pre>
<p>If you already have a <code>DOMDocument</code> and you want to add some parsed text to it, I recommend creating a <code>DOMDocumentFragment</code>, passing it to <code>ast_to_dom</code> directly, and then appending the fragment to your desired container element.</p>
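<p>To illustrate the point about producing outputs other than XML, here is a minimal sketch that walks the same node shape (<code>array(null, $string)</code> for text, <code>array($type, array(...))</code> for tags) and renders plain text or Markdown-style text instead. The function names <code>ast_to_plain_text</code> and <code>ast_to_markdown</code> and the delimiter mapping are assumptions made for this example, not part of the answer's code, and the AST is built by hand so the snippet runs on its own.</p>

```php
<?php
// Hypothetical helper: flatten the AST into plain text, dropping all formatting.
function ast_to_plain_text(array $ast): string {
    $out = '';
    foreach ($ast as $node) {
        list($type, $data) = $node;
        if ($type === null) {
            $out .= $data;                     // text node: $data is a string
        } else {
            $out .= ast_to_plain_text($data);  // tag node: $data is a list of child nodes
        }
    }
    return $out;
}

// Hypothetical helper: render the AST with Markdown-style delimiters.
function ast_to_markdown(array $ast): string {
    // Assumed mapping of type codes to delimiters; unmapped codes (such as '>')
    // are rendered without any delimiter.
    $delims = array('*' => '**', '/' => '_', '-' => '~~', '|' => '`');
    $out = '';
    foreach ($ast as $node) {
        list($type, $data) = $node;
        if ($type === null) {
            $out .= $data;
        } else {
            $d = $delims[$type] ?? '';
            $out .= $d . ast_to_markdown($data) . $d;
        }
    }
    return $out;
}

// Hand-built AST in the node shape described above; it should correspond to
// an input such as "plain {*bold {/both/}*} {-gone-}".
$ast = array(
    array(null, 'plain '),
    array('*', array(
        array(null, 'bold '),
        array('/', array(array(null, 'both'))),
    )),
    array(null, ' '),
    array('-', array(array(null, 'gone'))),
);

echo ast_to_plain_text($ast), "\n";  // plain bold both gone
echo ast_to_markdown($ast), "\n";    // plain **bold _both_** ~~gone~~
```

<p>Because both renderers only depend on the node shape, they can sit alongside <code>ast_to_dom</code> and be swapped in without touching the parser.</p>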
 
