Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
<p>Not a day passes on SO without a question being asked about parsing (X)HTML or XML with regular expressions.</p>
<p>While it's relatively easy to come up with <a href="https://stackoverflow.com/q/701166/146792">examples that demonstrate the non-viability of regexes for this task</a> or with a <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">collection of expressions</a> to represent the concept, I still could not find on SO a <strong>formal</strong> explanation, given in layman's terms, of why this is not possible.</p>
<p>The only formal explanations I could find so far on this site are probably extremely accurate, but also quite cryptic to the self-taught programmer:</p>
<blockquote> <p>the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression)</p> </blockquote>
<p>or:</p>
<blockquote> <p>Regular expressions can only match regular languages but HTML is a context-free language.</p> </blockquote>
<p>or:</p>
<blockquote> <p>A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.</p> </blockquote>
<p>or:</p>
<blockquote> <p>The Pumping lemma for regular languages is the reason why you can't do that.</p> </blockquote>
<p>[To be fair: the majority of the above explanations link to Wikipedia pages, but these are not much easier to understand than the answers themselves.]</p>
<p>So my question is: <strong>could somebody please provide a translation in layman's terms of the formal explanations given above of why it is not possible to use regex for parsing (X)HTML/XML?</strong></p>
<p><strong>EDIT:</strong> After reading the first answer I thought that I should clarify: I am looking for a "translation" that also briefly <em>explains</em> the concepts it tries to translate: at the end of an answer, the reader should have a rough idea, for example, of what "regular language" and "context-free grammar" mean...</p>
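The "no memory apart from the state it's in" point from the quoted explanations can be made concrete with a small Python sketch. Everything in it is an illustrative assumption, not something from the quoted answers: the single `<b>` tag, the pattern name `TWO_LEVELS`, and the helpers `balanced_with_counter` and `nested`. The pattern is written out by hand for a fixed nesting depth, while the checker keeps a counter that can grow without bound. For real markup with several tag names, a stack of open tags would be needed instead of a bare counter.

```python
import re

# Hypothetical pattern: accepts <b>...</b> elements nested at most two deep.
# Supporting a third level means splicing in another copy of the pattern,
# so no single fixed pattern of this kind covers every possible depth.
TWO_LEVELS = re.compile(r"^(?:<b>(?:<b></b>)*</b>)*$")


def balanced_with_counter(text: str) -> bool:
    """Check <b>/</b> nesting using a counter as unbounded memory."""
    depth = 0
    for tag in re.findall(r"</?b>", text):
        depth += 1 if tag == "<b>" else -1
        if depth < 0:  # a closing tag appeared before its opening tag
            return False
    return depth == 0  # every opened tag must have been closed


def nested(n: int) -> str:
    """Build n levels of properly nested <b> tags."""
    return "<b>" * n + "</b>" * n


if __name__ == "__main__":
    for n in (1, 2, 3, 50):
        regex_ok = TWO_LEVELS.match(nested(n)) is not None
        print(n, regex_ok, balanced_with_counter(nested(n)))
```

The fixed pattern accepts depths 1 and 2 and rejects depth 3 even though the input is well formed, whereas the counter-based check handles any depth. The counter is the extra memory that a finite automaton lacks.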