Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p><a href="http://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">Regular Expression Matching Can Be Simple And Fast</a> by Russ Cox is a great introduction to building recognizers with finite state machines. He shows how to go from regular expression to non-deterministic finite automata to deterministic finite automata. His reference implementation uses a <a href="http://en.wikipedia.org/wiki/Directed_graph" rel="nofollow">directed graph</a> (similar to a linked list but each node can have more than one "out" reference and may reference any other node including itself) versus a linked list. There are other ways to model a state machine including:</p> <ul> <li><a href="http://en.wikipedia.org/wiki/Automata-based_programming" rel="nofollow">Big switch statements and transition tables</a></li> <li><a href="http://swtch.com/~rsc/regexp/regexp2.html" rel="nofollow">A virtual machine</a></li> </ul> <p>You build a lexer/scanner by combining several recognizers. For example, imagine an assignment-only language with the following tokens defined by regular expressions:</p> <ul> <li>identifier : [a-zA-Z]+</li> <li>assignment : =</li> <li>number: [0-9]+</li> <li>keyword : and</li> </ul> <p>As you scan input left-to-right, move through the transitions in each token's machine based on the current symbol. When there are no valid transitions for a symbol, pick the last machine in an accepting state. All symbols scanned until that state are part of the token. Scanning starts again at the symbol after the last accepted symbol. If there are no valid transitions for a new token, the input is invalid and the lexer should report an error. If there is more than one machine in an accepting state, a precedence rule should decide which token is used.</p> <p>Note: these steps describe a lexer that always returns the longest possible match. You could also design a lexer that returns a token as soon as one of its machines is in an accepting state.</p> <p>Results on sample inputs:</p> <ul> <li>a=10 : (identifier a)(assignment =)(number 10)</li> <li>a = 10 : invalid - whitespace isn't accepted in our token definitions</li> <li>25=xyz : (number 25)(assignment)(identifier xyz)</li> <li>25and42 : (number 25)(keyword and)(number 42) - assumes keyword takes precedence over identifier</li> <li>b=1+2 : invalid - '+' isn't accepted in our token definitions</li> <li>andx=8 : (identifier andx)(assignment)(number 8) - longest match gives us (identifier andx) versus (keyword and)(identifier x)</li> </ul> <p>Notice that lexical analysis just returns tokens. It doesn't know if the tokens are used properly. '25=xyz' may not make any sense but we have to wait until the parsing phase to know for sure.</p> <p>As an additional resource, Dick Grune offers the first edition of <a href="http://www.cs.vu.nl/~dick/PTAPG.html" rel="nofollow">Parsing Techniques - A Practical Guide</a> as Postscript and Pdf.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload