StackOverflow2013


plurals

PORegex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
primarykey
Id
1178173
data
AcceptedAnswerId
1787243
AnswerCount
5
ClosedDate
CommentCount
3
CommunityOwnedDate
2010-03-16T12:35:17.410
CreationDate
2009-07-24T14:46:15.653
FavoriteCount
10
LastActivityDate
2016-07-20T20:22:17.393
LastEditDate
2016-07-20T20:22:17.393
LastEditorUserId
42921
OwnerUserId
42921
ParentId
0
PostTypeId
1
Score
15
ViewCount
1335
LastEditorDisplayName
text
Body
Edit 2: For a practical demonstration of why this remains important, look no further than <a href="http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016" rel="nofollow noreferrer">Stack Overflow's own regex-caused outage today (2016-07-20)</a>! Edit: This question has evolved considerably since I first asked it. See below for two fast and compatible, but not fully featured, implementations. If you know of more or better implementations, please mention them; there still isn't an ideal implementation here yet! <h1>Where can I find a reliably fast regex implementation?</h1> Does anyone know of a normal non-backtracking (<code>System.Text.RegularExpressions</code> backtracks) linear-time regex implementation, either for .NET or native and reasonably usable from .NET? To be useful, it would need to: <ul> <li>have a worst-case time complexity of regex evaluation of O(m*n), where m is the length of the regex and n the length of the input.</li> <li>have a normal time complexity of O(n), since almost no regular expressions actually trigger the exponential state space, or, if they can, only do so on a minute subset of the input.</li> <li>have a reasonable construction speed (i.e. no potentially exponential DFAs)</li> <li>be intended for use by human beings, not mathematicians - e.g. I don't want to reimplement Unicode character classes: .NET or PCRE style character classes are a plus.</li> </ul> <h2>Bonus Points:</h2> <ul> <li>bonus points for practicality if it implements stack-based features which let it handle nesting at the expense of consuming O(n+m) memory rather than O(m) memory.</li> <li>bonus points for either capturing subexpressions or replacements (if there are an exponential number of possible subexpression matches, then enumerating all of them is inherently exponential - but enumerating the first few shouldn't be, and similarly for replacements). 
You can work around missing either feature by using the other, so having either one is sufficient.</li> <li>lotsa bonus points for treating regexes as first class values (so you can take the union, intersection, concatenation, negation - in particular negation and intersection, as those are very hard to do by string manipulation of the regex definition)</li> <li>lazy matching, i.e. matching on unlimited streams without putting it all in memory, is a plus. If the streams don't support seeking, capturing subexpressions and/or replacements aren't (in general) possible in a single pass.</li> <li>Backreferences are out; they are fundamentally unreliable, i.e. they can always exhibit exponential behavior given pathological input cases.</li> </ul> Such algorithms exist (this is basic automata theory...) - but are there any practically usable implementations accessible from .NET? <h2>Background: (you can skip this)</h2> I like using regexes for quick and dirty text clean-ups, but I've repeatedly run into issues where the common backtracking NFA implementation used by Perl/Java/Python/.NET shows exponential behavior. These cases are unfortunately rather easy to trigger as soon as you start automatically generating your regular expressions. Even non-exponential performance can become exceedingly poor when you alternate between regexes that match the same prefix - for instance, in a really basic example, if you take a dictionary and turn it into a regular expression, expect terrible performance. For a quick overview of why better implementations exist and have since the 60s, see <a href="http://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow noreferrer">Regular Expression Matching Can Be Simple And Fast</a>. <h2>Not quite practical options:</h2> <ul> <li>Almost ideal: <a href="http://www.let.rug.nl/vannoord/Fsa/" rel="nofollow noreferrer">FSA toolkit</a>. Can compile regexes to fast C implementations of DFAs and NFAs, allows transducers(!) 
too, has first class regexes (encapsulation yay!) including syntax for intersection and parametrization. But it's in Prolog... (why is something with these kinds of practical features not available in a mainstream language???)</li> <li>Fast but impractical: a full parser, such as the excellent <a href="http://www.antlr.org/" rel="nofollow noreferrer">ANTLR</a>, generally supports reliably fast regexes. However, ANTLR's syntax is far more verbose, and of course permits constructs that may not generate valid parsers, so you'd need to find some safe subset.</li> </ul> <h1>Good implementations:</h1> <ul> <li><a href="http://code.google.com/p/re2/" rel="nofollow noreferrer">RE2</a> - a Google open-source library aiming for reasonable PCRE compatibility minus backreferences. I think this is the successor to the Unix port of Plan 9's regex lib, given the author.</li> <li><a href="http://laurikari.net/tre/" rel="nofollow noreferrer">TRE</a> - also mostly compatible with PCRE and even does backreferences, although using those you lose speed guarantees. And it has a mega-nifty approximate matching mode!</li> </ul> Unfortunately, both implementations are native code (RE2 is C++, TRE is C) and would require interop to use from .NET.
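To make the O(m*n) requirement concrete, here is a minimal sketch of the set-of-states (Thompson-style) NFA simulation described in the linked "Regular Expression Matching Can Be Simple And Fast" article. It is written in Python for brevity rather than for .NET, and it is not the API of RE2, TRE, or any other library mentioned here: the class `NFA`, its combinators (`lit`, `cat`, `alt`, `opt`, `star`), and the helper `pathological` are all hypothetical names chosen for illustration. Patterns are built with combinators instead of being parsed, and only whole-string matching on single characters is supported.

```python
class NFA:
    """Illustrative Thompson-style NFA; states are integer indices."""

    def __init__(self):
        self.eps = []   # eps[s]: states reachable from s by an epsilon edge
        self.char = []  # char[s]: (character, target) edge, or None

    def new_state(self):
        self.eps.append([])
        self.char.append(None)
        return len(self.eps) - 1

    # Each combinator returns a (start, accept) fragment. Fragments are
    # single-use: combining a fragment rewires it, so build a fresh one
    # whenever the same sub-pattern is needed again.
    def lit(self, c):
        s, a = self.new_state(), self.new_state()
        self.char[s] = (c, a)
        return s, a

    def cat(self, f, g):
        self.eps[f[1]].append(g[0])
        return f[0], g[1]

    def alt(self, f, g):
        s, a = self.new_state(), self.new_state()
        self.eps[s] += [f[0], g[0]]
        self.eps[f[1]].append(a)
        self.eps[g[1]].append(a)
        return s, a

    def opt(self, f):
        s, a = self.new_state(), self.new_state()
        self.eps[s] += [f[0], a]
        self.eps[f[1]].append(a)
        return s, a

    def star(self, f):
        s, a = self.new_state(), self.new_state()
        self.eps[s] += [f[0], a]
        self.eps[f[1]] += [f[0], a]
        return s, a

    def closure(self, states):
        """All states reachable from `states` via epsilon edges only."""
        stack, seen = list(states), set(states)
        while stack:
            for t in self.eps[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def fullmatch(self, frag, text):
        """Track every live state at once: O(m) work per input character."""
        start, accept = frag
        current = self.closure({start})
        for ch in text:
            nxt = set()
            for s in current:
                edge = self.char[s]
                if edge is not None and edge[0] == ch:
                    nxt.add(edge[1])
            current = self.closure(nxt)
            if not current:
                return False  # no live states left, so no match is possible
        return accept in current


def pathological(nfa, n):
    """Build (a?){n}a{n}, a classic worst case for backtracking matchers."""
    frag = nfa.opt(nfa.lit("a"))
    for _ in range(n - 1):
        frag = nfa.cat(frag, nfa.opt(nfa.lit("a")))
    for _ in range(n):
        frag = nfa.cat(frag, nfa.lit("a"))
    return frag


if __name__ == "__main__":
    nfa = NFA()
    frag = pathological(nfa, 30)
    print(nfa.fullmatch(frag, "a" * 30))
    print(nfa.fullmatch(frag, "a" * 29))
```

Because the matcher never revisits an input character, the pathological pattern costs about as much as any other pattern of the same size. The price of this simplicity is that the sketch has no captures, character classes, or streaming input, which is exactly the gap the requirements above are asking a real implementation to fill.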
Tags
<.net><regex><performance><big-o>
Title
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USEamon Nerbonne
UserOwnerUserId
1. USEamon Nerbonne
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
2. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PORegex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
 UserUserId
 USAhmad Mageed
 VoteTypeVoteTypeId
 VTFavorite
2. VO
 singulars
 PostPostId
 PORegex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
 UserUserId
 USEamon Nerbonne
 VoteTypeVoteTypeId
 VTBountyStart
3. VO
 singulars
 PostPostId
 PORegex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
