Outside of the syntax clarification on `re.match`, I think I understand that you are struggling with taking two or more unknown (user input) regexes and classifying which is the more 'specific' match against a string.

Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl's. Perl's regexes have recursion, backtracking, and other forms that defy trivial inspection. Indeed, a rogue regex can be used as a form of [denial of service attack](http://en.wikipedia.org/wiki/ReDoS).

To see this on your own computer, try:

```
>>> re.match(r'^(a+)+$','a'*24+'!')
```

That takes about 1 second on my computer. Now increase the `24` in `'a'*24` to a somewhat larger number, say `28`. That takes a lot longer. Try `48`... You will probably need to hit Ctrl+C now. The time increase as the number of a's increases is, in fact, exponential.
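If you want to watch the blow-up yourself, here is a minimal timing sketch (a sketch only: the loop bounds are an arbitrary choice, and the exact timings will differ from machine to machine):

```python
import re
import time

# Time the pathological pattern against failing inputs of increasing length.
# The trailing '!' forces every match attempt to fail, which triggers
# exhaustive backtracking; each extra 'a' roughly doubles the running time.
for n in range(20, 27):
    start = time.time()
    re.match(r'^(a+)+$', 'a' * n + '!')
    print "n = %2d : %7.3f seconds" % (n, time.time() - start)
```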
You can read more about this issue in [Russ Cox](http://swtch.com/~rsc/)'s wonderful paper ['Regular Expression Matching Can Be Simple And Fast'](http://swtch.com/~rsc/regexp/regexp1.html). Russ Cox is the Google engineer who built [Google Code Search](http://www.google.com/codesearch) in 2006. As Cox observes, consider matching the regex `'a?'*33 + 'a'*33` against the string `'a'*99` with awk and Perl (or Python or PCRE or Java or PHP or ...). Awk matches in 200 microseconds, but Perl would require 10^15 *years* because of exponential backtracking.

So the conclusion is: *it depends!* What do you mean by a *more specific* match? Look at some of Cox's regex simplification techniques in [RE2](http://swtch.com/~rsc/regexp/regexp3.html). If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.

If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think the answer is that it is very, very hard (i.e., [this problem](http://perl.plover.com/NPC/NPC-3SAT.html) is [NP-complete](http://en.wikipedia.org/wiki/NP-complete)).

**Edit**

**Everything I said above is true!** However, here is a stab at sorting matching regular expressions based on one form of 'specific': how many edits it takes to get from the regex to the string. The greater the number of edits (or the higher the Levenshtein distance), the less 'specific' the regex is.

You be the judge of whether this works (I don't know what 'specific' means to you for your application):

```python
import re

def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = range(n+1)
    for i in range(1, m+1):
        previous, current = current, [i] + [0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little']

for reg in regs:
    m = re.search(reg, s)
    if m:
        print "'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0))
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        print "   %i edits regex->match(0), %i edits match(0)->s" % (ld1, ld2)
        print "   score: ", score
        d[reg] = score
        print
    else:
        print "'%s' does not match '%s'" % (reg, s)

print " ===== %s ===== === %s ===" % ('RegEx'.center(10), 'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k, v): (v, k)):
    print " %22s %5s" % (key, value)
```

The program takes a list of regexes and matches each against the string `Mary had a little lamb`.

Here is the sorted ranking from "most specific" to "least specific":

```
 =====   RegEx    ===== ===   Score    ===
 Mary had a little lamb     0
      Mary.*little lamb     7
          .*little lamb    11
            little lamb    11
    .*[lL]ittle [Ll]amb    15
             \blittle\b    16
                 little    16
                   Mary    18
                \b\w+mb    18
                   lamb    18
                     .*    22
```

This is based on the (perhaps simplistic) assumption that: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring is the result of wildcard expansions or replacements; b) the number of edits to get from the matching substring to the initial string; the score just takes the larger of the two.

As two simple examples:

1. `.*` (or `.*.*` or `.*?.*`, etc.) against any string requires a large number of edits to get to the string, in fact equal to the string length. This is the maximum possible number of edits, the highest score, and the least 'specific' regex.
2. The regex of the string itself against the string is as specific as possible. No edits are needed to change one into the other, resulting in a 0, the lowest score.
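As a quick sanity check of those two extremes, you can call the `ld()` helper from the listing above directly (the literal values in the comments assume the same `s`):

```python
# The wildcard-only regex needs as many edits as the string is long,
# while the regex that is the string itself needs none.
s = 'Mary had a little lamb'
print ld('.*', s)   # 22 -- every character of s has to be edited in
print ld(s, s)      # 0  -- the regex already equals the string
```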
As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short strings don't work well because the wildcard may be longer than the string.

**Edit 2**

I got anchor parsing to work pretty darn well using the undocumented `sre_parse` module in Python. Type `>>> help(sre_parse)` if you want to read more...

This is the go-to worker module underlying the `re` module. It has been in every Python distribution since 2001, including all the P3k versions. It *may* go away, but I don't think that is likely...

Here is the revised listing:

```python
import re
import sre_parse

def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = range(n+1)
    for i in range(1, m+1):
        previous, current = current, [i] + [0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little',
        r'^.*lamb', r'.*.*.*b', r'.*?.*', r'.*\b[lL]ittle\b \b[Ll]amb',
        r'.*\blittle\b \blamb$', '^' + s + '$']

for reg in regs:
    m = re.search(reg, s)
    if m:
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        for t, v in sre_parse.parse(reg):
            if t == 'at':               # anchor token
                score -= 1              # every anchor gives back 1 edit
                if v == 'at_boundary':  # \b anchors get a further adjustment
                    score -= 2
        d[reg] = score
    else:
        print "'%s' does not match '%s'" % (reg, s)
        print

print " ===== %s ===== === %s ===" % ('RegEx'.center(15), 'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k, v): (v, k)):
    print " %27s %5s" % (key, value)
```

And the sorted regexes:

```
 =====      RegEx      ===== ===   Score    ===
      Mary had a little lamb     0
    ^Mary had a little lamb$     0
        .*\blittle\b \blamb$     6
           Mary.*little lamb     7
   .*\b[lL]ittle\b \b[Ll]amb    10
                  \blittle\b    10
               .*little lamb    11
                 little lamb    11
         .*[lL]ittle [Ll]amb    15
                     \b\w+mb    15
                      little    16
                     ^.*lamb    17
                        Mary    18
                        lamb    18
                     .*.*.*b    21
                          .*    22
                       .*?.*    22
```
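For a peek at what the anchor-scoring loop is actually iterating over, you can dump the token stream from `sre_parse` yourself. A small sketch (note that this answer's code is Python 2, where the opcodes compare equal to plain strings such as `'at'` and `'at_boundary'`; newer Python 3 releases expose them as named constants, so the `t == 'at'` comparison may need adjusting there):

```python
import sre_parse

# Print the (opcode, argument) pairs for an anchored pattern.  The anchors
# appear as 'at' tokens with values like 'at_beginning', 'at_boundary'
# and 'at_end', which is what the scoring loop above keys off.
for t, v in sre_parse.parse(r'^.*\blamb$'):
    print t, v
```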