  1. URLs matching regexp very slow on some strings
    <p>Here is my regexp for finding URLs in a string (I need the group for the domain because further actions are based on the domain). I noticed that for some strings (the runs of 'f' in this example) it is very slow. Is there something obvious I'm missing?</p>
<pre><code>&gt;&gt;&gt; import re
&gt;&gt;&gt; URL_ALLOWED = r"[a-z0-9$-_.+!*'(),%]"
&gt;&gt;&gt; URL_RE = re.compile(
...     r'(?:(?:https?|ftp):\/\/)?'  # protocol
...     r'(?:www.)?'  # www
...     r'('  # host - start
...     r'(?:'
...     r'[a-z0-9]'  # first character of domain('-' not allowed)
...     r'(?:'
...     r'[a-z0-0-]*'  # characters in the middle of domain
...     r'[a-z0-9]'  # last character of domain('-' not allowed)
...     r')*'
...     r'\.'  # dot before next part of domain name
...     r')+'
...     r'[a-z]{2,10}'  # TLD
...     r'|'  # OR
...     r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'  # IP address
...     r')'  # host - end
...     r'(?::[0-9]+)?'  # port
...     r'(?:\/%(allowed_chars)s+/?)*'  # path
...     r'(?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&amp;)*'  # GET params
...     r'%(allowed_chars)s+=%(allowed_chars)s+)?'  # last GET param
...     r'(?:#[^\s]*)?' % {  # anchor
...         'allowed_chars': URL_ALLOWED
...     },
...     re.IGNORECASE
... )
&gt;&gt;&gt; from time import time
&gt;&gt;&gt; strings = [
...     'foo bar baz',
...     'blah blah blah blah blah blah',
...     'f' * 10,
...     'f' * 20,
...     'f' * 30,
...     'f' * 40,
... ]
&gt;&gt;&gt; def t():
...     for string in strings:
...         t1 = time()
...         URL_RE.findall(string)
...         print string, time() - t1
...
&gt;&gt;&gt; t()
foo bar baz 3.91006469727e-05
blah blah blah blah blah blah 6.98566436768e-05
ffffffffff 0.000313997268677
ffffffffffffffffffff 0.183916091919
ffffffffffffffffffffffffffffff 178.445468903
</code></pre>
<p>I know another solution is to use a very simple regexp (for example, any word that contains dots) and then use urlparse to get the domain, but urlparse doesn't work as expected when the URL has no protocol:</p>
<pre><code>&gt;&gt;&gt; from urlparse import urlparse
&gt;&gt;&gt; urlparse('example.com')
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
&gt;&gt;&gt; urlparse('http://example.com')
ParseResult(scheme='http', netloc='example.com', path='', params='', query='', fragment='')
&gt;&gt;&gt; urlparse('example.com/test/test')
ParseResult(scheme='', netloc='', path='example.com/test/test', params='', query='', fragment='')
&gt;&gt;&gt; urlparse('http://example.com/test/test')
ParseResult(scheme='http', netloc='example.com', path='/test/test', params='', query='', fragment='')
&gt;&gt;&gt; urlparse('example.com:1234/test/test')
ParseResult(scheme='example.com', netloc='', path='1234/test/test', params='', query='', fragment='')
&gt;&gt;&gt; urlparse('http://example.com:1234/test/test')
ParseResult(scheme='http', netloc='example.com:1234', path='/test/test', params='', query='', fragment='')
</code></pre>
<p>Prepending http:// is also a solution (I'm still not 100% sure there are no other urlparse issues), but I'm curious what's wrong with this regexp anyway.</p>
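<p>For reference, the alternative mentioned above (a very simple candidate regexp, then urlparse with http:// prepended when no protocol is present) might look roughly like the sketch below. It is written for Python 3, where urlparse lives in urllib.parse. The names CANDIDATE_RE, SCHEME_RE, extract_domain and find_domains are hypothetical helpers invented for this illustration, and the candidate pattern is only an assumption about what "a word that contains dots" could mean; it is not a validated URL matcher.</p>

```python
import re
from urllib.parse import urlparse  # Python 3 location of urlparse

# Hypothetical candidate pattern: a run of non-space characters containing
# at least one dot. It has no nested quantifiers.
CANDIDATE_RE = re.compile(r'\S+\.\S+')

# Hypothetical check for an explicit "scheme://" prefix.
SCHEME_RE = re.compile(r'^[a-z][a-z0-9+.-]*://', re.IGNORECASE)


def extract_domain(candidate):
    """Return the host part of a URL-like string, or None if parsing fails."""
    if not SCHEME_RE.match(candidate):
        # Without a scheme, urlparse puts the host into `path` (or `scheme`),
        # so a default scheme is prepended first.
        candidate = 'http://' + candidate
    try:
        # .hostname drops the port and lowercases the host.
        return urlparse(candidate).hostname
    except ValueError:
        # urlparse can reject malformed input, e.g. unbalanced IPv6 brackets.
        return None


def find_domains(text):
    """Collect domains for every dotted candidate found in `text`."""
    return [extract_domain(match) for match in CANDIDATE_RE.findall(text)]


if __name__ == '__main__':
    print(find_domains('visit example.com:1234/test/test or http://example.com'))
    print(find_domains('f' * 40))
```

<p>This only sidesteps the slow pattern; it accepts many strings that are not real URLs (anything with a dot in it), so the extracted host would still need whatever validation the application requires.</p>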