Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>If you would like to go the regex route...</p> <p>RFC-3986 is the authority regarding URIs. <a href="http://tools.ietf.org/html/rfc3986#appendix-B" rel="nofollow noreferrer">Appendix B</a> provides this regex to break one down into its components:</p> <pre class="lang-py prettyprint-override"><code>re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?" # Where: # scheme = $2 # authority = $4 # path = $5 # query = $7 # fragment = $9 </code></pre> <p>Here is an enhanced, Python friendly version which utilizes named capture groups. It is presented in a function within a working script:</p> <pre class="lang-py prettyprint-override"><code>import re def get_domain(url): """Return top two domain levels from URI""" re_3986_enhanced = re.compile(r""" # Parse and capture RFC-3986 Generic URI components. ^ # anchor to beginning of string (?: (?P&lt;scheme&gt; [^:/?#\s]+):// )? # capture optional scheme (?:(?P&lt;authority&gt; [^/?#\s]*) )? # capture optional authority (?P&lt;path&gt; [^?#\s]*) # capture required path (?:\?(?P&lt;query&gt; [^#\s]*) )? # capture optional query (?:\#(?P&lt;fragment&gt; [^\s]*) )? # capture optional fragment $ # anchor to end of string """, re.MULTILINE | re.VERBOSE) re_domain = re.compile(r""" # Pick out top two levels of DNS domain from authority. (?P&lt;domain&gt;[^.]+\.[A-Za-z]{2,6}) # $domain: top two domain levels. (?::[0-9]*)? # Optional port number. $ # Anchor to end of string. """, re.MULTILINE | re.VERBOSE) result = "" m_uri = re_3986_enhanced.match(url) if m_uri and m_uri.group("authority"): auth = m_uri.group("authority") m_domain = re_domain.search(auth) if m_domain and m_domain.group("domain"): result = m_domain.group("domain"); return result data_list = [ r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file", r"http://mail.domainname.org/abc/abc/aaa", r"http://domainname.edu", r"http://domainname.com:80", r"http://domainname.com?query=one", r"http://domainname.com#fragment", r"www.domainname.com#fragment", r"https://domainname.com#fragment", ] cnt = 0 for data in data_list: cnt += 1 print("Data[%d] domain = \"%s\"" % (cnt, get_domain(data))) </code></pre> <p>For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: <a href="http://jmrware.com/articles/2009/uri_regexp/URI_regex.html" rel="nofollow noreferrer">Regular Expression URI Validation</a></p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload