Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?
<p>I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986.</p>
<p>I get from the user a URL in UTF-8. So if they've typed in <code>http://➡.ws/♥</code> I get <code>'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'</code> in Python. And what I want out is the ASCII version: <code>'http://xn--hgi.ws/%E2%99%A5'</code>.</p>
<p>What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different <code>urllib.quote()</code> calls.</p>
<pre><code># url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&amp;?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'
</code></pre>
<p>Is this correct? Any better suggestions? Is there a simple standard-library function to do this?</p>
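For comparison, here is a minimal sketch of the same split, encode, and reassemble idea using only the standard library. It assumes Python 3 (where `urllib.quote` has moved to `urllib.parse.quote`) and an input that is already a decoded `str`, not UTF-8 bytes. The function name `url_to_ascii` is hypothetical, and the choice of `safe` characters is an assumption rather than a full reading of RFC 3986. Userinfo (`user:pass@`) is dropped, and the built-in `idna` codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_to_ascii(url: str) -> str:
    """Return an ASCII-only form of `url` (hypothetical helper, a sketch)."""
    parts = urlsplit(url)
    if not parts.scheme or parts.hostname is None:
        raise ValueError("not an absolute URL: %r" % url)

    # IDNA-encode the host; raises UnicodeError for labels the codec rejects.
    host = parts.hostname.encode("idna").decode("ascii")

    # Rebuild the netloc from host and optional port (userinfo is ignored).
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)

    # Percent-encode non-ASCII and unsafe characters as UTF-8.
    # '%' is kept safe so existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&?/%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(url_to_ascii("http://➡.ws/♥"))
```

Letting `urlsplit` do the parsing replaces the regex, so hosts with TLDs longer than six characters and URLs with fragments are handled without special cases.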