<p>Here's a script that demonstrates three separate issues:</p>
<ul>
<li>non-ASCII characters in Python source code</li>
<li>non-ASCII characters in the URL</li>
<li>non-ASCII characters in the HTML content</li>
</ul>
<pre><code># -*- coding: utf-8 -*-
import urllib
from StringIO import StringIO

import pycurl

title = u"UNIX时间"  # 1
url = "https://zh.wikipedia.org/wiki/" + urllib.quote(title.encode('utf-8'))  # 2

c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Accept:"])

b = StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()

data = b.getvalue()  # bytes
print len(data), repr(data[:200])

html_page_charset = "utf-8"  # 3
html_text = data.decode(html_page_charset)
print html_text[:200]  # 4
</code></pre>
<p>Note: all occurrences of <code>utf-8</code> in the code are completely independent of each other.</p>
<ol>
<li><p>Unicode literals use whatever character encoding you declared at the top of the file. Make sure your text editor saves the file in that encoding.</p></li>
<li><p>The path in the URL should be encoded using <code>utf-8</code> before it is percent-encoded (urlencoded).</p></li>
<li><p>There are several ways to find out an HTML page's charset. See <a href="https://en.wikipedia.org/wiki/Character_encodings_in_HTML" rel="nofollow">Character encodings in HTML</a>. Some libraries, such as <a href="http://docs.python-requests.org/en/latest/" rel="nofollow"><code>requests</code></a> mentioned by @Oz123, do it automatically:</p>
<pre><code># -*- coding: utf-8 -*-
import requests

r = requests.get(u"https://zh.wikipedia.org/wiki/UNIX时间")
print len(r.content), repr(r.content[:200])  # bytes
print r.encoding
print r.text[:200]  # Unicode
</code></pre></li>
<li><p><a href="http://wiki.python.org/moin/PrintFails" rel="nofollow">To print Unicode to the console</a> you could use the <a href="http://docs.python.org/using/cmdline.html#envvar-PYTHONIOENCODING" rel="nofollow"><code>PYTHONIOENCODING</code> environment variable</a> to set a character encoding that your terminal understands.</p></li>
</ol>
<p>See also <a href="http://www.joelonsoftware.com/Articles/Unicode.html" rel="nofollow">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a> and the Python-specific <a href="http://nedbatchelder.com/text/unipain.html" rel="nofollow">Pragmatic Unicode</a>.</p>
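<p>Points 2 and 3 can be tried without a network connection. The following is only a minimal sketch, assuming Python 3 (the script above targets Python 2), where <code>urllib.quote</code> is available as <code>urllib.parse.quote</code>. The <code>data</code> bytes are a made-up stand-in for a server response, and the charset is assumed rather than detected.</p>

```python
# Python 3 sketch; the answer's own script is Python 2.
from urllib.parse import quote, unquote

title = "UNIX时间"

# Point 2: encode the path as UTF-8, then percent-encode the bytes.
# quote() does both for str input; encoding="utf-8" is its default.
url = "https://zh.wikipedia.org/wiki/" + quote(title, encoding="utf-8")
print(url)

# Reversing the percent-encoding recovers the original title.
assert unquote(url, encoding="utf-8").endswith(title)

# Hypothetical stand-in for the bytes a server would return.
data = "<title>UNIX时间</title>".encode("utf-8")

# Point 3: decode the bytes with the page's charset to get Unicode text.
html_page_charset = "utf-8"  # assumed here; normally read from headers or <meta>
html_text = data.decode(html_page_charset)

# The byte count and the character count differ for non-ASCII content.
print(len(data), len(html_text))
```

<p>Each Chinese character occupies three bytes in UTF-8, which is why the byte length and the character length printed at the end are not the same.</p>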
 
