<p>The problem is not with <code>codecs.open</code>; it is with passing to <code>.write</code> a byte string that (given the <code>\xd0</code> code in it) is clearly encoded in some <code>ISO-8859-*</code> or related codec.</p>
<p><a href="http://docs.python.org/library/urllib2.html?highlight=urllib2.urlopen#urllib2.urlopen" rel="nofollow noreferrer">urllib2.urlopen</a> returns a response object which, besides file-like behavior, has the extra method:</p>
<blockquote> <p><code>info()</code>: return the meta-information of the page, such as headers, in the form of an <code>httplib.HTTPMessage</code> instance (see <a href="http://www.cs.tut.fi/~jkorpela/http.html" rel="nofollow noreferrer">Quick Reference to HTTP Headers</a>)</p> </blockquote>
<p>In particular, the <code>Content-Type</code> header, for text-like contents, should have a <code>charset</code> parameter specifying the encoding it uses, e.g. <code>Content-Type: text/html; charset=ISO-8859-4</code>. You need to parse out the <code>charset</code> and use it to decode the contents into Unicode, so that your <code>codecs.open</code>ed file-like object always gets unicode arguments to <code>write</code> and properly writes them out in <code>utf-8</code>.</p>
<p>If <code>charset</code> is missing, or using it to decode the text results in errors (suggesting <code>charset</code> is wrong), as a last resort you can try the <a href="http://chardet.feedparser.org/" rel="nofollow noreferrer">Universal Encoding Detector</a>, which uses heuristics for the purpose (after all, many pages on the web have horrible metadata errors, as well as broken HTML and so forth).</p>
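<p>The decode-then-write steps described above could be sketched roughly as follows. This is a minimal illustration written for Python 3, not the asker's original code. The helper names <code>charset_from_content_type</code>, <code>decode_body</code> and <code>save_utf8</code> are hypothetical, the header is parsed with the standard library's <code>email.message.Message</code>, and <code>chardet</code> is treated as an optional third-party package that may not be installed.</p>

```python
import io
from email.message import Message

try:
    import chardet  # optional third-party detector (assumption: installed separately)
except ImportError:
    chardet = None


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, e.g. "iso-8859-4"


def decode_body(raw, content_type=None, last_resort="utf-8"):
    """Decode raw bytes to text: declared charset first, then heuristics."""
    declared = charset_from_content_type(content_type) if content_type else None
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # the declared charset is unknown or does not fit the bytes

    if chardet is not None:
        guessed = chardet.detect(raw).get("encoding")
        if guessed:
            try:
                return raw.decode(guessed)
            except (LookupError, UnicodeDecodeError):
                pass  # the heuristic guess did not work either

    # Hypothetical final fallback: never raise, mark undecodable bytes instead.
    return raw.decode(last_resort, errors="replace")


def save_utf8(path, text):
    """Write already-decoded text to disk, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

<p>With a real response, the header value would come from the response object rather than a literal string. In Python 2 that is usually written as <code>response.info().getparam('charset')</code>, and in Python 3 as <code>response.headers.get_content_charset()</code>. Either way, only decoded text should reach <code>write</code>.</p>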
 
