<pre><code>s = u'中文'
return s.encode('utf-8')
</code></pre>
<p>This returns a non-Unicode byte string. That's what <code>encode</code> does. utf-8 is not a thing that magically turns data into Unicode; if anything, it's the opposite - a way of representing Unicode (an abstraction) in bytes (data, more or less).</p>
<p>We need a bit of terminology here. To <strong>encode</strong> is to take a Unicode string and make a byte string that <em>represents</em> it, using some kind of encoding. To <strong>decode</strong> is the reverse: taking a byte string (that we think encodes a Unicode string) and <em>interpreting</em> it as a Unicode string, using a specified encoding.</p>
<p>When we encode to a byte string and then decode using the same encoding, we get the original Unicode back.</p>
<p><code>utf-8</code> is one possible encoding. There are many, many more.</p>
<p>Sometimes Python 2 will report a <code>UnicodeDecodeError</code> when you call <code>encode</code>. Why? Because you tried to <code>encode</code> a byte string. The proper input for this process is a Unicode string, so Python "helpfully" tries to <code>decode</code> the byte string to Unicode first. But it doesn't know what codec to use, so it assumes <code>ascii</code>. This codec is the safest choice in an environment where you could receive all kinds of data. It simply reports an error for bytes &gt;= 128, which are handled in a gazillion different ways in various 8-bit encodings. (Remember trying to import a Word file with letters like <code>é</code> from a Mac to a PC or vice versa, way back in the day? You'd get some other weird symbol on the other computer, because the platform's built-in encoding was different.)</p>
<p>Making things even more complicated, in Python 2 the <code>encode</code>/<code>decode</code> mechanism is also used to implement some other neat things that have nothing to do with interpreting Unicode. For example, there is a Base64 encoder, and a codec that handles string escape sequences (i.e. it will change a backslash followed by the letter 't' into a tab). Some of these <strong>do</strong> "encode" or "decode" from a byte string to a byte string, or from Unicode to Unicode.</p>
<p>(<strong>By the way, this all works completely differently - much more clearly, IMHO - in Python 3.</strong>)</p>
<p>Similarly, when <code>__unicode__</code> returns a byte string (which it <strong>should not</strong>, as a matter of style), the Python <code>unicode()</code> built-in function automatically decodes it as <code>ascii</code>; and when <code>__str__</code> returns a Unicode string (which again it <strong>should not</strong>), <code>str()</code> will encode it as <code>ascii</code>. This happens behind the scenes, in code you cannot control. However, you can fix <code>__unicode__</code> and <code>__str__</code> to do what they are supposed to do.</p>
<p>(You can, in fact, override the encoding for <code>unicode</code> by passing a second parameter. However, this is the wrong solution here, since you should already have a Unicode string returned from <code>__unicode__</code>. And <code>str</code> doesn't take an encoding parameter, so you're out of luck there.)</p>
<p>So, now we can solve the problem.</p>
<p>Problem: We want <code>__unicode__</code> to return the Unicode string <code>u'中文'</code>, and we want <code>__str__</code> to return the <code>utf-8</code>-encoded version of that.</p>
<p>Solution: return that string directly in <code>__unicode__</code>, and do the encoding explicitly in <code>__str__</code>:</p>
<pre><code>class test():
    def __unicode__(self):
        # Return the Unicode string as-is.
        return u'中文'

    def __str__(self):
        # Encode explicitly, so Python never falls back to ascii.
        return unicode(self).encode('utf-8')
</code></pre>
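<p>To make the encode/decode terminology concrete, here is a minimal sketch. It assumes Python 3 rather than the Python 2 discussed above, so <code>str</code> is the Unicode type and <code>bytes</code> is the byte string; the helper name <code>decode_as_ascii</code> is made up for illustration. It shows the round trip through <code>utf-8</code>, and the <code>UnicodeDecodeError</code> that the <code>ascii</code> codec raises for bytes &gt;= 128.</p>

```python
# Python 3 sketch (an assumption; the answer above targets Python 2).
# Here str is Unicode text and bytes is encoded data.
text = '中文'

# Encode: Unicode string -> byte string that represents it.
encoded = text.encode('utf-8')

# Decode with the same encoding: we get the original Unicode back.
round_tripped = encoded.decode('utf-8')


def decode_as_ascii(data):
    """Hypothetical helper: decode bytes as ascii, or name the error raised."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        # ascii rejects every byte >= 128.
        return type(exc).__name__


ascii_result = decode_as_ascii(encoded)

print(repr(encoded))          # b'\xe4\xb8\xad\xe6\x96\x87'
print(round_tripped == text)  # True
print(ascii_result)           # UnicodeDecodeError
```

<p>The failing <code>ascii</code> decode is roughly what Python 2 does implicitly when you call <code>encode</code> on a byte string; Python 3 does not perform that hidden step, so the decode has to be written out here.</p>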
 
