<p>…there's a reason they're called "encodings"…</p>
<p>A little preamble: think of Unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is Latin capital A. №937 is Greek capital omega. Just that.</p>
<p>In order for a computer to store or manipulate Unicode, it has to <em>encode</em> it into bytes. The most straightforward <em>encoding</em> of Unicode is UCS-4: every character occupies 4 bytes, and all ~1.1 million code points are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. There are also limited encodings, like "latin1", which cover only a small range of characters (256 at most), mostly those used in Western European languages. Such <em>encodings</em> use only one byte per character.</p>
<p>Basically, Unicode can be <em>encoded</em> with many encodings, and encoded strings can be <em>decoded</em> to Unicode. The thing is, Unicode came quite late, so all of us who grew up using an 8-bit <em>character set</em> learned too late that all this time we had been working with <em>encoded</em> strings. The encoding could be ISO8859-1, or DOS CP437, or CP850, or, or, or, depending on our system default.</p>
<p>So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you are actually using a string already <em>encoded</em> according to your system's default codepage (from the byte \x93 I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to <em>decode</em> the string from the "cp1252" encoding.</p>
<p>So, what you meant to do was:</p>
<pre><code>"add \x93Monitoring\x94 to list".decode("cp1252", "ignore") </code></pre>
<p>It's unfortunate that Python 2.x includes an <code>.encode</code> method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.</p>
<p>Anyway, all you have to remember for your to-and-fro Unicode conversions is:</p>
<ul>
<li>a Unicode string gets <em>encoded</em> to a Python 2.x string (actually, a sequence of bytes)</li>
<li>a Python 2.x string gets <em>decoded</em> to a Unicode string</li>
</ul>
<p>In both cases, you need to specify the <em>encoding</em> that will be used.</p>
<p>I'm not being very clear, I'm sleepy, but I sure hope this helps.</p>
<p>PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't either. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it, people! Make your apps Unicode-aware, for the good of mankind. :)</p>
<p>PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by Chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications. Cheers!</p>
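<p>For anyone who wants to try the round trip, here is a minimal sketch. It assumes Python 3 rather than the 2.x discussed above: there, <code>bytes</code> plays the role of the Python 2.x string and <code>str</code> is the Unicode string. The variable names are mine, and "utf-32-be" stands in for UCS-4 as the fixed 4-bytes-per-character encoding.</p>

```python
# Python 3 sketch: `bytes` plays the role of the Python 2.x string,
# `str` is the Unicode string. All names here are illustrative.

# Unicode is a table: one number for every character.
assert ord("A") == 65
assert chr(937) == "\u03a9"  # Greek capital omega

# The bytes as a cp1252 system would have stored the literal.
raw = b"add \x93Monitoring\x94 to list"

# bytes -> Unicode: decode, naming the encoding the bytes are in.
text = raw.decode("cp1252")

# Unicode -> bytes: encode, naming the target encoding.
utf8 = text.encode("utf-8")      # one to four bytes per character
ucs4 = text.encode("utf-32-be")  # always four bytes per character

# A limited one-byte encoding cannot represent the curly quotes.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(text), len(raw), len(utf8), len(ucs4), latin1_ok)
# prints: 24 24 28 96 False
```

<p>The 24 characters take 24 bytes in cp1252, 28 in UTF-8 (each curly quote needs three bytes) and 96 in the 4-byte encoding, while "latin-1" refuses them outright.</p>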