<h2><em>“Answer me these questions four, as all were answered long before.”</em></h2>

<p>You really should have asked one question, not four. But here are the answers.</p>

<ol>
<li><p>All UTF transforms <strong>by definition</strong> support <em>all</em> Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet <em>claim</em> they are UTF-16, and UCS-2 is severely broken in several fundamental ways:</p>

<ul>
<li>UCS-2 is not a valid Unicode encoding.</li>
<li>UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.</li>
<li>UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
<ul>
<li>all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF</li>
<li>the 32 non-character code points between U+FDD0 and U+FDEF</li>
<li>both sentinels at U+FFFE and U+FFFF</li>
</ul></li>
</ul>

<p>For what encoding is used internally by seven different programming languages, see slide 7 on <em>Feature Support Summary</em> in my OSCON talk from last week entitled <a href="http://training.perl.com/OSCON2011/index.html">“Unicode Support Shootout”</a>. It varies a great deal.</p></li>

<li><p>UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:</p>

<ul>
<li>UTF-8 is the <em>de facto</em> standard Unicode encoding on the web.</li>
<li>UTF-8 can be stored in a null-terminated string.</li>
<li>UTF-8 is free of the vexing BOM issue.</li>
<li>UTF-8 risks no confusion of UCS-2 vs UTF-16.</li>
<li>UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16; see the sketch after this list. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.</li>
<li>If need be, it <em>can</em> be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.</li>
</ul>

<p>I use UTF-8 whenever I can get away with it.</p></li>

<li><p>I have already given properties of UTF-8, so here are some for the other two:</p>

<ul>
<li>UTF-32 enjoys a singular advantage for <em>internal</em> storage: O(1) access to code point N. That is, constant-time access when you need random access. Remember that we lived forever with O(N) access in C’s <code>strlen</code> function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential rather than random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.</li>
<li><strong>UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either.</strong> It is grudgingly true that when properly handled, UTF-16 can certainly be <em>made</em> to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both of these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.</li>
</ul>

<p>That’s why I’ve come to talk about there being a <em>UTF-16 Curse</em>. The only thing worse than <em>The UTF-16 Curse</em> is <em>The UCS-2 Curse</em>.</p></li>

<li><p>Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.</p></li>
</ol>
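<p>To put rough numbers on a few of the claims above, here is a minimal Python 3 sketch using only the standard library; the markup snippet in it is invented purely for illustration. It compares the UTF-8 and UTF-16 byte counts of markup-heavy CJK text, shows a supplementary-plane code point turning into a surrogate pair, and shows the BOM that the generic UTF-16 codec prepends:</p>

<pre><code># 1. Mostly-ASCII markup around CJK text: UTF-8 often ends up smaller than UTF-16.
#    (This snippet is made up for illustration.)
snippet = '&lt;p class="note"&gt;这是一个简单的测试&lt;/p&gt;'
print(len(snippet.encode("utf-8")))      # 47 bytes
print(len(snippet.encode("utf-16-le")))  # 58 bytes

# 2. A supplementary-plane character is one code point but two UTF-16 code units
#    (a surrogate pair); UCS-2 cannot represent it at all.
print(len("\U0001F600"))                       # 1 code point
print("\U0001F600".encode("utf-16-le").hex())  # 3dd800de: surrogate pair D83D, DE00

# 3. The generic UTF-16 codec has to prepend a BOM so readers can tell the byte
#    order; UTF-8 has no BOM issue at all.
print("A".encode("utf-16").hex())  # fffe4100 (or feff0041 on a big-endian build)
print("A".encode("utf-8").hex())   # 41
</code></pre>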
<p>I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange. Anything that makes you get at code units instead of code points is far, far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. <strong>You want a code-point interface, not a code-unit interface.</strong> Once your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.</p>

<hr>

<h1>Category Errors</h1>

<p>Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much <strong>NOT</strong> “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire of characters but the rules for handling them. Not just more characters, but rather more <strong>characters that have particular rules accompanying them.</strong> Unicode characters without Unicode rules are no longer Unicode characters.</p>

<p>If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, <code>re</code>, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do, you break the rules and it is no longer Unicode. It’s just a mess.</p>

<p>For example, Unicode defines U+03C3 <code>GREEK SMALL LETTER SIGMA</code> and U+03C2 <code>GREEK SMALL LETTER FINAL SIGMA</code> as case-insensitive versions of each other. (This is called <em>Unicode casefolding</em>.) But since neither changes when blindly mapped to lowercase, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need casefolding betrays a shaky understanding of the whole works.</p>
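<p>A minimal Python 3 illustration of that distinction, using nothing beyond the standard library: <code>str.lower()</code> performs casemapping, while <code>str.casefold()</code> performs the casefolding that case-insensitive comparison actually needs.</p>

<pre><code>sigma, final_sigma = "\u03c3", "\u03c2"   # σ and ς

# Casemapping: lowercasing changes neither character, so they compare unequal.
print(sigma.lower() == final_sigma.lower())        # False
# Casefolding: ς folds to σ, so the comparison succeeds.
print(sigma.casefold() == final_sigma.casefold())  # True

# Full casefolding also handles one-to-many mappings such as ß to "ss".
print("straße".lower()    == "STRASSE".lower())     # False
print("straße".casefold() == "STRASSE".casefold())  # True
</code></pre>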
<p>(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is <a href="http://pypi.python.org/pypi/regex">Matthew Barnett’s marvelous <code>regex</code> library for Python 2 and Python 3</a>. It is really quite neat, and it actually gets Unicode casefolding right, amongst many other Unicode things that the standard <code>re</code> gets miserably wrong.)</p>

<p><strong>REMEMBER:</strong> Unicode is <em>not</em> just more characters: <strong>Unicode is <em>rules</em> for handling more characters.</strong> One either learns to work <em>with</em> Unicode, or else one works against it, and if one works against it, then <em>it</em> works against <em>you</em>.</p>
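<p>(A postscript for anyone who wants to see that library in action: the sketch below assumes you have installed the third-party <code>regex</code> package with <code>pip install regex</code>, and the expected results are based on its documented casefolding behaviour.)</p>

<pre><code>import regex  # third-party package; this is not the standard-library re

# With IGNORECASE plus FULLCASE, matching uses full Unicode casefolding, so
# "strasse" and "straße" are treated as case-insensitive equals.
m = regex.fullmatch("strasse", "straße",
                    flags=regex.IGNORECASE | regex.FULLCASE)
print(bool(m))  # expected: True

# Simple casefolding pairs such as σ and ς also match case-insensitively.
m = regex.fullmatch("\u03c3", "\u03c2", flags=regex.IGNORECASE)
print(bool(m))  # expected: True
</code></pre>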