<h2><em>“Answer me these questions four, as all were answered long before.”</em></h2>

<p>You really should have asked one question, not four. But here are the answers.</p>

<ol>
<li><p>All UTF transforms <strong>by definition</strong> support <em>all</em> Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet <em>claim</em> they are UTF-16, and UCS-2 is severely broken in several fundamental ways:</p>

<ul>
<li>UCS-2 is not a valid Unicode encoding.</li>
<li>UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.</li>
<li>UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
<ul>
<li>all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF</li>
<li>the 32 non-character code points between U+FDD0 and U+FDEF</li>
<li>both sentinels at U+FFFE and U+FFFF</li>
</ul></li>
</ul>

<p>For what encoding is used internally by seven different programming languages, see slide 7 on <em>Feature Support Summary</em> in my OSCON talk from last week entitled <a href="http://training.perl.com/OSCON2011/index.html">“Unicode Support Shootout”</a>. It varies a great deal.</p></li>

<li><p>UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:</p>

<ul>
<li>UTF-8 is the <em>de facto</em> standard Unicode encoding on the web.</li>
<li>UTF-8 can be stored in a null-terminated string.</li>
<li>UTF-8 is free of the vexing BOM issue.</li>
<li>UTF-8 risks no confusion of UCS-2 vs UTF-16.</li>
<li>UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16; see the sketch after this list. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.</li>
<li>If need be, it <em>can</em> be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.</li>
</ul>

<p>I use UTF-8 whenever I can get away with it.</p></li>

<li><p>I have already given properties of UTF-8, so here are some for the other two:</p>

<ul>
<li>UTF-32 enjoys a singular advantage for <em>internal</em> storage: O(1) access to code point N. That is, constant-time access when you need random access. Remember that we lived forever with O(N) access in C’s <code>strlen</code> function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential rather than random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.</li>
<li><strong>UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either.</strong> It is grudgingly true that when properly handled, UTF-16 can certainly be <em>made</em> to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both of these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.</li>
</ul>

<p>That’s why I’ve come to talk about there being a <em>UTF-16 Curse</em>. The only thing worse than <em>The UTF-16 Curse</em> is <em>The UCS-2 Curse</em>.</p></li>

<li><p>Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.</p></li>
</ol>
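<p>To put rough numbers on a few of the claims above, here is a minimal Python 3 sketch using only the standard library; the markup snippet in it is invented purely for illustration. It compares the UTF-8 and UTF-16 byte counts of markup-heavy CJK text, shows a supplementary-plane code point turning into a surrogate pair, and shows the BOM that the generic UTF-16 codec prepends:</p>

<pre><code># 1. Mostly-ASCII markup around CJK text: UTF-8 often ends up smaller than UTF-16.
#    (This snippet is made up for illustration.)
snippet = '&lt;p class="note"&gt;这是一个简单的测试&lt;/p&gt;'
print(len(snippet.encode("utf-8")))      # 47 bytes
print(len(snippet.encode("utf-16-le")))  # 58 bytes

# 2. A supplementary-plane character is one code point but two UTF-16 code units
#    (a surrogate pair); UCS-2 cannot represent it at all.
print(len("\U0001F600"))                       # 1 code point
print("\U0001F600".encode("utf-16-le").hex())  # 3dd800de: surrogate pair D83D, DE00

# 3. The generic UTF-16 codec has to prepend a BOM so readers can tell the byte
#    order; UTF-8 has no BOM issue at all.
print("A".encode("utf-16").hex())  # fffe4100 (or feff0041 on a big-endian build)
print("A".encode("utf-8").hex())   # 41
</code></pre>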
<p>I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange. Anything that makes you get at code units instead of code points is far, far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. <strong>You want a code-point interface, not a code-unit interface.</strong> Once your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.</p>

<hr>

<h1>Category Errors</h1>

<p>Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much <strong>NOT</strong> “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire of characters but the rules for handling them. Not just more characters, but rather more <strong>characters that have particular rules accompanying them.</strong> Unicode characters without Unicode rules are no longer Unicode characters.</p>

<p>If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, <code>re</code>, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do, you break the rules and it is no longer Unicode. It’s just a mess.</p>

<p>For example, Unicode defines U+03C3 <code>GREEK SMALL LETTER SIGMA</code> and U+03C2 <code>GREEK SMALL LETTER FINAL SIGMA</code> as case-insensitive versions of each other. (This is called <em>Unicode casefolding</em>.) But since neither changes when blindly mapped to lowercase, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need casefolding betrays a shaky understanding of the whole works.</p>
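<p>A minimal Python 3 illustration of that distinction, using nothing beyond the standard library: <code>str.lower()</code> performs casemapping, while <code>str.casefold()</code> performs the casefolding that case-insensitive comparison actually needs.</p>

<pre><code>sigma, final_sigma = "\u03c3", "\u03c2"   # σ and ς

# Casemapping: lowercasing changes neither character, so they compare unequal.
print(sigma.lower() == final_sigma.lower())        # False
# Casefolding: ς folds to σ, so the comparison succeeds.
print(sigma.casefold() == final_sigma.casefold())  # True

# Full casefolding also handles one-to-many mappings such as ß to "ss".
print("straße".lower()    == "STRASSE".lower())     # False
print("straße".casefold() == "STRASSE".casefold())  # True
</code></pre>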
<p>(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is <a href="http://pypi.python.org/pypi/regex">Matthew Barnett’s marvelous <code>regex</code> library for Python 2 and Python 3</a>. It is really quite neat, and it actually gets Unicode casefolding right, amongst many other Unicode things that the standard <code>re</code> gets miserably wrong.)</p>

<p><strong>REMEMBER:</strong> Unicode is <em>not</em> just more characters: <strong>Unicode is <em>rules</em> for handling more characters.</strong> One either learns to work <em>with</em> Unicode, or else one works against it, and if one works against it, then <em>it</em> works against <em>you</em>.</p>
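<p>(A postscript for anyone who wants to see that library in action: the sketch below assumes you have installed the third-party <code>regex</code> package with <code>pip install regex</code>, and the expected results are based on its documented casefolding behaviour.)</p>

<pre><code>import regex  # third-party package; this is not the standard-library re

# With IGNORECASE plus FULLCASE, matching uses full Unicode casefolding, so
# "strasse" and "straße" are treated as case-insensitive equals.
m = regex.fullmatch("strasse", "straße",
                    flags=regex.IGNORECASE | regex.FULLCASE)
print(bool(m))  # expected: True

# Simple casefolding pairs such as σ and ς also match case-insensitively.
m = regex.fullmatch("\u03c3", "\u03c2", flags=regex.IGNORECASE)
print(bool(m))  # expected: True
</code></pre>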