<p>That's all part of the UTF8 encoding (which is only one encoding scheme for Unicode).</p>

<p>The size can be figured out by examining the first byte as follows:</p>

<ul>
<li>if it starts with bit pattern <code>"10" (0x80-0xbf)</code>, it's not the first byte of a sequence and you should back up until you find the start, which is any byte that starts with "0" or "11" (thanks to Jeffrey Hantin for pointing that out in the comments).</li>
<li>if it starts with bit pattern <code>"0" (0x00-0x7f)</code>, it's 1 byte.</li>
<li>if it starts with bit pattern <code>"110" (0xc0-0xdf)</code>, it's 2 bytes.</li>
<li>if it starts with bit pattern <code>"1110" (0xe0-0xef)</code>, it's 3 bytes.</li>
<li>if it starts with bit pattern <code>"11110" (0xf0-0xf7)</code>, it's 4 bytes.</li>
</ul>

<p>I'll duplicate the table showing this, but the original is on the Wikipedia UTF8 page <a href="http://en.wikipedia.org/wiki/UTF-8" rel="noreferrer">here</a>.</p>

<pre><code>+----------------+----------+----------+----------+----------+
| Unicode        | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
+----------------+----------+----------+----------+----------+
| U+0000-007F    | 0xxxxxxx |          |          |          |
| U+0080-07FF    | 110yyyxx | 10xxxxxx |          |          |
| U+0800-FFFF    | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
| U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
+----------------+----------+----------+----------+----------+
</code></pre>

<p>The Unicode characters in the above table are constructed from the bits:</p>

<pre><code>000z-zzzz yyyy-yyyy xxxx-xxxx
</code></pre>

<p>where the <code>z</code> and <code>y</code> bits are assumed to be zero where they're not given. Some bytes are considered illegal as a start byte since they're either:</p>

<ul>
<li>useless: a 2-byte sequence starting with 0xc0 or 0xc1 actually gives a code point less than 0x80 which can be represented better with a 1-byte sequence.</li>
<li>prohibited by RFC 3629, since they would start a 4-byte sequence above U+10FFFF, or a 5-byte or 6-byte sequence.
These are the bytes 0xf5 through 0xfd.</li>
<li>just unused: bytes 0xfe and 0xff.</li>
</ul>

<p>In addition, subsequent bytes in a multi-byte sequence that don't begin with the bits "10" are also illegal.</p>

<p>As an example, consider the sequence [0xf4,0x8a,0xaf,0x8d]. This is a 4-byte sequence as the first byte falls between 0xf0 and 0xf7.</p>

<pre><code>    0xf4     0x8a     0xaf     0x8d
= 11110100 10001010 10101111 10001101
       zzz   zzyyyy   yyyyxx   xxxxxx

= 1 0000 1010 1011 1100 1101
  z zzzz yyyy yyyy xxxx xxxx

= U+10ABCD
</code></pre>

<p>For your specific query with the first byte 0xe6 (length = 3), the byte sequence is:</p>

<pre><code>    0xe6     0xbe     0xb3
= 11100110 10111110 10110011
      yyyy   yyyyxx   xxxxxx

= 01101111 10110011
  yyyyyyyy xxxxxxxx

= U+6FB3
</code></pre>

<p>If you look that code up <a href="http://www.cojak.org/index.php?function=code_lookup&amp;term=6FB3" rel="noreferrer">here</a>, you'll see it's the one you had in your question: 澳.</p>

<p>To show how the decoding works, I went back to my archives to find my UTF8 handling code.
I've had to morph it a bit to make it a complete program and the encoding has been removed (since the question was really about decoding), so I hope I haven't introduced any errors from the cut and paste:</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

#define UTF8ERR_TOOSHORT -1
#define UTF8ERR_BADSTART -2
#define UTF8ERR_BADSUBSQ -3
typedef unsigned char uchar;

/* Decodes one code point from pBytes. On entry, *pLen is the number of
   bytes available; on success it is set to the length of the sequence.
   Returns the code point, or a negative UTF8ERR_* value on failure.
   Note: overlong 3-byte and 4-byte forms, surrogates (U+D800-DFFF) and
   values above U+10FFFF are not rejected here. */
static int getUtf8 (uchar *pBytes, int *pLen) {
    if (*pLen &lt; 1)
        return UTF8ERR_TOOSHORT;

    /* 1-byte sequence */
    if (pBytes[0] &lt;= 0x7f) {
        *pLen = 1;
        return pBytes[0];
    }

    /* Subsequent byte marker */
    if (pBytes[0] &lt;= 0xbf)
        return UTF8ERR_BADSTART;

    /* 2-byte sequence */
    if ((pBytes[0] == 0xc0) || (pBytes[0] == 0xc1))
        return UTF8ERR_BADSTART;
    if (pBytes[0] &lt;= 0xdf) {
        if (*pLen &lt; 2)
            return UTF8ERR_TOOSHORT;
        if ((pBytes[1] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        *pLen = 2;
        return ((int)(pBytes[0] &amp; 0x1f) &lt;&lt; 6)
            | (pBytes[1] &amp; 0x3f);
    }

    /* 3-byte sequence */
    if (pBytes[0] &lt;= 0xef) {
        if (*pLen &lt; 3)
            return UTF8ERR_TOOSHORT;
        if ((pBytes[1] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        if ((pBytes[2] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        *pLen = 3;
        return ((int)(pBytes[0] &amp; 0x0f) &lt;&lt; 12)
            | ((int)(pBytes[1] &amp; 0x3f) &lt;&lt; 6)
            | (pBytes[2] &amp; 0x3f);
    }

    /* 4-byte sequence (a 4-byte start byte carries only 3 payload bits) */
    if (pBytes[0] &lt;= 0xf4) {
        if (*pLen &lt; 4)
            return UTF8ERR_TOOSHORT;
        if ((pBytes[1] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        if ((pBytes[2] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        if ((pBytes[3] &amp; 0xc0) != 0x80)
            return UTF8ERR_BADSUBSQ;
        *pLen = 4;
        return ((int)(pBytes[0] &amp; 0x07) &lt;&lt; 18)
            | ((int)(pBytes[1] &amp; 0x3f) &lt;&lt; 12)
            | ((int)(pBytes[2] &amp; 0x3f) &lt;&lt; 6)
            | (pBytes[3] &amp; 0x3f);
    }

    return UTF8ERR_BADSTART;
}

/* Converts a lowercase hex string to a byte; returns 0 on any other character. */
static uchar htoc (char *h) {
    uchar u = 0;
    while (*h != '\0') {
        if ((*h &gt;= '0') &amp;&amp; (*h &lt;= '9'))
            u = ((u &amp; 0x0f) &lt;&lt; 4) + *h - '0';
        else
            if ((*h &gt;= 'a') &amp;&amp; (*h &lt;= 'f'))
                u = ((u &amp; 0x0f) &lt;&lt; 4) + *h + 10 - 'a';
            else
                return 0;
        h++;
    }
    return u;
}

int main (int argCount, char *argVar[]) {
    int i;
    uchar utf8[4];
    int len = argCount - 1;

    if (len != 4) {
        printf ("Usage: utf8 &lt;hex1&gt; &lt;hex2&gt; &lt;hex3&gt; &lt;hex4&gt;\n");
        return 1;
    }

    printf ("Input: (%d) %s %s %s %s\n",
        len, argVar[1], argVar[2], argVar[3], argVar[4]);

    for (i = 0; i &lt; 4; i++)
        utf8[i] = htoc (argVar[i+1]);

    printf (" Becomes: (%d) %02x %02x %02x %02x\n",
        len, utf8[0], utf8[1], utf8[2], utf8[3]);

    if ((i = getUtf8 (&amp;(utf8[0]), &amp;len)) &lt; 0)
        printf ("Error %d\n", i);
    else
        printf (" Finally: U+%x, with length of %d\n", i, len);

    return 0;
}
</code></pre>

<p>You can run it with your sequence of bytes (you'll need 4 so use 0 to pad them out) as follows:</p>

<pre><code>&gt; utf8 f4 8a af 8d
Input: (4) f4 8a af 8d
 Becomes: (4) f4 8a af 8d
 Finally: U+10abcd, with length of 4

&gt; utf8 e6 be b3 0
Input: (4) e6 be b3 0
 Becomes: (4) e6 be b3 00
 Finally: U+6fb3, with length of 3

&gt; utf8 41 0 0 0
Input: (4) 41 0 0 0
 Becomes: (4) 41 00 00 00
 Finally: U+41, with length of 1

&gt; utf8 87 0 0 0
Input: (4) 87 0 0 0
 Becomes: (4) 87 00 00 00
Error -2

&gt; utf8 f4 8a af ff
Input: (4) f4 8a af ff
 Becomes: (4) f4 8a af ff
Error -3

&gt; utf8 c4 80 0 0
Input: (4) c4 80 0 0
 Becomes: (4) c4 80 00 00
 Finally: U+100, with length of 2
</code></pre>
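<p>The program above always starts decoding at the first byte, so it never needs the "back up until you find the start" rule from the first bullet. As a rough sketch of that rule, and of reading the length off the start byte, something like the following could work. The names <code>utf8SeqLen</code> and <code>utf8FindStart</code> are made up for this illustration and aren't part of the code above; the sketch assumes the buffer index is valid and only classifies bytes, it doesn't validate the whole sequence.</p>

```c
#include <stddef.h>

typedef unsigned char uchar;

/* Hypothetical helper: length implied by a start byte, or 0 if the
   byte cannot start a sequence (continuation byte, 0xc0/0xc1, or
   0xf5-0xff). */
int utf8SeqLen (uchar lead) {
    if (lead <= 0x7f) return 1;   /* 0xxxxxxx                     */
    if (lead <= 0xbf) return 0;   /* 10xxxxxx: continuation byte  */
    if (lead <= 0xc1) return 0;   /* 0xc0/0xc1: overlong 2-byte   */
    if (lead <= 0xdf) return 2;   /* 110xxxxx                     */
    if (lead <= 0xef) return 3;   /* 1110xxxx                     */
    if (lead <= 0xf4) return 4;   /* 11110xxx, up to 0xf4 only    */
    return 0;                     /* 0xf5-0xff                    */
}

/* Hypothetical helper: step back from index pos while the byte there
   is a continuation byte (top bits "10"). Backs up at most 3 bytes,
   since no sequence is longer than 4, and never before index 0. */
size_t utf8FindStart (const uchar *buf, size_t pos) {
    size_t back = 0;
    while (pos > 0 && back < 3 && (buf[pos] & 0xc0) == 0x80) {
        pos--;
        back++;
    }
    return pos;
}
```

<p>For the bytes 41 e6 be b3 42, calling <code>utf8FindStart</code> with index 3 (the 0xb3 byte) gives index 1, the 0xe6 byte, and <code>utf8SeqLen (0xe6)</code> gives 3. If the byte at the returned index is still a continuation byte, the input was malformed and <code>utf8SeqLen</code> will return 0 for it.</p>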