<p>In <strong>UTF-8</strong>, every ASCII character (code points <code>0</code> to <code>127</code>) is represented by a single byte of the form <code>0xxxxxxx</code>, and <strong>code points</strong> above <code>127</code> are represented by <strong>multi-byte</strong> sequences. A multi-byte sequence consists of a <strong>leading byte</strong> followed by one or more <strong>continuation bytes</strong>.</p>
<p>The leading byte's <strong>high-order</strong> bits tell us how many continuation bytes follow. It starts with two or more 1s followed by a 0, i.e. the high bits can be <code>110</code>, <code>1110</code> or <code>11110</code>. (The original UTF-8 design also allowed <code>111110</code> for five-byte sequences, but UTF-8 as restricted by RFC 3629 stops at four bytes.) The number of high-order 1s equals the total number of bytes in the sequence, leading byte included, i.e.</p>
<pre><code>110   means 1 leading byte + 1 continuation byte
1110  means 1 leading byte + 2 continuation bytes
11110 means 1 leading byte + 3 continuation bytes
</code></pre>
<p>Continuation bytes, which follow a leading byte, have the format <code>10xxxxxx</code>.</p>
<p>Applying the above to your <code>$test</code> string:</p>
<p>It starts with three <code>'X'</code> bytes, which are all <strong>ASCII</strong> characters below <code>128</code>, so each of them counts as one character per byte.</p>
<p>Then comes <code>chr(241)</code>, whose binary representation is <code>11110001</code>, so it is a <strong>leading byte</strong>: it has two or more high-order 1 bits followed by a 0.</p>
<p>Since it has four high-order 1 bits, the code point it announces consists of 1 <strong>leading byte</strong> plus 3 <strong>continuation bytes</strong>. The three <code>'X'</code> bytes that remain in the string are therefore treated by <code>mb_strlen()</code> as continuation bytes*, and although together with <code>chr(241)</code> they make four bytes in total, they are counted as one UTF-8 code point.</p>
<p>*Note that those trailing <code>'X'</code> bytes are not valid continuation bytes, since they do not have the <code>10xxxxxx</code> form. However, <code>mb_strlen()</code> will still consume, as explained above, up to 3 more bytes after the <code>chr(241)</code>. You can test this by adding another <code>'X'</code> to, or removing <code>'X'</code>s from, the end of the <code>$test</code> string.</p>
<p><strong>UPDATE: Verifying the findings:</strong></p>
<pre><code>/*
 * The following strings are not valid UTF-8 encodings.
 * We test whether mb_strlen() consumes invalid UTF-8
 * byte strings as if they were valid (driven by the leading bytes).
 */

/*
 * 0xc0 as a leading byte should consume one continuation byte,
 * so the length reported should be 6
 */
$test = 'XXX' . chr(0xc0) . 'XXX';
echo '6 == ', mb_strlen($test, 'UTF8'), PHP_EOL;

/*
 * 0xe0 as a leading byte should consume two continuation bytes,
 * so the length reported should be 5
 */
$test = 'XXX' . chr(0xe0) . 'XXX';
echo '5 == ', mb_strlen($test, 'UTF8'), PHP_EOL;

// results in 6 == 6 and 5 == 5
</code></pre>
<p><strong>UPDATE 2</strong>:</p>
<p>An example of constructing the same symbol with <code>chr()</code> in a single-byte encoding and in UTF-8. Note that the byte <code>0x80</code> is the euro sign in Windows-1252, which is often loosely called "Latin-1"; in ISO-8859-1 proper, <code>0x80</code> is a control character.</p>
<pre><code>$euroSignAscii = chr(0x80); // Windows-1252 ("extended ASCII")
$euroSignUtf8 = chr(0xe2) . chr(0x82) . chr(0xac); // UTF-8
</code></pre>
<p>If you echo the above strings, mind the encoding of your console or web page: if it is Windows-1252, <code>$euroSignAscii</code> will display correctly; if it is UTF-8, <code>$euroSignUtf8</code> will display correctly.</p>
<hr>
<p><strong>Links:</strong></p>
<p>A good reference is the relevant <a href="http://en.wikipedia.org/wiki/UTF-8" rel="nofollow">UTF-8 article on Wikipedia</a>.</p>
<p>A classic post from Joel Spolsky: <a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a></p>
<p>And to get a feel for it: <a href="http://www.utf8-chartable.de/unicode-utf8-table.pl" rel="nofollow">UTF-8 encoding table and Unicode characters</a></p>
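<p>To make the leading-byte and continuation-byte rules above concrete, here is a minimal sketch that classifies a single byte by its high-order bits. The helper names <code>utf8SequenceLength()</code> and <code>isContinuationByte()</code> are hypothetical (they are not part of PHP or mbstring), and the sketch only inspects bit patterns; it does not claim to reproduce how <code>mb_strlen()</code> is implemented internally.</p>

```php
<?php
// Hypothetical helper, not part of PHP or mbstring: returns how many bytes a
// UTF-8 sequence claims to span, judging only by the bit pattern of its first
// byte. It does not check overlong forms or other validity rules.
function utf8SequenceLength(int $byte): int
{
    if ($byte < 0x80) {
        return 1; // 0xxxxxxx: single-byte ASCII
    }
    if (($byte & 0xE0) === 0xC0) {
        return 2; // 110xxxxx: leading byte + 1 continuation byte
    }
    if (($byte & 0xF0) === 0xE0) {
        return 3; // 1110xxxx: leading byte + 2 continuation bytes
    }
    if (($byte & 0xF8) === 0xF0) {
        return 4; // 11110xxx: leading byte + 3 continuation bytes
    }
    return 0; // 10xxxxxx or 11111xxx: cannot start a sequence
}

// Hypothetical helper: true when the byte has the 10xxxxxx form.
function isContinuationByte(int $byte): bool
{
    return ($byte & 0xC0) === 0x80;
}

// Print the bit pattern and classification of a few bytes used in this answer.
foreach ([ord('X'), 0xC0, 0xE0, 0xF1, 0x82] as $byte) {
    printf(
        "%08b: sequence length %d, continuation byte: %s\n",
        $byte,
        utf8SequenceLength($byte),
        isContinuationByte($byte) ? 'yes' : 'no'
    );
}
```

<p>With these helpers, <code>chr(241)</code> is reported as the start of a four-byte sequence, while <code>'X'</code> is reported as a one-byte character that is not a continuation byte, which is why the <code>$test</code> string is not valid UTF-8.</p>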