<p>You can use <strong>mb_convert_encoding()</strong>, or <strong>htmlspecialchars()</strong> with the <strong>ENT_SUBSTITUTE</strong> option (available since PHP 5.4). Of course, you can use <strong>preg_match()</strong> too. If you use intl, <strong><a href="https://wiki.php.net/rfc/uconverter" rel="noreferrer">UConverter</a></strong> is available since PHP 5.5.</p>

<p>The recommended substitute character for an invalid byte sequence is <strong>U+FFFD</strong>. See "<a href="http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences" rel="noreferrer">3.1.2 Substituting for Ill-Formed Subsequences</a>" in UTR #36: Unicode Security Considerations for details.</p>

<p>When using <strong>mb_convert_encoding()</strong>, you can specify the substitute character by passing a Unicode code point to <strong>mb_substitute_character()</strong> or by setting the <strong>mbstring.substitute_character</strong> directive. The default substitute character is ? (QUESTION MARK, U+003F).</p>

<pre><code>// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
</code></pre>

<p><strong>UConverter</strong> offers both procedural and object-oriented APIs.</p>

<pre><code>function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))-&gt;convert($str);
}
</code></pre>

<p>When using <strong>preg_match()</strong>, you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest form vulnerability. The range of valid trail bytes changes depending on the lead byte.</p>

<pre><code>lead byte:  0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80 (or 0x90 or 0xA0) - 0xBF (or 0x8F or 0x9F)
</code></pre>

<p>You can refer to the following resources to check the byte ranges.</p>

<ol>
<li>"<a href="http://tools.ietf.org/html/rfc3629#section-4" rel="noreferrer">Syntax of UTF-8 Byte Sequences</a>" in RFC 3629</li>
<li>"<a href="http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf" rel="noreferrer">Table 3-7. Well-Formed UTF-8 Byte Sequences</a>" in the Unicode Standard 6.1</li>
<li>"<a href="http://www.w3.org/International/questions/qa-forms-utf-8.en.php" rel="noreferrer">Multilingual form encoding</a>" in W3C Internationalization</li>
</ol>

<p>The byte range table is shown below.</p>

<pre><code>     Code Points        First Byte  Second Byte  Third Byte  Fourth Byte
  U+0000 -   U+007F     00 - 7F
  U+0080 -   U+07FF     C2 - DF     80 - BF
  U+0800 -   U+0FFF     E0          A0 - BF      80 - BF
  U+1000 -   U+CFFF     E1 - EC     80 - BF      80 - BF
  U+D000 -   U+D7FF     ED          80 - 9F      80 - BF
  U+E000 -   U+FFFF     EE - EF     80 - BF      80 - BF
 U+10000 -  U+3FFFF     F0          90 - BF      80 - BF     80 - BF
 U+40000 -  U+FFFFF     F1 - F3     80 - BF      80 - BF     80 - BF
U+100000 - U+10FFFF     F4          80 - 8F      80 - BF     80 - BF
</code></pre>

<p>How to replace invalid byte sequences without breaking valid characters is shown in "<a href="http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences" rel="noreferrer">3.1.1 Ill-Formed Subsequences</a>" in UTR #36: Unicode Security Considerations and in "<a href="http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf" rel="noreferrer">Table 3-8. Use of U+FFFD in UTF-8 Conversion</a>" in The Unicode Standard.</p>

<p>The Unicode Standard shows an example:</p>

<pre><code>before: &lt;61 F1 80 80 E1 80 C2 62 80 63 80 BF 64&gt;
after:  &lt;0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064&gt;
</code></pre>

<p>Here is an implementation using <strong>preg_replace_callback()</strong> that follows the above rule.</p>

<pre><code>function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF, U+E000 - U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF, U+E000 - U+FFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function($matches) use($substitute) {
        if (isset($matches[2]) || isset($matches[3])) {
            return $substitute;
        }

        return $matches[1];
    }, $str);

    return $ret;
}
</code></pre>

<p>By comparing bytes directly, you can avoid preg_match's restriction on input size.</p>

<pre><code>function replace_invalid_byte_sequence6($str)
{
    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    // Initialized here because they are passed by reference below.
    $pos = 0;
    $char = '';
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &amp;$pos, &amp;$char, &amp;$char_size, &amp;$valid)
{
    $valid = false;

    if ($str_size &lt;= $pos) {
        return false;
    }

    if ($str[$pos] &lt; "\x80") {
        // 1-byte character (ASCII)
        $valid = true;
        $char_size = 1;
    } else if ($str[$pos] &lt; "\xC2") {
        // stray trail byte or non-shortest form lead byte (0xC0, 0xC1)
        $char_size = 1;
    } else if ($str[$pos] &lt; "\xE0") {
        // 2-byte character
        if (!isset($str[$pos+1]) || $str[$pos+1] &lt; "\x80" || "\xBF" &lt; $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } else if ($str[$pos] &lt; "\xF0") {
        // 3-byte character; the second byte range depends on the lead byte
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] &lt; $left || $right &lt; $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] &lt; "\x80" || "\xBF" &lt; $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } else if ($str[$pos] &lt; "\xF5") {
        // 4-byte character; the second byte range depends on the lead byte
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] &lt; $left || $right &lt; $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] &lt; "\x80" || "\xBF" &lt; $str[$pos+2]) {
            $char_size = 2;
        } else if (!isset($str[$pos+3]) || $str[$pos+3] &lt; "\x80" || "\xBF" &lt; $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        // 0xF5 - 0xFF never appear in UTF-8
        $char_size = 1;
    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}
</code></pre>

<p>Here is the test case.</p>

<pre><code>function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
        return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence',
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));
</code></pre>

<p>Note that <strong>mb_convert_encoding</strong> has a bug: it either breaks a valid character that immediately follows an invalid byte sequence, or removes an invalid byte sequence that follows valid characters without adding <strong>U+FFFD</strong>. The following data can be used to check this.</p>

<pre><code>$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];
</code></pre>

<p>Although <strong>preg_match()</strong> can be used instead of <strong>preg_replace_callback()</strong>, it has a limit on input size. See bug report <a href="https://bugs.php.net/bug.php?id=36463" rel="noreferrer">#36463</a> for details. You can confirm it with the following test case.</p>

<pre><code>str_repeat('a', 10000)
</code></pre>

<p>Finally, here are my benchmark results (elapsed time in seconds for 10000 repetitions).</p>

<pre><code>mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727
</code></pre>

<p>Here is the benchmark code.</p>

<pre><code>function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;

    foreach ($callables as $key =&gt; $callable) {
        $start = microtime(true);

        do {
            array_map($callable, $arguments);
        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' =&gt; 'replace_invalid_byte_sequence',
    'htmlspecialchars()' =&gt; 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' =&gt; 'replace_invalid_byte_sequence3',
    'UConverter::convert()' =&gt; 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' =&gt; 'replace_invalid_byte_sequence5',
    'direct comparison' =&gt; 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description =&gt; $time) {
    echo $description, PHP_EOL, $time, PHP_EOL;
}
</code></pre>
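<p>As a rough sanity check for any of the replacement functions, two properties can be asserted: the output is well-formed UTF-8, and well-formed input comes back unchanged. The sketch below is only an illustration. <code>is_well_formed_utf8()</code> and <code>check_replacer()</code> are hypothetical helper names, not PHP built-ins, and the <code>htmlspecialchars()</code> variant is copied in so that the snippet runs on its own. The validity check relies on <code>preg_match()</code> returning <code>false</code> for a malformed subject when the <code>u</code> modifier is set.</p>

```php
<?php
// Copy of replace_invalid_byte_sequence2(), repeated so this sketch is self-contained.
function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

// Hypothetical helper: true if $str is well-formed UTF-8.
// With the u modifier, preg_match() returns false on a malformed subject.
function is_well_formed_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

// Hypothetical helper: checks two properties of a replacement function
// for each input and returns a report keyed like $inputs.
function check_replacer(callable $replacer, array $inputs)
{
    $report = [];

    foreach ($inputs as $key => $input) {
        $output = $replacer($input);
        $report[$key] = [
            // The result must never contain ill-formed sequences.
            'output_well_formed' => is_well_formed_utf8($output),
            // Well-formed input must pass through untouched.
            'valid_input_unchanged' => !is_well_formed_utf8($input) || $output === $input,
        ];
    }

    return $report;
}

$samples = [
    // Ill-formed: the "before" sequence of Table 3-8
    'table_3_8' => "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"."\x80"."\xBF"."\x64",
    // Well-formed: 'FULL MOON SYMBOL' (U+1F315)
    'full_moon' => "\xF0\x9F\x8C\x95",
    // Well-formed ASCII containing characters that htmlspecialchars() escapes
    'ascii_markup' => 'a < b && "c"',
];

var_dump(check_replacer('replace_invalid_byte_sequence2', $samples));
```

<p>This check does not verify how many U+FFFD characters are emitted per ill-formed subsequence. Comparing against the "after" row of Table 3-8 would be needed for that.</p>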
 
