StackOverflow2013

plurals

PO
primarykey
Id
13695364
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
1
CommunityOwnedDate
2012-12-05T12:29:50.303
CreationDate
2012-12-04T02:54:00.437
FavoriteCount
0
LastActivityDate
2013-09-10T05:06:52.797
LastEditDate
2013-09-10T05:06:52.797
LastEditorUserId
531320
OwnerUserId
531320
ParentId
8215050
PostTypeId
2
Score
30
ViewCount
0
LastEditorDisplayName
text
Body
You can use mb_convert_encoding() or, since PHP 5.4, htmlspecialchars() with the ENT_SUBSTITUTE option. Of course, you can use preg_match() too. If you use intl, you can use <a href="https://wiki.php.net/rfc/uconverter" rel="noreferrer">UConverter</a> since PHP 5.5.

The recommended substitute character for an invalid byte sequence is U+FFFD. See "<a href="http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences" rel="noreferrer">3.1.2 Substituting for Ill-Formed Subsequences</a>" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).

<pre><code>// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
</code></pre>

UConverter offers both a procedural and an object-oriented API.

<pre><code>function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
</code></pre>

When using preg_match(), you need to pay attention to the byte ranges to avoid the UTF-8 non-shortest-form vulnerability. The range of valid trail bytes changes depending on the lead byte.

<pre><code>lead byte:  0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80 (or 0x90 or 0xA0) - 0xBF (or 0x8F or 0x9F)
</code></pre>

You can refer to the following resources to check the byte ranges.
<ol>
<li>"<a href="http://tools.ietf.org/html/rfc3629#section-4" rel="noreferrer">Syntax of UTF-8 Byte Sequences</a>" in RFC 3629</li>
<li>"<a href="http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf" rel="noreferrer">Table 3-7. Well-Formed UTF-8 Byte Sequences</a>" in the Unicode Standard 6.1</li>
<li>"<a href="http://www.w3.org/International/questions/qa-forms-utf-8.en.php" rel="noreferrer">Multilingual form encoding</a>" in W3C Internationalization</li>
</ol>

The byte range table is shown below.

<pre><code>Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000 - U+007F      00 - 7F
U+0080 - U+07FF      C2 - DF     80 - BF
U+0800 - U+0FFF      E0          A0 - BF      80 - BF
U+1000 - U+CFFF      E1 - EC     80 - BF      80 - BF
U+D000 - U+D7FF      ED          80 - 9F      80 - BF
U+E000 - U+FFFF      EE - EF     80 - BF      80 - BF
U+10000 - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
U+40000 - U+FFFFF    F1 - F3     80 - BF      80 - BF     80 - BF
U+100000 - U+10FFFF  F4          80 - 8F      80 - BF     80 - BF
</code></pre>

How to replace an invalid byte sequence without breaking valid characters is described in "<a href="http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences" rel="noreferrer">3.1.1 Ill-Formed Subsequences</a>" in UTR #36: Unicode Security Considerations and in "<a href="http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf" rel="noreferrer">Table 3-8. Use of U+FFFD in UTF-8 Conversion</a>" in the Unicode Standard. The Unicode Standard gives this example:

<pre><code>before: <61 F1 80 80 E1 80 C2 62 80 63 80 BF 64>
after:  <0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064>
</code></pre>

Here is an implementation using preg_replace_callback() that follows the above rule.
<pre><code>function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF, U+E000 - U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF, U+E000 - U+FFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid (truncated) 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte
    $ret = preg_replace_callback($regex, function($matches) use($substitute) {
        if (isset($matches[2]) || isset($matches[3])) {
            return $substitute;
        }
        return $matches[1];
    }, $str);

    return $ret;
}
</code></pre>

You can also compare bytes directly, which avoids preg_match()'s limit on input size.

<pre><code>function replace_invalid_byte_sequence6($str)
{
    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';
    $pos = 0;
    // Filled in by reference on each call to utf8_get_next_char().
    $char = '';
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {
        // ASCII
        $valid = true;
        $char_size = 1;
    } else if ($str[$pos] < "\xC2") {
        // Stray trail byte, or C0/C1 (non-shortest form)
        $char_size = 1;
    } else if ($str[$pos] < "\xE0") {
        // 2-byte sequence
        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } else if ($str[$pos] < "\xF0") {
        // 3-byte sequence; E0 and ED restrict the second byte
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } else if ($str[$pos] < "\xF5") {
        // 4-byte sequence; F0 and F4 restrict the second byte
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        // F5 - FF never appear in UTF-8
        $char_size = 1;
    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}
</code></pre>

Here is the test case.

<pre><code>function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
        return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence',
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));
</code></pre>

Note that mb_convert_encoding() has a bug: it breaks a valid character that comes just after an invalid byte sequence, or removes an invalid byte sequence that comes after valid characters without adding U+FFFD.

<pre><code>$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];
</code></pre>

Although preg_match() can be used instead of preg_replace_callback(), it has a limit on input size. See bug report <a href="https://bugs.php.net/bug.php?id=36463" rel="noreferrer">#36463</a> for details. You can confirm it with the following test case.

<pre><code>str_repeat('a', 10000)
</code></pre>

Finally, here are the results of my benchmark (elapsed time as measured with microtime(true)).

<pre><code>mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparision
0.71933102607727
</code></pre>

Here is the benchmark code.
<pre><code>function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {
        $start = microtime(true);

        do {
            array_map($callable, $arguments);
        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparision' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {
    echo $description, PHP_EOL, $time, PHP_EOL;
}
</code></pre>
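When comparing the output of the functions above against the "after" row of Table 3-8, it can help to print the result as code points instead of raw bytes. The following is a minimal sketch, not part of the original answer: the helper name utf8_to_code_points() is hypothetical, it assumes the mbstring extension and PHP 7 or later, and it expects input that is already valid UTF-8 (for example the return value of one of the replace functions).

```php
<?php
// Hypothetical helper for illustration only: renders a valid UTF-8 string
// as space-separated hexadecimal code points, e.g. "0061 FFFD".
function utf8_to_code_points(string $str): string
{
    // UTF-32BE stores every code point as one 4-byte big-endian integer.
    $utf32 = mb_convert_encoding($str, 'UTF-32BE', 'UTF-8');

    // 'N*' unpacks consecutive unsigned 32-bit big-endian integers.
    $points = unpack('N*', $utf32) ?: [];

    return implode(' ', array_map(function ($cp) {
        return sprintf('%04X', $cp);
    }, $points));
}

// The expected "after" sequence from Table 3-8, written as UTF-8 bytes.
$fffd = "\xEF\xBF\xBD";
$expected = 'a' . $fffd . $fffd . $fffd . 'b' . $fffd . 'c' . $fffd . $fffd . 'd';

echo utf8_to_code_points($expected), PHP_EOL;
// prints: 0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064
```

Passing the return value of, say, replace_invalid_byte_sequence6() for the first test string through this helper should give the same line if the function follows the rule in Table 3-8.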
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POReplacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USmasakielastic
UserOwnerUserId
1. USmasakielastic
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POReplacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
