<p>Add the <code>u</code> flag to your regex. This makes the regex engine treat the pattern and the input string as UTF-8.</p>
<pre><code>$keywords = preg_replace("@[ 　]@u", ' ', urldecode($keywords)); // outputs :'ラメ単色'
</code></pre>
<p><a href="http://codepad.viper-7.com/0kmBb4" rel="nofollow">CodePad</a>.</p>
<p>The string gets mangled because, without that flag, the regex engine does not treat the characters in your character class, <code>20</code> (SPACE) and <code>e3 80 80</code> (IDEOGRAPHIC SPACE), as two characters, but as the separate bytes <code>20</code>, <code>e3</code> and <code>80</code>.</p>
<p>The byte sequence of the string being scanned is <code>e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2</code>. We know the first character is an IDEOGRAPHIC SPACE, but because PHP treats the string as a sequence of bytes, it replaces each of the first four bytes individually, since each one matches a byte in the character class.</p>
<p>As for the mangling that produces the � (REPLACEMENT CHARACTER): the byte <code>e3</code> also appears further along in the string. It is the lead byte of three-byte Japanese characters such as <code>e3 83 a9</code> (KATAKANA LETTER RA). When that leading <code>e3</code> is replaced with <code>20</code> (SPACE), the bytes that follow no longer form a valid UTF-8 sequence.</p>
<p>When you enable the <code>u</code> flag, the regex engine treats the string as UTF-8 and matches the characters in your character class as whole characters instead of on a per-byte basis.</p>
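<p>To see the difference at the byte level, here is a minimal sketch that runs the same character class with and without the <code>u</code> flag and prints the results as hex. The variable names and the <code>isValidUtf8()</code> helper are illustrative assumptions, not part of the original question; the subject string is the byte sequence quoted above, written with explicit <code>\x</code> escapes so the bytes are unambiguous.</p>
<pre><code>&lt;?php
// IDEOGRAPHIC SPACE (U+3000) followed by "ラメ単色", as raw UTF-8 bytes.
$subject = "\xE3\x80\x80\xE3\x83\xA9\xE3\x83\xA1\xE5\x8D\x98\xE8\x89\xB2";

// Character class holding an ASCII space and an IDEOGRAPHIC SPACE.
$class = "[ \xE3\x80\x80]";

// Hypothetical helper: an empty pattern with the u flag only matches
// when the subject is valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Without u, the class is the byte set {20, e3, 80}.
$bytewise = preg_replace('@' . $class . '@', ' ', $subject);

// With u, the class is the code point set {U+0020, U+3000}.
$utf8 = preg_replace('@' . $class . '@u', ' ', $subject);

echo bin2hex($bytewise), "\n"; // 2020202083a92083a1e58d98e889b2
echo bin2hex($utf8), "\n";     // 20e383a9e383a1e58d98e889b2

var_dump(isValidUtf8($bytewise)); // bool(false)
var_dump(isValidUtf8($utf8));     // bool(true)
</code></pre>
<p>In the first result every <code>e3</code> and <code>80</code> byte has become <code>20</code>, which leaves the continuation bytes <code>83 a9</code> and <code>83 a1</code> without a lead byte. In the second result only the leading IDEOGRAPHIC SPACE was replaced.</p>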
 
