<p>When working with UTF-8 data, always use the <a href="http://php.net/reference.pcre.pattern.modifiers"><em>u</em> modifier</a> in your patterns:</p>
<pre><code>/\s/u </code></pre>
<p>Otherwise neither the pattern nor the subject is interpreted as UTF-8.</p>
<p>In this case the character <code>נ</code> (U+05E0) is encoded in UTF-8 as 0xD7A0, that is, the two bytes 0xD7 and 0xA0. And <code>\s</code> represents any whitespace character (according to <a href="http://pcre.org/pcre.txt">PCRE</a>):</p>
<blockquote> <p>The <code>\s</code> characters are HT (9), LF (10), FF (12), CR (13), and space (32).</p> </blockquote>
<p>When UTF-8 support was added, a special option called PCRE_UCP was added as well, so that <code>\b</code>, <code>\d</code>, <code>\s</code>, and <code>\w</code> match not just US-ASCII characters but also other Unicode characters according to their Unicode properties:</p>
<blockquote> <p>By default, in UTF-8 mode, characters with values greater than 128 never match <code>\d</code>, <code>\s</code>, or <code>\w</code>, and always match <code>\D</code>, <code>\S</code>, and <code>\W</code>.
[…] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:</p> <ul> <li><code>\d</code> any character that <code>\p{Nd}</code> matches (decimal digit)</li> <li><code>\s</code> any character that <code>\p{Z}</code> matches, plus HT, LF, FF, CR</li> <li><code>\w</code> any character that <code>\p{L}</code> or <code>\p{N}</code> matches, plus underscore</li> </ul> </blockquote>
<p>The non-breaking space U+00A0 has the separator property (<code>\p{Z}</code>).</p>
<p>So although your pattern is not in UTF-8 mode, it seems that <code>\s</code> <em>does</em> match the byte 0xA0 in the UTF-8 code word 0xD7A0, splitting the string at that position and returning an array equivalent to <code>array("\xD7", "")</code>.</p>
<p>That looks like a bug: the pattern is <em>not</em> in UTF-8 mode, yet the byte 0xA0, which <em>is</em> greater than 0x80, is matched by <code>\s</code>. (In UTF-8, U+00A0 would in any case be encoded as 0xC2A0, not as a lone 0xA0.) The <a href="http://bugs.php.net/bug.php?id=52971">bug #52971 <em>PCRE-Meta-Characters not working with utf-8</em></a> may be related to this.</p>
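<p>The difference can be checked with a small script. This is only a minimal sketch: the helper <code>splitOnWhitespace()</code> is a hypothetical name introduced here for illustration, not part of PHP or of the original question. The result without the <em>u</em> modifier depends on the PCRE build and locale, so no output is claimed for that case.</p>

```php
<?php
// Hypothetical helper (an assumption for illustration): split a string on \s,
// with or without the u modifier.
function splitOnWhitespace(string $subject, bool $utf8Mode): array
{
    $pattern = $utf8Mode ? '/\s/u' : '/\s/';
    $parts = preg_split($pattern, $subject);
    // preg_split() returns false on failure, e.g. invalid UTF-8 with /u.
    return $parts === false ? [] : $parts;
}

$nun = "\xD7\xA0"; // UTF-8 bytes of נ (U+05E0)

// With /u the two bytes are read as a single letter, so nothing is split off.
$parts = splitOnWhitespace($nun, true);
echo count($parts), ' part(s): ', bin2hex($parts[0]), "\n"; // 1 part(s): d7a0

// Without /u the subject is processed byte by byte. Whether the byte 0xA0 is
// treated as whitespace depends on the PCRE build and locale.
foreach (splitOnWhitespace($nun, false) as $part) {
    echo bin2hex($part), "\n";
}
```

<p>If the second loop prints <code>d7</code> followed by an empty line, the string was split at the byte 0xA0, which is the behaviour described above.</p>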
 
