
PHP cannot find a way to split UTF-8 strings
<p>I just started dabbling in PHP and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.</p>
<p>I'm working on Ubuntu 11.10 x86, PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim <code>:set encoding</code> confirms this), which I then read using</p>
<pre><code>$file = fopen("file.txt", "r");
while (!feof($file)) {
    $line = fgets($file);
    //...
}
fclose($file);
</code></pre>
<ul>
<li>Using <code>mb_detect_encoding($line)</code> reports <code>UTF-8</code>.</li>
<li>If I do <code>echo $line</code> I can see the line properly (no mangled characters) in the browser.
<ul>
<li>So I guess everything is fine with the browser and Apache. Still, I did search my Apache configuration for <a href="http://httpd.apache.org/docs/2.0/mod/core.html#AddDefaultCharset" rel="nofollow">AddDefaultCharset</a> and tried adding HTTP meta tags for character encoding (just in case).</li>
</ul></li>
</ul>
<p>When I try to split the string using <code>$arr = mb_split(';',$line)</code>, the fields of the resulting array contain mangled UTF-8 characters (<code>mb_detect_encoding($arr[0])</code> reports UTF-8 as well).</p>
<p>So <code>echo $arr[0]</code> will result in something like this: <code>ΑΘΗÎÎ</code>.</p>
<p>I have tried setting <code>mb_detect_order('utf-8')</code> and <code>mb_internal_encoding('utf-8')</code>, but nothing changed. I also tried to detect UTF-8 manually using <a href="http://www.w3.org/International/questions/qa-forms-utf-8.en.php" rel="nofollow">this w3 perl regex</a>, because I read somewhere that <code>mb_detect_encoding</code> can sometimes fail (myth?), but the results were the same.</p>
<p>So my question is: how can I properly split the string? Is going down the <code>mb_</code> path the wrong way? What am I missing?</p>
<p>Thank you for your help!</p>
<p><strong>UPDATE</strong>: I'm adding sample strings and their base64 equivalents (thanks to @chris for his suggestion).</p>
<pre><code>1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
</code></pre>
<p>OK, so after doing this there is a <code>77u/</code> difference between 3. and 5., which <a href="http://www.roundcubeforum.net/5-release-support/17-pending-issues/8258-strange-base64-utf-8-behaviour-umlaute.html#post34096" rel="nofollow">according to this</a> is a UTF-8 BOM (<code>77u/</code> is the base64 encoding of the bytes <code>EF BB BF</code>). So how can I avoid it?</p>
<p><strong>UPDATE 2</strong>: I woke up refreshed today and, with your tips in mind, tried again. It seems that <code>$line=fgets($file)</code> reads the first line correctly (no mangled chars) and fails for each subsequent line. So I <code>base64_encode</code>d the first and second lines, and the <code>77u/</code> BOM appeared in the base64'd string <em>of the first line only</em>. I then opened the offending file in vim and entered <code>:set nobomb</code> and <code>:w</code> to save the file without the BOM. Firing up PHP again showed that the first line was now mangled as well. Based on @hakre's <code>remove_utf8_bom</code>, I added its complementary function</p>
<pre><code>function add_utf8_bom($str){
    $bom = "\xEF\xBB\xBF";
    return substr($str, 0, 3) === $bom ? $str : $bom . $str;
}
</code></pre>
<p>and <em>voila</em>, each line is read correctly now.</p>
<p>I do not much like this solution, as it seems very hackish (I can't believe that an entire framework/language does not provide a way to deal with nobombed strings). So do you know of an alternate approach? Otherwise I'll proceed with the above.</p>
<p>Thanks to @chris, @hakre and @jacob for their time!</p>
<p><strong>UPDATE 3 (solution)</strong>: It turns out that it was a browser thing after all: it was not enough to add <code>header('Content-type: text/html; charset=UTF-8')</code> and meta tags like <code>&lt;meta http-equiv="Content-type" content="text/html; charset=UTF-8" /&gt;</code>. The output also had to be properly enclosed inside an <code>&lt;html&gt;&lt;body&gt;</code> section, or the browser would not understand the encoding correctly. Thanks to @jake for his suggestion.</p>
<p>Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.</p>
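<p>For readers who only need the fields themselves, here is a minimal sketch of splitting such a line at the byte level. It relies on two facts: <code>;</code> is a single ASCII byte that never occurs inside a multi-byte UTF-8 sequence, so plain <code>explode</code> cannot cut a character in half, and a UTF-8 BOM is the three bytes <code>EF BB BF</code> at the very start of the data. The helper names <code>strip_utf8_bom</code> and <code>split_fields</code> are hypothetical (the <code>remove_utf8_bom</code> mentioned above is not shown in the post, so this is a stand-in, not that function). The input is the sample line from the UPDATE, with the BOM prepended, given as base64 to keep the snippet ASCII-only.</p>

```php
<?php
// Hypothetical helper (not from the original post): drop a leading UTF-8 BOM.
function strip_utf8_bom($s)
{
    $bom = "\xEF\xBB\xBF";
    // The cast covers older PHP versions, where substr() returns false
    // when the string consists of the BOM and nothing else.
    return strncmp($s, $bom, 3) === 0 ? (string) substr($s, 3) : $s;
}

// Hypothetical helper: split one semicolon-separated line into fields.
function split_fields($line)
{
    // Remove the BOM (if any) and the trailing newline left by fgets().
    $line = rtrim(strip_utf8_bom($line), "\r\n");
    // ';' is ASCII, so a byte-wise split is safe on UTF-8 input.
    return explode(';', $line);
}

// "77u/" is the base64 form of the BOM; the rest is the sample line.
$line = base64_decode(
    '77u/' . 'zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5'
) . "\n";

$fields = split_fields($line);
```

<p>With this input, <code>base64_encode($fields[0])</code> matches string 3 from the UPDATE rather than string 5. Whether the fields then render correctly in a browser still depends on the page declaring its charset, as UPDATE 3 describes.</p>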
 
