<p>The difficulty with UTF-8 is that it is a multibyte encoding, so it needs a way to indicate when a character is formed by more than one byte (two, three, or four). It does this by reserving certain bit patterns to mark multibyte sequences. The encoding follows a few basic rules:</p>
<ul>
<li>One-byte characters have the most significant bit clear (codes compatible with 7-bit ASCII): 0xxxxxxx</li>
<li>Two-byte characters are represented by the sequence: 110xxxxx 10xxxxxx</li>
<li>Three bytes: 1110xxxx 10xxxxxx 10xxxxxx</li>
<li>Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</li>
</ul>
<p>Your problem is that you may be reading a character string <strong>supposedly</strong> encoded as UTF-8 (as the XML encoding declaration states) whose bytes are not really UTF-8. It is a common mistake to declare something as UTF-8 but write the text in a different encoding such as Cp1252. Your XML parser tries to interpret the bytes as UTF-8 characters but finds something that does not fit the encoding rules (an illegal character). For example, a byte with its two most significant bits set that is not followed by a continuation byte causes an illegal encoding error: 110xxxxx must always be followed by 10xxxxxx (a following byte of the form 01xxxxxx, 11xxxxxx or 00xxxxxx would be illegal).</p>
<p>This problem does not arise with fixed-length, single-byte encodings. For example, if your XML declaration states that the file uses Windows-1252 but the file actually uses a different ANSI code page, your only problem will be that non-ASCII characters (values &gt; 127) render incorrectly.</p>
<hr>
<p>The solution:</p>
<ol>
<li><em>Try to detect the encoding by other means</em>.
<ul>
<li>If you will always be reading data from the same source, you could sample some files and open them in an advanced text editor that tries to infer the <strong>actual</strong> encoding of the file (e.g. <a href="http://notepad-plus.sourceforge.net/es/site.htm" rel="nofollow noreferrer">notepad++</a>, <a href="http://www.jedit.org/" rel="nofollow noreferrer">jEdit</a>, etc.).</li>
<li>Do it programmatically: preprocess the raw bytes before doing any actual XML processing.</li>
</ul></li>
<li><em>Force the <strong>actual</strong> encoding at the XML processor</em></li>
</ol>
<p>Alternatively, if you do not care about non-ASCII characters (that is, you can tolerate strange symbols appearing now and then), you could go directly to step 2 and <strong>force XML processing to use any ASCII-compatible 8-bit fixed-length encoding</strong> (ANSI, any Windows-XXXX code page, Mac-Roman, etc.). With your present code you could just try:</p>
<pre><code>XmlPullParser parser = Xml.newPullParser();
// Decode as ISO-8859-1 regardless of the XML declaration (assumes url is a java.net.URL)
parser.setInput(url.openStream(), "ISO-8859-1");
</code></pre>
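<p>As a rough illustration of the "do it programmatically" option, the sketch below checks whether a byte array is well-formed UTF-8 by running it through a strict decoder, and falls back to a single-byte encoding otherwise. The class and method names (EncodingSniffer, isValidUtf8, guessEncoding) are hypothetical and not part of any library; only the java.nio.charset classes are standard. It assumes you can buffer the raw bytes before handing them to the parser.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from any library.
final class EncodingSniffer {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] raw) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. 110xxxxx not followed by 10xxxxxx
        }
    }

    /** Picks "UTF-8" when the bytes are valid UTF-8, otherwise the given fallback name. */
    static String guessEncoding(byte[] raw, String fallback) {
        return isValidUtf8(raw) ? "UTF-8" : fallback;
    }

    public static void main(String[] args) {
        // "caf\u00e9": the last letter is C3 A9 in UTF-8 but a lone E9 in ISO-8859-1.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessEncoding(utf8, "ISO-8859-1"));   // prints UTF-8
        System.out.println(guessEncoding(latin1, "ISO-8859-1")); // prints ISO-8859-1
    }
}
```

<p>The returned name could then be passed as the second argument of parser.setInput. Note that this is only a heuristic: pure ASCII input always passes the check, and text in another encoding can occasionally happen to be valid UTF-8, so sampling real files from your source is still worthwhile.</p>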
 
