Ignore encoding errors in Python (iterparse)?
<p>I've been fighting with this for an hour now. I'm parsing an XML string with <code>iterparse</code>. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding.</p>
<p>Here's the error I get:</p>
<pre><code>lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x76 0x65 0x73
</code></pre>
<p>How can I simply ignore this error and continue parsing? I don't mind if one character is not saved properly; I just need the data.</p>
<p>Here's what I've tried, all picked up from the internet:</p>
<pre><code>data = data.encode('UTF-8','ignore')
data = unicode(data,errors='ignore')
data = unicode(data.strip(codecs.BOM_UTF8), 'utf-8', errors='ignore')
</code></pre>
<p><strong>Edit:</strong><br> I can't show the URL, as it's a private API and involves my API key, but this is how I obtain the data:</p>
<pre><code>ur = urlopen(url)
data = ur.read()
</code></pre>
<p>The character that causes the problem is <code>å</code>; I guess that <code>ä</code>, <code>ö</code>, etc. would also break it.</p>
<p>Here's the part where I try to parse it:</p>
<pre><code>def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    print elem.xpath('title/text( )')

context = etree.iterparse(StringIO(data), tag='item')
fast_iter(context, process_element)
</code></pre>
<p><strong>Edit 2:</strong><br> <a href="http://img.qo.fi/zlmb35.png" rel="nofollow">This</a> is what happens when I try to parse it in PHP. Just to clarify, F***ing Åmål is a <a href="http://www.imdb.com/title/tt0150662/" rel="nofollow">drama movie</a> =D</p>
<p>The file starts with <code>&lt;?xml version="1.0" encoding="UTF-8" ?&gt;</code></p>
<p>Here's what I get from <code>print repr(data[offset-10:offset+60])</code>:</p>
<pre><code>ence des r\xeaves, La&lt;/title&gt;\n\t\t&lt;year&gt;2006&lt;/year&gt;\n\t\t&lt;imdb&gt;0354899&lt;/imdb&gt;\n </code></pre>
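The situation described above (a document declared as UTF-8 that contains a stray Latin-1 byte such as 0xEA) can be reproduced and worked around with a small sketch. This is only one possible approach, written for Python 3 with the standard-library `xml.etree.ElementTree` instead of lxml so that it runs on its own. The helper names `clean_utf8` and `iter_titles`, the `<items>` root element, and the sample bytes are assumptions made for illustration; only the title fragment and the 0xEA byte come from the question. The idea is to decode the raw bytes with `errors="ignore"`, which drops the bytes that are not valid UTF-8, re-encode the result, and only then hand it to the parser.

```python
import io
import xml.etree.ElementTree as ET


def clean_utf8(raw: bytes) -> bytes:
    """Return raw with every byte sequence that is not valid UTF-8 dropped."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


def iter_titles(raw: bytes):
    """Yield the <title> text of each <item>, clearing elements as we go."""
    source = io.BytesIO(clean_utf8(raw))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title")
            elem.clear()  # free the finished element's children and text


# Hypothetical sample: \xea is Latin-1 for "ê" and is not valid UTF-8 here.
sample = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b"<items><item><title>ence des r\xeaves, La</title>"
    b"<year>2006</year></item></items>"
)

titles = list(iter_titles(sample))
print(titles)
```

The offending character is lost rather than repaired, which matches the stated requirement that one badly saved character is acceptable. Passing `errors="replace"` instead would keep a visible U+FFFD marker at each position where bytes were invalid. The cleaned bytes could presumably be fed to lxml's `iterparse` in the same way as in the question's `fast_iter` code.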