<p>You say:</p>
<blockquote>
<p>The character that causes the problem is: å, </p>
</blockquote>
<p>How do you know that? What are you viewing your text with?</p>
<p>So you can't publish the URL and your API key; what about reading the data, writing it to a file (in binary mode), and publishing that?</p>
<p>When you open that file in your web browser, what encoding does it detect?</p>
<p>At the very least, do this</p>
<pre><code>data.decode('utf8') # where data is what you get from ur.read()
</code></pre>
<p>This will produce an exception that will tell you the byte offset of the non-UTF-8 stuff.</p>
<p>Then do this:</p>
<p><code>print repr(data[offset-10:offset+60])</code></p>
<p>and show us the results.</p>
<p>Assuming the encoding is actually <code>cp1252</code> and decoding the bytes in the lxml error message:</p>
<pre><code>&gt;&gt;&gt; guff = "\xEA\x76\x65\x73"
&gt;&gt;&gt; from unicodedata import name
&gt;&gt;&gt; [name(c) for c in guff.decode('1252')]
['LATIN SMALL LETTER E WITH CIRCUMFLEX', 'LATIN SMALL LETTER V', 'LATIN SMALL LETTER E', 'LATIN SMALL LETTER S']
&gt;&gt;&gt;
</code></pre>
<p>So are you seeing e-circumflex followed by <code>ves</code>, or a-ring followed by <code>ves</code>, or a-ring followed by something else?</p>
<p>Does the data start with an XML declaration like <code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</code>? If not, what does it start with?</p>
<p>Clues for encoding guessing/confirmation: What language is the text written in? What country?</p>
<p><strong>UPDATE</strong> based on further information supplied.</p>
<p>Based on the snippet that you showed in the vicinity of the error, the movie title is "La science des rêves" (the science of dreams).</p>
<p>Funny how PHP gags on "F***ing Åmål" but Python chokes on French dreams. Are you sure that you did the same query?</p>
<p>You should have told us it was IMDB up front, you would have got your answer much sooner.</p>
<p><strong>SOLUTION</strong> before you pass <code>data</code> to the <code>lxml</code> parser, do this:</p>
<pre><code>data = data.replace('encoding="UTF-8"', 'encoding="iso-8859-1"')
</code></pre>
<p>That's based on the encoding that they declare on their website, but that may be a lie too. In that case, try <code>cp1252</code> instead. It's definitely <strong>not iso-8859-2</strong>.</p>
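<p>For anyone on Python 3, the diagnosis and the workaround above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names <code>find_bad_utf8</code> and <code>fix_declared_encoding</code> are hypothetical, and the <code>sample</code> bytes are made up from the movie title mentioned above rather than taken from the real response. It uses only the standard library; the result of <code>fix_declared_encoding</code> is what you would hand to the parser.</p>

```python
# Python 3 sketch. Helper names and the sample payload are hypothetical,
# invented for illustration; they are not part of the original question.


def find_bad_utf8(data: bytes, before: int = 10, after: int = 60):
    """Return (offset, context) for the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        offset = exc.start  # index of the first offending byte
        context = data[max(0, offset - before):offset + after]
        return offset, context
    return None


def fix_declared_encoding(data: bytes, actual: str = "iso-8859-1") -> bytes:
    """Replace a false UTF-8 declaration with the assumed real encoding."""
    declared = b'encoding="UTF-8"'
    replacement = b'encoding="' + actual.encode("ascii") + b'"'
    return data.replace(declared, replacement, 1)


# Made-up payload: declares UTF-8 but contains a Latin-1 e-circumflex (0xEA).
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<title>La science des r\xeaves</title>"
)

problem = find_bad_utf8(sample)
if problem is not None:
    offset, context = problem
    print("first bad byte at offset", offset)
    print(repr(context))
    fixed = fix_declared_encoding(sample)
    # Decoding with the assumed encoding should now give readable text.
    print(fixed.decode("iso-8859-1"))
```

<p>If <code>iso-8859-1</code> turns out to be wrong as well, passing <code>actual="cp1252"</code> is the next thing to try, as suggested above.</p>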