Scrapy spider: dealing with pages that have incorrectly-defined character encoding
<p><strong>Update:</strong> this error can be reproduced simply by running this from the command line:</p>
<pre><code>scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future
</code></pre>
<hr>
<p>I'm using Scrapy to crawl a website. Every page I scrape claims to be encoded UTF-8:</p>
<pre><code>&lt;meta content="text/html; charset=utf-8" http-equiv="Content-Type"&gt;
</code></pre>
<p>But occasionally, the pages contain bytes that fall outside of UTF-8, and I get Scrapy errors like:</p>
<pre><code>exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
</code></pre>
<p>I still need to scrape these pages, even though they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding, and use another (say, UTF-16) instead?</p>
<p>Here's where the exception is being caught:</p>
<pre><code>2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing &lt;GET http://www.site.com/page&gt;
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
    self._startRunCallbacks(result)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
    self._runCallbacks()
--- &lt;exception caught here&gt; ---
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
    result = method(response=response, result=result, spider=spider)
</code></pre>
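<p>To make the failure concrete, here is a minimal standard-library sketch of the kind of fallback decoding being asked about. It is not Scrapy's own mechanism: the helper name <code>decode_with_fallback</code>, the sample bytes, and the choice of cp1252 and latin-1 as fallback encodings are all assumptions made for illustration. In a spider, the raw bytes would presumably come from <code>response.body</code>.</p>

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    body: bytes,
    declared: str = "utf-8",
    fallbacks: Iterable[str] = ("cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for a page body.

    Hypothetical helper: try the encoding the page declares first, then
    each fallback in order. The fallback list is an assumption; pick the
    encodings that the site you scrape is likely to really use.
    """
    for encoding in (declared, *fallbacks):
        try:
            return body.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # Last resort: keep whatever decodes and mark the rest with U+FFFD.
    return body.decode(declared, errors="replace"), declared


# Sample bytes (made up): 0xe8 followed by an ASCII letter is not valid
# UTF-8, which triggers the same "invalid continuation byte" error.
page = b"<p>Cr\xe8me br\xfbl\xe9e</p>"
text, used = decode_with_fallback(page)
print(used, text)
```

<p>The design choice here is to trust the declared encoding when it works and only guess when it fails. Because latin-1 maps every byte to a character, it never raises, so with the default fallbacks the last-resort branch is only reached when a custom fallback list is passed.</p>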