
How to connect to https site with Scrapy via Polipo over TOR?
<p>Not entirely sure what the problem is here.</p>

<p>Running Python 2.7.3 and Scrapy 0.16.5.</p>

<p>I've created a very simple Scrapy spider to test connecting to my local Polipo proxy so I can send requests out via TOR. The basic code of my spider is as follows:</p>

<pre><code>from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = ["https://check.torproject.org"]

    def parse(self, response):
        print response.body
</code></pre>

<p>For my proxy middleware, I've defined:</p>

<pre><code>from scrapy.conf import settings

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
</code></pre>

<p>HTTP_PROXY in my settings file is defined as <code>HTTP_PROXY = 'http://localhost:8123'</code>.</p>

<p>Now, if I change my start URL to <a href="http://check.torproject.org">http://check.torproject.org</a>, everything works fine, no problems.</p>

<p>If I attempt to run against <a href="https://check.torproject.org">https://check.torproject.org</a>, I get a 400 Bad Request error every time (I've also tried other https:// sites, and all of them have the same problem):</p>

<pre><code>2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying &lt;GET https://check.torproject.org&gt; (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying &lt;GET https://check.torproject.org&gt; (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying &lt;GET https://check.torproject.org&gt; (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) &lt;GET https://check.torproject.org&gt; (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
</code></pre>

<p>And just to double check that it isn't something wrong with my TOR/Polipo setup, I'm able to run the following curl command in a terminal and connect fine: <code>curl --proxy localhost:8123 https://check.torproject.org/</code></p>

<p>Any suggestions as to what's wrong here?</p>
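<p>For context, a custom downloader middleware like the one above only runs if it is registered in the project's <code>settings.py</code>. A minimal sketch of such a settings file, assuming the middleware lives in a hypothetical module <code>arachnid.middlewares</code> (the module path is an assumption; the log above only shows the bot name <code>arachnid</code>):</p>

```python
# settings.py -- minimal sketch; 'arachnid.middlewares' is a hypothetical
# module path, substitute the real location of ProxyMiddleware.
HTTP_PROXY = 'http://localhost:8123'  # local Polipo, forwarding to TOR

DOWNLOADER_MIDDLEWARES = {
    # Priority 400 runs the custom middleware early in the download chain,
    # so request.meta['proxy'] is set before the request is downloaded.
    'arachnid.middlewares.ProxyMiddleware': 400,
}
```

<p>The log above confirms <code>ProxyMiddleware</code> is in the enabled downloader middlewares, so a registration along these lines is already in place.</p>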
 
