
How to connect to https site with Scrapy via Polipo over TOR?
<p>Not entirely sure what the problem is here.</p>

<p>Running Python 2.7.3 and Scrapy 0.16.5.</p>

<p>I've created a very simple Scrapy spider to test connecting to my local Polipo proxy so I can send requests out via TOR. The basic code of my spider is as follows:</p>

<pre><code>from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = ["https://check.torproject.org"]

    def parse(self, response):
        print response.body
</code></pre>

<p>For my proxy middleware, I've defined:</p>

<pre><code>from scrapy.conf import settings

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
</code></pre>

<p>HTTP_PROXY in my settings file is defined as <code>HTTP_PROXY = 'http://localhost:8123'</code>.</p>

<p>Now, if I change my start URL to <a href="http://check.torproject.org">http://check.torproject.org</a>, everything works fine, no problems.</p>

<p>If I attempt to run against <a href="https://check.torproject.org">https://check.torproject.org</a>, I get a 400 Bad Request error every time (I've also tried other https:// sites, and all of them have the same problem):</p>

<pre><code>2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying &lt;GET https://check.torproject.org&gt; (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying &lt;GET https://check.torproject.org&gt; (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying &lt;GET https://check.torproject.org&gt; (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) &lt;GET https://check.torproject.org&gt; (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
</code></pre>

<p>And just to double check that it isn't something wrong with my TOR/Polipo setup, I'm able to run the following curl command in a terminal and connect fine: <code>curl --proxy localhost:8123 https://check.torproject.org/</code></p>

<p>Any suggestions as to what's wrong here?</p>
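<p>For context, a custom downloader middleware like the one above only runs if it is registered in the project's <code>settings.py</code>. A minimal sketch of such a settings file, assuming the middleware lives in a hypothetical module <code>arachnid.middlewares</code> (the module path is an assumption; the log above only shows the bot name <code>arachnid</code>):</p>

```python
# settings.py -- minimal sketch; 'arachnid.middlewares' is a hypothetical
# module path, substitute the real location of ProxyMiddleware.
HTTP_PROXY = 'http://localhost:8123'  # local Polipo, forwarding to TOR

DOWNLOADER_MIDDLEWARES = {
    # Priority 400 runs the custom middleware early in the download chain,
    # so request.meta['proxy'] is set before the request is downloaded.
    'arachnid.middlewares.ProxyMiddleware': 400,
}
```

<p>The log above confirms <code>ProxyMiddleware</code> is in the enabled downloader middlewares, so a registration along these lines is already in place.</p>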
 
