
Not able to follow links using Scrapy

I've created a spider that extends `CrawlSpider` and followed the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html

The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links that it contains.

So I've defined a rule like `rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True)]`, but nothing happens.

Then I've tried to define a set of rules like `rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]`. The problem now is that the spider parses everything.

How can I tell the spider to parse the start URL and only some of the links that it includes?

**Update:**

I've tried to override the `parse_start_url` method, so now I'm able to get data from the start page, but it still doesn't follow links defined with a `Rule`:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# Article is the project's Item subclass; its import is not shown in the question.


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']
    rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        # Build one Article item per headline link on the page.
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
```
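A quick way to debug a `Rule` that seems to do nothing is to exercise the link extractor on its own, outside the spider. The following is a minimal sketch, assuming the same pre-1.0 Scrapy API used in the question; run it inside `scrapy shell http://techcrunch.com`, where `response` is already defined:

```python
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Print every URL the extractor would feed to the Rule. An empty
# result means the allow pattern, not the spider, is the problem.
for link in SgmlLinkExtractor(allow=['/page/\d+']).extract_links(response):
    print link.url
```

Note that `allow` patterns are applied with `re.search` against each absolute URL, so a pattern that has lost its backslash (`/page/d+` instead of `/page/\d+`) quietly matches nothing, which is one common cause of a rule that never fires.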
 
