1. Scrapy only scraping first result of each page
I'm currently trying to run the following code, but it keeps scraping only the first result of each page. Any idea what the issue may be?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from firstproject.items import xyz123Item
    import urlparse
    from scrapy.http.request import Request

    class MySpider(CrawlSpider):
        name = "xyz123"
        allowed_domains = ["www.xyz123.com.au"]
        start_urls = ["http://www.xyz123.com.au/"]

        rules = (
            Rule(SgmlLinkExtractor(allow=("",), restrict_xpaths=('//*[@id="1234headerPagination_hlNextLink"]',)),
                 callback="parse_xyz", follow=True),
        )

        def parse_xyz(self, response):
            hxs = HtmlXPathSelector(response)
            xyz = hxs.select('//div[@id="1234SearchResults"]//div/h2')
            items = []
            for xyz in xyz:
                item = xyz123Item()
                item["title"] = xyz.select('a/text()').extract()[0]
                item["link"] = xyz.select('a/@href').extract()[0]
                items.append(item)
            return items

The BaseSpider version works well and scrapes ALL the required data on the first page:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from firstproject.items import xyz123

    class MySpider(BaseSpider):
        name = "xyz123test"
        allowed_domains = ["xyz123.com.au"]
        start_urls = ["http://www.xyz123.com.au/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select('//div[@id="1234SearchResults"]//div/h2')
            items = []
            for titles in titles:
                item = xyz123Item()
                item["title"] = titles.select("a/text()").extract()
                item["link"] = titles.select("a/@href").extract()
                items.append(item)
            return items

Sorry for the censoring; I had to censor the website for privacy reasons.

The first spider crawls through the pages the way I'd like it to, but it only pulls the first item's title and link from each page. NOTE: the XPath of the first title, taken with "inspect element" in Google, is

    //*[@id="xyz123SearchResults"]/div[1]/h2/a

the second is

    //*[@id="xyz123SearchResults"]/div[2]/h2/a

the third is

    //*[@id="xyz123SearchResults"]/div[3]/h2/a

and so on.

I'm not sure if the div[n] bit is what's killing it. I'm hoping it's an easy fix.

Thanks
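For reference, the per-result extraction pattern the CrawlSpider callback appears to be aiming for looks roughly like the sketch below. It reuses the censored IDs, the firstproject.items.xyz123Item class, and the old scrapy.contrib API from the question; the spider name, callback name, and loop variable are placeholders, and the sketch only illustrates relative-XPath extraction on each paginated page, not a confirmed fix for the behaviour described above.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from firstproject.items import xyz123Item

    class SketchSpider(CrawlSpider):
        name = "xyz123_sketch"
        allowed_domains = ["www.xyz123.com.au"]
        start_urls = ["http://www.xyz123.com.au/"]

        # Follow the paginated "next" link and run the callback on every page.
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="1234headerPagination_hlNextLink"]',)),
                 callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            hxs = HtmlXPathSelector(response)
            # One selector per result <h2>; a loop variable distinct from the
            # selector list keeps the current node and the list separate.
            for result in hxs.select('//div[@id="1234SearchResults"]//div/h2'):
                item = xyz123Item()
                # Relative XPaths ('a/...') are evaluated against each
                # individual <h2> node rather than the whole document.
                item["title"] = result.select('a/text()').extract()
                item["link"] = result.select('a/@href').extract()
                yield item

The BaseSpider version above already extracts every result on its single page with essentially this loop body, so the sketch mainly shows the same extraction attached to the pagination Rule.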