
<p>CrawlSpider inherits from BaseSpider; it just adds rules for extracting and following links. If these rules are not flexible enough for you, use BaseSpider:</p> <pre><code>import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class USpider(BaseSpider):
    """My spider."""
    start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&amp;sort=relevance-fs-browse-rank']
    allowed_domains = ['amazon.com']

    def parse(self, response):
        '''Parse the main category search page and extract subcategory search links.'''
        self.log('Downloaded category search page.', log.DEBUG)
        if response.meta['depth'] &gt; 5:
            self.log('Categories depth limit reached (recursive links?). '
                     'Stopping further following.', log.WARNING)
            return
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select("//div[@id='refinements']"
                                   "/*[starts-with(.,'Department')]"
                                   "/following-sibling::ul[1]"
                                   "/li/a[span[@class='refinementLink']]/@href").extract()
        for subcategory in subcategories:
            # Resolve the relative href against the current page URL.
            subcategorySearchLink = urlparse.urljoin(response.url, subcategory)
            yield Request(subcategorySearchLink, callback=self.parseSubcategory)

    def parseSubcategory(self, response):
        '''Parse a subcategory search page and extract item links.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select('//a[@class="title"]/@href').extract():
            itemLink = urlparse.urljoin(response.url, itemLink)
            self.log('Requesting item page: ' + itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)
        try:
            nextPageLink = hxs.select("//a[@id='pagnNextLink']/@href").extract()[0]
        except IndexError:
            # No "next" link on the page: the whole category has been parsed.
            self.log('Whole category parsed: ' + response.url, log.DEBUG)
        else:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            yield Request(nextPageLink, callback=self.parseSubcategory)

    def parseItem(self, response):
        '''Parse an item page and extract product info.'''
        hxs = HtmlXPathSelector(response)
        # UItem and extractText are defined elsewhere in the project.
        item = UItem()
        item['brand'] = self.extractText("//div[@class='buying']/span[1]/a[1]", hxs)
        item['title'] = self.extractText("//span[@id='btAsinTitle']", hxs)
        ...
</code></pre> <p>And if even BaseSpider's start_urls is not flexible enough for you, override the <a href="http://doc.scrapy.org/topics/spiders.html#scrapy.spider.BaseSpider.start_requests" rel="noreferrer">start_requests</a> method.</p>
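The start_requests idea can be sketched as follows: generate the initial URLs programmatically instead of hard-coding start_urls. This is a minimal illustration, not code from the answer: the keyword list and the build_search_urls helper are made up, the spider snippet in the comment reuses Scrapy's BaseSpider and Request as above, and it uses Python 3's urllib.parse while the answer targets Python 2 Scrapy.

```python
from urllib.parse import urlencode  # Python 3; the answer's code is Python 2

# Hypothetical helper: build one Amazon search URL per keyword.
def build_search_urls(keywords, alias='apparel'):
    base = 'http://www.amazon.com/s/?'
    return [base + urlencode({'url': 'search-alias=' + alias,
                              'field-keywords': kw})
            for kw in keywords]

# In the spider, start_requests would then yield one Request per generated
# URL (BaseSpider and Request come from Scrapy, as in the answer above):
#
#     class USpider(BaseSpider):
#         name = 'uspider'
#
#         def start_requests(self):
#             for url in build_search_urls(['jeans', 'shirts']):
#                 yield Request(url, callback=self.parse)

print(build_search_urls(['jeans'])[0])
```

Because start_requests is a generator, the URL list can come from anywhere: a database, a file, or a previous crawl.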