
CrawlSpider not scraping anything
```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
# scrapy crawl dmoz -o items.json -t json
from scrapy.http import Request
from urlparse import urlparse

from manga.items import MangaItem

class MangaHere(CrawlSpider):
    name = "mangahs"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]
    rules = [Rule(SgmlLinkExtractor(allow=('//a[@class="next"]')), follow=True, callback='parse_item'),]

    # def parse(self, response):
    #     # get index depth for every page
    #     hxs = HtmlXPathSelector(response)
    #     next_link = hxs.select('//a[@class="next"]')
    #     index_depth = int(next_link.select('preceding-sibling::a[1]/text()').extract()[0])
    #     # create a request for the first page
    #     url = urlparse("http://www.mangahere.com/seinen/")
    #     yield Request(url.geturl(), callback=self.parse_item)
    #     # create a request for each subsequent page in the form "./seinen/x.html"
    #     for x in xrange(2, index_depth):
    #         pageURL = "http://www.mangahere.com/seinen/%s.htm" % x
    #         url = urlparse(pageURL)
    #         yield Request(url.geturl(), callback=self.parse_item)

    def parse_start_url(self, response):
        list(self.parse_item(response))

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            rating = site.select("p/span/text()").extract()
            desc = site.select("p[2]/text()").extract()
            for i in rating:
                for p in desc:
                    if float(i) > 4.8 and "ecchi" not in str(p):
                        item = MangaItem()
                        item['title'] = site.select("div/a/text()").extract()
                        item['link'] = site.select("div/a/@href").extract()
                        item['desc'] = site.select("p[2]/text()").extract()
                        item['rate'] = site.select("p/span/text()").extract()
                        items.append(item)
        return items
```

The commented-out code is a way of crawling the pages without a CrawlSpider that someone here helped me with, but I still want to learn how to make the CrawlSpider itself work, for the sake of knowing how.

I get no errors, but it scrapes 0 pages. I checked a lot of threads, and it sounded like I had to add a `parse_start_url` for some reason, but that didn't help; renaming the parse functions didn't help either.

What is not working? Is my rule incorrect, or am I missing something?
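
For reference, here is a minimal sketch of how the rule might look if `allow` is given URL regular expressions (which is what the `SgmlLinkExtractor` docs describe) rather than an XPath, with the XPath moved to `restrict_xpaths`. The `/seinen/\d+\.htm` pattern, the class name, and the spider name below are assumptions for illustration, not tested against the site:

```python
# Sketch only: assumes SgmlLinkExtractor's `allow` takes URL regexes and that
# XPath-based restriction goes through `restrict_xpaths`. The regex, class
# name, and spider name are hypothetical.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MangaHereSketch(CrawlSpider):
    name = "mangahs_sketch"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]
    rules = [
        # Option 1: match the pagination URLs with a regex (pattern assumed).
        Rule(SgmlLinkExtractor(allow=(r'/seinen/\d+\.htm',)),
             follow=True, callback='parse_item'),
        # Option 2: keep the XPath, but pass it to restrict_xpaths instead:
        # Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next"]',)),
        #      follow=True, callback='parse_item'),
    ]

    def parse_start_url(self, response):
        # Return the result so items from the first page are not discarded.
        return self.parse_item(response)

    def parse_item(self, response):
        # Same body as parse_item in the question; stubbed here.
        return []
```

Note also that the `parse_start_url` in the question builds a list but never returns it, so anything scraped from the first page would be dropped either way.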