
Scrapy crawling just one level of a website
I am using Scrapy to crawl all the web pages under a domain.

I have seen [this question](https://stackoverflow.com/questions/8381082/scrapy-not-crawling-all-the-pages), but there is no solution there. My problem seems to be a similar one. The output of my crawl command looks like this:

```
scrapy crawl sjsu

2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines:
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232,
     'memusage/startup': 29663232}
```

The problem here is that the crawl finds links on the first page but does not visit them. What's the use of such a crawler?

**EDIT:**

My crawler code is:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SjsuSpider(BaseSpider):

    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)
```

All of my other settings are the defaults.
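For context on why this happens: a `BaseSpider` only requests its `start_urls`, and the crawl stops there unless the `parse` callback itself yields `Request` objects for the links it finds. Below is a minimal sketch of a link-following `parse`, assuming the Scrapy 0.14-era API visible in the logs above and Python 2's standard-library `urlparse`; it is one way to keep the crawl going, not the only one.

```python
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # Save the page body, as the original spider does (appending,
        # so later pages do not overwrite earlier ones).
        open("sjsupages", "ab").write(response.body)

        # Queue every link on the page for crawling. OffsiteMiddleware
        # (enabled per the log above) drops URLs outside allowed_domains,
        # and the scheduler skips URLs it has already seen, so this
        # does not loop forever.
        hxs = HtmlXPathSelector(response)
        for href in hxs.select("//a/@href").extract():
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse)
```

Alternatively, a `CrawlSpider` (in `scrapy.contrib.spiders` in this Scrapy version) with a link-extracting `Rule` handles the following automatically.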