
Scrapy crawling just one level of a website
I am using Scrapy to crawl all the web pages under a domain.

I have seen [this question](https://stackoverflow.com/questions/8381082/scrapy-not-crawling-all-the-pages), but there is no solution there. My problem seems to be a similar one. The output of my crawl command looks like this:

```
scrapy crawl sjsu

2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines:
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232,
     'memusage/startup': 29663232}
```

The problem here is that the crawl finds links on the first page but does not visit them. What's the use of such a crawler?

**EDIT:**

My crawler code is:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SjsuSpider(BaseSpider):

    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)
```

All of my other settings are the defaults.
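For context on why this happens: a `BaseSpider` only requests its `start_urls`, and the crawl stops there unless the `parse` callback itself yields `Request` objects for the links it finds. Below is a minimal sketch of a link-following `parse`, assuming the Scrapy 0.14-era API visible in the logs above and Python 2's standard-library `urlparse`; it is one way to keep the crawl going, not the only one.

```python
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # Save the page body, as the original spider does (appending,
        # so later pages do not overwrite earlier ones).
        open("sjsupages", "ab").write(response.body)

        # Queue every link on the page for crawling. OffsiteMiddleware
        # (enabled per the log above) drops URLs outside allowed_domains,
        # and the scheduler skips URLs it has already seen, so this
        # does not loop forever.
        hxs = HtmlXPathSelector(response)
        for href in hxs.select("//a/@href").extract():
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse)
```

Alternatively, a `CrawlSpider` (in `scrapy.contrib.spiders` in this Scrapy version) with a link-extracting `Rule` handles the following automatically.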