Need design suggestions for an efficient web crawler that is going to parse 8M pages - Python

I'm going to develop a little crawler that will fetch a lot of pages from the same website; every request is just a change of the ID number in the URL.

I need to save all the data I parse into a CSV (nothing fancy). At most I will crawl about 6M-8M pages, and most of them don't contain the data I want; I know there are about 400K pages that I do need to parse. They are all similar in structure, and I can't avoid crawling all the URLs.

This is how the page looks when I get the data - http://pastebin.com/3DYPhPRg

This is how it looks when I don't - http://pastebin.com/YwxXAmih

The data is saved in the spans inside the td's; I need the text between the ">" and the "</span>":

```
<span id="lblCompanyNumber">520000472</span></td>
<span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
<span id="lblStatus">פעילה</span></td>
<span id="lblCorporationType">חברה ציבורית</span></td>
<span id="lblGovCompanyType">חברה ממשלתית</span></td>
<span id="lblLimitType">מוגבלת</span></td>
etc.
```

That's nothing too hard to parse out of the document.

The problem is that it will take a few days to fetch and parse all the URLs, it will consume a lot of memory, and I suspect it will crash now and then, which is very dangerous for me: it must not crash unless it genuinely can't run anymore.

My plan (a minimal sketch of it is appended at the end of this question):

```
- fetch a URL (urllib2)
- if there's an error, move to the next one
  (if it happens 5 times, stop and save the errors to a log)
- parse the HTML (still don't know what's best -
  BeautifulSoup / lxml / scrapy / HTMLParser etc.)
- if the page is empty (lblCompanyNumber will be empty),
  save the ID to emptyCsvFile.csv
- else: save the data to goodResults.csv
```

The questions are:

1. Which data types should I use to be more efficient and quick (for the data I parse and for the fetched content)?
2. Which HTML parsing library should I use? Maybe regex? The span ids are fixed and don't change when there's data (again: efficiency, speed, simplicity).
3. Saving to a file, keeping a handle to the file open for that long, etc. - is there a way to save the data that takes fewer resources and is more efficient? (At least 400K lines.)
4. Anything else I haven't thought about that I need to deal with, and maybe some optimization tips :)

Another solution I thought of is using wget, saving all the pages to disk, and then deleting every file that has the same md5sum as an empty document; the only problem with that is that I wouldn't be recording the empty IDs.

By the way, I need to use py2exe and make an exe out of this, so things like scrapy can be hard to use here (it's known to cause issues with py2exe).

Thanks!
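
Edit: to make the plan concrete, here is a minimal sketch of the fetch/parse/save loop I have in mind, assuming Python 2 with urllib2 (as above) plus lxml for the parsing. `BASE_URL` is a placeholder, and the timeout, backoff, flush interval and retry handling (retry each URL 5 times, then log it and move on) are assumptions of mine, not hard requirements:

```python
# -*- coding: utf-8 -*-
import csv
import time
import urllib2
from lxml import html

# Placeholder: the real site's URL template goes here.
BASE_URL = 'http://example.com/company.aspx?id=%d'
FIELDS = ['lblCompanyNumber', 'lblCompanyNameHeb', 'lblStatus',
          'lblCorporationType', 'lblGovCompanyType', 'lblLimitType']

def fetch(url, retries=5):
    """Fetch a URL, retrying up to `retries` times before giving up."""
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except IOError:                     # URLError/HTTPError subclass IOError
            time.sleep(2 ** attempt)        # simple backoff between attempts
    return None

def parse(page):
    """Return the span texts in FIELDS order, or None for an empty page."""
    tree = html.fromstring(page)
    if not tree.xpath('//span[@id="lblCompanyNumber"]/text()'):
        return None                         # no company number -> no data
    row = []
    for field in FIELDS:
        text = tree.xpath('//span[@id="%s"]/text()' % field)
        row.append(text[0].encode('utf-8') if text else '')
    return row

def main():
    good = open('goodResults.csv', 'ab')    # append mode: a crash loses
    empty = open('emptyCsvFile.csv', 'ab')  # at most the unflushed rows
    errors = open('errors.log', 'ab')
    good_writer = csv.writer(good)
    empty_writer = csv.writer(empty)
    for page_id in xrange(1, 8000001):
        page = fetch(BASE_URL % page_id)
        if page is None:
            errors.write('%d\n' % page_id)  # remember IDs that kept failing
            continue
        row = parse(page)
        if row is None:
            empty_writer.writerow([page_id])
        else:
            good_writer.writerow(row)
        if page_id % 1000 == 0:             # flush periodically, not per row
            good.flush()
            empty.flush()

if __name__ == '__main__':
    main()
```

Parsing one page at a time and appending rows as they arrive should keep memory flat no matter how many IDs are crawled, and the append-mode files mean a crash loses at most the rows written since the last flush.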
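
For question 2, since the span ids are fixed and the markup is machine-generated, perhaps a single precompiled regex is enough instead of a full HTML parser. A sketch of what I mean (not a claim that regex is safe for arbitrary HTML):

```python
import re

# One precompiled pattern pulls every "lbl*" span out of a page in one
# pass; this leans on the markup being machine-generated and regular.
SPAN_RE = re.compile(r'<span id="(lbl\w+)">([^<]*)</span>')

def parse_spans(page):
    """Return a {span_id: text} dict for every lbl* span on the page."""
    return dict(SPAN_RE.findall(page))

# parse_spans(page).get('lblCompanyNumber') is then None (or '') exactly
# when the page is one of the empty ones.
```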
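
And if I went the wget route after all, the cleanup step would just be a hash comparison. A sketch, assuming (hypothetically) that the files were saved as `<id>.html` into a `pages/` directory, which would also solve the problem of recording the empty IDs:

```python
import hashlib
import os

def md5_of(path):
    """md5 of a file's contents, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), ''):
            digest.update(chunk)
    return digest.hexdigest()

# One page known to be empty serves as the reference fingerprint.
EMPTY_MD5 = md5_of('known_empty_page.html')

for name in os.listdir('pages'):            # assumes files saved as <id>.html
    path = os.path.join('pages', name)
    if md5_of(path) == EMPTY_MD5:
        print name.split('.')[0]            # the empty ID is not lost after all
        os.remove(path)
```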