Need design suggestions for an efficient web crawler that is going to parse 8M pages - Python

I'm going to develop a little crawler that will fetch a lot of pages from the same website; every request is just a change of the ID number in the URL.

I need to save all the data I parse into a CSV (nothing fancy). At most I will crawl about 6M-8M pages, and most of them don't contain the data I want; I know there are about 400K pages that I do need to parse. They are all similar in structure, and I can't avoid crawling all the URLs.

This is how the page looks when I get the data - http://pastebin.com/3DYPhPRg

This is how it looks when I don't - http://pastebin.com/YwxXAmih

The data is saved in the spans inside the td's; I need the text between the ">" and the "</span>":

```
<span id="lblCompanyNumber">520000472</span></td>
<span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
<span id="lblStatus">פעילה</span></td>
<span id="lblCorporationType">חברה ציבורית</span></td>
<span id="lblGovCompanyType">חברה ממשלתית</span></td>
<span id="lblLimitType">מוגבלת</span></td>
etc.
```

That's nothing too hard to parse out of the document.

The problem is that it will take a few days to fetch and parse all the URLs, it will consume a lot of memory, and I suspect it will crash now and then, which is very dangerous for me: it must not crash unless it genuinely can't run anymore.

My plan (a minimal sketch of it is appended at the end of this question):

```
- fetch a URL (urllib2)
- if there's an error, move to the next one
  (if it happens 5 times, stop and save the errors to a log)
- parse the HTML (still don't know what's best -
  BeautifulSoup / lxml / scrapy / HTMLParser etc.)
- if the page is empty (lblCompanyNumber will be empty),
  save the ID to emptyCsvFile.csv
- else: save the data to goodResults.csv
```

The questions are:

1. Which data types should I use to be more efficient and quick (for the data I parse and for the fetched content)?
2. Which HTML parsing library should I use? Maybe regex? The span ids are fixed and don't change when there's data (again: efficiency, speed, simplicity).
3. Saving to a file, keeping a handle to the file open for that long, etc. - is there a way to save the data that takes fewer resources and is more efficient? (At least 400K lines.)
4. Anything else I haven't thought about that I need to deal with, and maybe some optimization tips :)

Another solution I thought of is using wget, saving all the pages to disk, and then deleting every file that has the same md5sum as an empty document; the only problem with that is that I wouldn't be recording the empty IDs.

By the way, I need to use py2exe and make an exe out of this, so things like scrapy can be hard to use here (it's known to cause issues with py2exe).

Thanks!
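
Edit: to make the plan concrete, here is a minimal sketch of the fetch/parse/save loop I have in mind, assuming Python 2 with urllib2 (as above) plus lxml for the parsing. `BASE_URL` is a placeholder, and the timeout, backoff, flush interval and retry handling (retry each URL 5 times, then log it and move on) are assumptions of mine, not hard requirements:

```python
# -*- coding: utf-8 -*-
import csv
import time
import urllib2
from lxml import html

# Placeholder: the real site's URL template goes here.
BASE_URL = 'http://example.com/company.aspx?id=%d'
FIELDS = ['lblCompanyNumber', 'lblCompanyNameHeb', 'lblStatus',
          'lblCorporationType', 'lblGovCompanyType', 'lblLimitType']

def fetch(url, retries=5):
    """Fetch a URL, retrying up to `retries` times before giving up."""
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except IOError:                     # URLError/HTTPError subclass IOError
            time.sleep(2 ** attempt)        # simple backoff between attempts
    return None

def parse(page):
    """Return the span texts in FIELDS order, or None for an empty page."""
    tree = html.fromstring(page)
    if not tree.xpath('//span[@id="lblCompanyNumber"]/text()'):
        return None                         # no company number -> no data
    row = []
    for field in FIELDS:
        text = tree.xpath('//span[@id="%s"]/text()' % field)
        row.append(text[0].encode('utf-8') if text else '')
    return row

def main():
    good = open('goodResults.csv', 'ab')    # append mode: a crash loses
    empty = open('emptyCsvFile.csv', 'ab')  # at most the unflushed rows
    errors = open('errors.log', 'ab')
    good_writer = csv.writer(good)
    empty_writer = csv.writer(empty)
    for page_id in xrange(1, 8000001):
        page = fetch(BASE_URL % page_id)
        if page is None:
            errors.write('%d\n' % page_id)  # remember IDs that kept failing
            continue
        row = parse(page)
        if row is None:
            empty_writer.writerow([page_id])
        else:
            good_writer.writerow(row)
        if page_id % 1000 == 0:             # flush periodically, not per row
            good.flush()
            empty.flush()

if __name__ == '__main__':
    main()
```

Parsing one page at a time and appending rows as they arrive should keep memory flat no matter how many IDs are crawled, and the append-mode files mean a crash loses at most the rows written since the last flush.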
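
For question 2, since the span ids are fixed and the markup is machine-generated, perhaps a single precompiled regex is enough instead of a full HTML parser. A sketch of what I mean (not a claim that regex is safe for arbitrary HTML):

```python
import re

# One precompiled pattern pulls every "lbl*" span out of a page in one
# pass; this leans on the markup being machine-generated and regular.
SPAN_RE = re.compile(r'<span id="(lbl\w+)">([^<]*)</span>')

def parse_spans(page):
    """Return a {span_id: text} dict for every lbl* span on the page."""
    return dict(SPAN_RE.findall(page))

# parse_spans(page).get('lblCompanyNumber') is then None (or '') exactly
# when the page is one of the empty ones.
```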
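
And if I went the wget route after all, the cleanup step would just be a hash comparison. A sketch, assuming (hypothetically) that the files were saved as `<id>.html` into a `pages/` directory, which would also solve the problem of recording the empty IDs:

```python
import hashlib
import os

def md5_of(path):
    """md5 of a file's contents, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), ''):
            digest.update(chunk)
    return digest.hexdigest()

# One page known to be empty serves as the reference fingerprint.
EMPTY_MD5 = md5_of('known_empty_page.html')

for name in os.listdir('pages'):            # assumes files saved as <id>.html
    path = os.path.join('pages', name)
    if md5_of(path) == EMPTY_MD5:
        print name.split('.')[0]            # the empty ID is not lost after all
        os.remove(path)
```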