
Can a lot of data exceed stack size in Node.js?
I am not very familiar with the inner workings of Node.js, but as far as I know, you get 'Maximum call stack size exceeded' errors when you make too many function calls.

I'm making a spider that follows links, and I started getting these errors after a random number of crawled URLs. Node doesn't give you a stack trace when this happens, but I'm pretty sure I don't have any recursion errors.

I am using [request](http://github.com/mikeal/request) to fetch URLs, and I *was* using [cheerio](https://github.com/MatthewMueller/cheerio) to parse the fetched HTML and detect new links. The stack overflows always happened inside cheerio. When I swapped cheerio for [htmlparser2](https://github.com/fb55/node-htmlparser), the errors disappeared. Htmlparser2 is much lighter, since it just emits events on each open tag instead of parsing the whole document and constructing a tree.

My theory is that cheerio ate up all the memory in the stack, but I'm not sure that's even possible.

Here's a simplified version of my code (it's for reading only; it won't run):

```javascript
var _ = require('underscore');
var fs = require('fs');
var urllib = require('url');
var request = require('request');
var cheerio = require('cheerio');

var mongo = "This is a global connection to mongodb.";
var maxConc = 7;

var crawler = {
    concurrent: 0,
    queue: [],
    fetched: {},

    fetch: function(url) {
        var self = this;
        self.concurrent += 1;
        self.fetched[url] = 0;

        request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } },
        function(err, response, body){
            self.concurrent -= 1;
            self.fetched[url] = 1;
            self.extract(url, body);
        });
    },

    extract: function(referrer, data) {
        var self = this;
        var urls = [];

        mongo.pages.insert({ _id: referrer, html: data, time: +(new Date) });

        /**
         * THE ERROR HAPPENS HERE, AFTER A RANDOM NUMBER OF FETCHED PAGES
         **/
        cheerio.load(data)('a').each(function(){
            var href = resolve(this.attribs.href, referrer); // resolves relative urls, not important

            // Save the href only if it hasn't been fetched, it's not already
            // in the queue and it's not already on this page
            if(href && !_.has(self.fetched, href) && !_.contains(self.queue, href) && !_.contains(urls, href))
                urls.push(href);
        });

        // Check the database to see if we already visited some urls.
        mongo.pages.find({ _id: { $in: urls } }, { _id: 1 }).toArray(function(err, results){
            if(err) results = [];
            else results = _.pluck(results, '_id');

            urls = urls.filter(function(url){ return !_.contains(results, url); });
            self.push(urls);
        });
    },

    push: function(urls) {
        Array.prototype.push.apply( this.queue, urls );

        var url, self = this;

        while((url = self.queue.shift()) && this.concurrent < maxConc) {
            self.fetch( url );
        }
    }
};

crawler.fetch( 'http://some.test.url.com/' );
```
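As a side note on the question in the title: in V8, the call stack and the heap are separate, so the *size* of the data alone shouldn't overflow the stack. 'Maximum call stack size exceeded' is a `RangeError` raised by call *depth*, i.e. too many nested function calls. A minimal, self-contained sketch (the `depth` function and the sizes are my own illustration, not from the crawler code above):

```javascript
// "Maximum call stack size exceeded" is about call depth, not data volume.

// Plain non-tail recursion: each call adds a stack frame.
function depth(n) {
  return n === 0 ? 0 : 1 + depth(n - 1);
}

// A large allocation lives on the heap and never touches the call stack.
const big = new Array(1e7).fill("x");
console.log(big.length); // 10000000 -- no stack involvement

// Deep recursion, however, overflows the stack long before 1e7 frames.
let overflowed = false;
try {
  depth(1e7);
} catch (e) {
  // In Node this is a RangeError: "Maximum call stack size exceeded"
  overflowed = e instanceof RangeError;
}
console.log(overflowed); // true
```

So a parser that builds a large tree (like cheerio does) would normally exhaust the *heap* first; a stack overflow inside it would more likely come from deeply recursive traversal of a deeply nested document than from the sheer amount of data.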
 
