StackOverflow2013

<p>If you want to build a <a href="http://en.wikipedia.org/wiki/Document_Object_Model" rel="noreferrer">DOM</a>, you can use <a href="https://github.com/tmpvar/jsdom" rel="noreferrer">jsdom</a>.</p>
<p>There's also <a href="https://github.com/MatthewMueller/cheerio" rel="noreferrer">cheerio</a>, which has the <a href="http://jquery.com/" rel="noreferrer">jQuery</a> interface and is a lot faster than older versions of jsdom, although these days the two are similar in performance.</p>
<p>You might want to have a look at <a href="https://github.com/fb55/htmlparser2" rel="noreferrer">htmlparser2</a>, a streaming parser that, according to its own benchmark, seems to be faster than the others and builds no DOM by default. It can also produce a DOM, since it is bundled with a handler that creates one. This is the parser that cheerio uses.</p>
<p><a href="https://github.com/inikulin/parse5" rel="noreferrer">parse5</a> also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in <a href="https://github.com/tmpvar/jsdom" rel="noreferrer">jsdom</a>, <a href="https://github.com/angular/angular" rel="noreferrer">Angular</a>, and <a href="https://github.com/Polymer/polymer" rel="noreferrer">Polymer</a>.</p>
<p>And if you want to parse HTML for <a href="http://en.wikipedia.org/wiki/Web_scraping" rel="noreferrer">web scraping</a>, you can use <a href="http://developer.yahoo.com/yql/" rel="noreferrer">YQL</a>. There is a <a href="https://github.com/derek/node-yql" rel="noreferrer">node module</a> for it. I think YQL would be the best solution if your HTML comes from a <a href="http://en.wikipedia.org/wiki/Static_web_page" rel="noreferrer">static</a> website, since you rely on a service rather than on your own code and processing power. Note, though, that YQL won't work if the page is disallowed by the website's robots.txt.</p>
<p>If the website you're trying to scrape is <a href="http://en.wikipedia.org/wiki/Dynamic_web_page" rel="noreferrer">dynamic</a>, you should use a <a href="https://en.wikipedia.org/wiki/Headless_browser" rel="noreferrer">headless browser</a> like <a href="http://phantomjs.org/" rel="noreferrer">phantomjs</a>. Also have a look at <a href="http://casperjs.org/" rel="noreferrer">casperjs</a> if you're considering phantomjs. You can control casperjs from node with <a href="https://github.com/WaterfallEngineering/SpookyJS" rel="noreferrer">SpookyJS</a>.</p>
<p>Besides phantomjs there's <a href="http://zombie.labnotes.org/" rel="noreferrer">zombiejs</a>. Unlike phantomjs, which cannot be embedded in nodejs, zombiejs is just a node module.</p>
<p>There's a <a href="http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/" rel="noreferrer">nettuts+ tutorial</a> for the latter solutions.</p>
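<p>To make the "streaming parser, no DOM by default" idea concrete, here is a minimal sketch of the event model. It is a toy scanner written for illustration, not htmlparser2's implementation: the functions <code>streamParse</code> and <code>collectLinkText</code> are hypothetical names, and the code assumes simple, well-formed markup (no comments, no scripts, no <code>&gt;</code> inside attribute values). The callback names <code>onopentag</code>, <code>ontext</code> and <code>onclosetag</code> are chosen to resemble htmlparser2's handler style.</p>

```javascript
// Toy SAX-style scanner (illustration only, NOT htmlparser2).
// Assumes simple, well-formed markup: no comments, scripts, or '>' in attributes.
function streamParse(html, handlers) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g;
  let lastIndex = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    // Emit any text found between the previous tag and this one.
    if (match.index > lastIndex && handlers.ontext) {
      handlers.ontext(html.slice(lastIndex, match.index));
    }
    const [, slash, name] = match;
    if (slash) {
      if (handlers.onclosetag) handlers.onclosetag(name.toLowerCase());
    } else if (handlers.onopentag) {
      handlers.onopentag(name.toLowerCase());
    }
    lastIndex = tagPattern.lastIndex;
  }
  // Emit trailing text after the last tag, if any.
  if (lastIndex < html.length && handlers.ontext) {
    handlers.ontext(html.slice(lastIndex));
  }
}

// Collect the text of every <a> element without ever building a tree.
function collectLinkText(html) {
  const links = [];
  let insideLink = false;
  streamParse(html, {
    onopentag(name) {
      if (name === 'a') { insideLink = true; links.push(''); }
    },
    ontext(text) {
      if (insideLink) links[links.length - 1] += text;
    },
    onclosetag(name) {
      if (name === 'a') insideLink = false;
    },
  });
  return links;
}

console.log(collectLinkText('<p>See <a href="x">jsdom</a> and <a href="y"><b>cheerio</b></a>.</p>'));
// → [ 'jsdom', 'cheerio' ]
```

<p>The handlers only keep the state they need (here, a flag and a list), which is why this style can use less memory than building a full DOM. For real-world HTML, use an actual parser such as htmlparser2 or parse5 rather than a regular expression like the one above.</p>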
