StackOverflow2013


Options for HTML scraping?
<p>I'm thinking of trying <a href="http://en.wikipedia.org/wiki/Beautiful_Soup" rel="nofollow noreferrer">Beautiful Soup</a>, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.</p>
<p>The story so far:</p>
<ul>
  <li>Python
    <ul>
      <li><a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow noreferrer">Beautiful Soup</a></li>
      <li><a href="http://codespeak.net/lxml/" rel="nofollow noreferrer">lxml</a></li>
      <li><a href="http://htql.net/" rel="nofollow noreferrer">HTQL</a></li>
      <li><a href="http://scrapy.org/" rel="nofollow noreferrer">Scrapy</a></li>
      <li><a href="http://wwwsearch.sourceforge.net/mechanize/" rel="nofollow noreferrer">Mechanize</a></li>
    </ul>
  </li>
  <li>Ruby
    <ul>
      <li><a href="http://nokogiri.org/" rel="nofollow noreferrer">Nokogiri</a></li>
      <li><a href="https://github.com/hpricot/hpricot/" rel="nofollow noreferrer">Hpricot</a></li>
      <li><a href="https://github.com/tenderlove/mechanize" rel="nofollow noreferrer">Mechanize</a></li>
      <li><a href="http://rubyforge.org/projects/scrapi/" rel="nofollow noreferrer">scrAPI</a></li>
      <li><a href="http://scrubyt.org/" rel="nofollow noreferrer">scRUBYt!</a></li>
      <li><a href="https://github.com/felipecsl/wombat" rel="nofollow noreferrer">wombat</a></li>
      <li><a href="http://watir.com" rel="nofollow noreferrer">Watir</a></li>
    </ul>
  </li>
  <li>.NET
    <ul>
      <li><a href="http://html-agility-pack.net/?z=codeplex" rel="nofollow noreferrer">Html Agility Pack</a></li>
      <li><a href="http://watin.org/" rel="nofollow noreferrer">WatiN</a></li>
    </ul>
  </li>
  <li>Perl
    <ul>
      <li><a href="http://search.cpan.org/dist/WWW-Mechanize/" rel="nofollow noreferrer">WWW::Mechanize</a></li>
      <li><a href="http://search.cpan.org/dist/Web-Scraper/" rel="nofollow noreferrer">Web-Scraper</a></li>
    </ul>
  </li>
  <li>Java
    <ul>
      <li><a href="http://home.ccil.org/~cowan/XML/tagsoup/" rel="nofollow noreferrer">Tag Soup</a></li>
      <li><a href="http://htmlunit.sourceforge.net/" rel="nofollow noreferrer">HtmlUnit</a></li>
      <li><a href="http://web-harvest.sourceforge.net/" rel="nofollow noreferrer">Web-Harvest</a></li>
      <li><a href="http://sing.ei.uvigo.es/jarvest" rel="nofollow noreferrer">jARVEST</a></li>
      <li><a href="http://jsoup.org/" rel="nofollow noreferrer">jsoup</a></li>
      <li><a href="http://jericho.htmlparser.net/docs/index.html" rel="nofollow noreferrer">Jericho HTML Parser</a></li>
    </ul>
  </li>
  <li>JavaScript
    <ul>
      <li><a href="https://github.com/request/request" rel="nofollow noreferrer">request</a></li>
      <li><a href="https://github.com/cheeriojs/cheerio" rel="nofollow noreferrer">cheerio</a></li>
      <li><a href="http://medialab.github.io/artoo/" rel="nofollow noreferrer">artoo</a></li>
      <li><a href="https://github.com/johntitus/node-horseman" rel="nofollow noreferrer">node-horseman</a></li>
      <li><a href="http://phantomjs.org/" rel="nofollow noreferrer">phantomjs</a></li>
    </ul>
  </li>
  <li>PHP
    <ul>
      <li><a href="https://github.com/FriendsOfPHP/Goutte" rel="nofollow noreferrer">Goutte</a></li>
      <li><a href="https://github.com/hxseven/htmlSQL" rel="nofollow noreferrer">htmlSQL</a></li>
      <li><a href="http://sourceforge.net/projects/simplehtmldom/" rel="nofollow noreferrer">PHP Simple HTML DOM Parser</a></li>
      <li><a href="http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/" rel="nofollow noreferrer">PHP Scraping with CURL</a></li>
      <li><a href="https://github.com/ScarletsFiction/ScarletsQuery" rel="nofollow noreferrer">ScarletsQuery</a></li>
    </ul>
  </li>
  <li>Most of them
    <ul>
      <li><a href="http://www.screen-scraper.com/" rel="nofollow noreferrer">Screen-Scraper</a></li>
    </ul>
  </li>
</ul>
