Options for HTML scraping?
<p>I'm thinking of trying <a href="http://en.wikipedia.org/wiki/Beautiful_Soup" rel="nofollow noreferrer">Beautiful Soup</a>, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm interested in hearing about other languages as well.</p>
<p>The story so far:</p>
<ul>
  <li>Python
    <ul>
      <li><a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow noreferrer">Beautiful Soup</a></li>
      <li><a href="http://codespeak.net/lxml/" rel="nofollow noreferrer">lxml</a></li>
      <li><a href="http://htql.net/" rel="nofollow noreferrer">HTQL</a></li>
      <li><a href="http://scrapy.org/" rel="nofollow noreferrer">Scrapy</a></li>
      <li><a href="http://wwwsearch.sourceforge.net/mechanize/" rel="nofollow noreferrer">Mechanize</a></li>
    </ul>
  </li>
  <li>Ruby
    <ul>
      <li><a href="http://nokogiri.org/" rel="nofollow noreferrer">Nokogiri</a></li>
      <li><a href="https://github.com/hpricot/hpricot/" rel="nofollow noreferrer">Hpricot</a></li>
      <li><a href="https://github.com/tenderlove/mechanize" rel="nofollow noreferrer">Mechanize</a></li>
      <li><a href="http://rubyforge.org/projects/scrapi/" rel="nofollow noreferrer">scrAPI</a></li>
      <li><a href="http://scrubyt.org/" rel="nofollow noreferrer">scRUBYt!</a></li>
      <li><a href="https://github.com/felipecsl/wombat" rel="nofollow noreferrer">wombat</a></li>
      <li><a href="http://watir.com" rel="nofollow noreferrer">Watir</a></li>
    </ul>
  </li>
  <li>.NET
    <ul>
      <li><a href="http://html-agility-pack.net/?z=codeplex" rel="nofollow noreferrer">Html Agility Pack</a></li>
      <li><a href="http://watin.org/" rel="nofollow noreferrer">WatiN</a></li>
    </ul>
  </li>
  <li>Perl
    <ul>
      <li><a href="http://search.cpan.org/dist/WWW-Mechanize/" rel="nofollow noreferrer">WWW::Mechanize</a></li>
      <li><a href="http://search.cpan.org/dist/Web-Scraper/" rel="nofollow noreferrer">Web-Scraper</a></li>
    </ul>
  </li>
  <li>Java
    <ul>
      <li><a href="http://home.ccil.org/~cowan/XML/tagsoup/" rel="nofollow noreferrer">Tag Soup</a></li>
      <li><a href="http://htmlunit.sourceforge.net/" rel="nofollow noreferrer">HtmlUnit</a></li>
      <li><a href="http://web-harvest.sourceforge.net/" rel="nofollow noreferrer">Web-Harvest</a></li>
      <li><a href="http://sing.ei.uvigo.es/jarvest" rel="nofollow noreferrer">jARVEST</a></li>
      <li><a href="http://jsoup.org/" rel="nofollow noreferrer">jsoup</a></li>
      <li><a href="http://jericho.htmlparser.net/docs/index.html" rel="nofollow noreferrer">Jericho HTML Parser</a></li>
    </ul>
  </li>
  <li>JavaScript
    <ul>
      <li><a href="https://github.com/request/request" rel="nofollow noreferrer">request</a></li>
      <li><a href="https://github.com/cheeriojs/cheerio" rel="nofollow noreferrer">cheerio</a></li>
      <li><a href="http://medialab.github.io/artoo/" rel="nofollow noreferrer">artoo</a></li>
      <li><a href="https://github.com/johntitus/node-horseman" rel="nofollow noreferrer">node-horseman</a></li>
      <li><a href="http://phantomjs.org/" rel="nofollow noreferrer">phantomjs</a></li>
    </ul>
  </li>
  <li>PHP
    <ul>
      <li><a href="https://github.com/FriendsOfPHP/Goutte" rel="nofollow noreferrer">Goutte</a></li>
      <li><a href="https://github.com/hxseven/htmlSQL" rel="nofollow noreferrer">htmlSQL</a></li>
      <li><a href="http://sourceforge.net/projects/simplehtmldom/" rel="nofollow noreferrer">PHP Simple HTML DOM Parser</a></li>
      <li><a href="http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/" rel="nofollow noreferrer">PHP Scraping with CURL</a></li>
      <li><a href="https://github.com/ScarletsFiction/ScarletsQuery" rel="nofollow noreferrer">ScarletsQuery</a></li>
    </ul>
  </li>
  <li>Most of them
    <ul>
      <li><a href="http://www.screen-scraper.com/" rel="nofollow noreferrer">Screen-Scraper</a></li>
    </ul>
  </li>
</ul>
 
