You certainly can use Hpricot to scrape content from any given HTML page.

Here is a step-by-step tutorial: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Hpricot is ideal for parsing a file with a known HTML structure using [XPath](http://en.wikipedia.org/wiki/XPath) expressions.

However, you will struggle to write anything generic that can read *any* web page and identify the main article text. I think you'd need some sort of rudimentary AI for that (at least), which is well outside the scope of what Hpricot can do.

What you could do is perhaps write a separate scraper for each of the common HTML formats you want to handle (WordPress, Tumblr, Blogger, etc.), if there is such a set.

I am also sure you could come up with some [heuristics](http://en.wikipedia.org/wiki/Heuristic) for attempting it. Judging by how Readability behaves, that is what I guess they do, and it seems to work far from perfectly.

First stab at a heuristic (a rough sketch in code follows the list):

1. Identify a fixed set of tags which could be considered part of "the main block of text" (e.g. `<p>`, `<br>`, `<img>`).
2. Scrape the page and find the largest block of text on the page that only contains tags from (1).
3. Return the text from (2) with the tags from (1) removed.

Looking at the results of Readability, I reckon this heuristic would work about as well.
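A rough sketch of that heuristic with Hpricot might look something like the code below. The allowed-tag and container-tag lists, the example URL, and the decision to only inspect direct children are assumptions made for illustration, not a tested implementation:

```ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'

# Step 1: a fixed set of tags allowed inside "the main block of text".
# This particular list is just a guess -- tune it for your pages.
ALLOWED_TAGS = %w[p br img em strong a]

# Containers worth examining as candidate blocks (also an assumption).
CONTAINER_TAGS = %w[div td]

def main_text(url)
  doc  = Hpricot(open(url))
  best = ""

  CONTAINER_TAGS.each do |tag|
    (doc/tag).each do |block|
      # Step 2: keep only blocks whose direct child elements are all allowed.
      children = (block.children || []).select { |c| c.is_a?(Hpricot::Elem) }
      next unless children.all? { |el| ALLOWED_TAGS.include?(el.name) }

      # Track the largest qualifying block by text length.
      text = block.inner_text.strip
      best = text if text.length > best.length
    end
  end

  # Step 3: inner_text has already stripped the markup, so just return it.
  best
end

puts main_text("http://example.com/some-article")
```

Checking only the direct children keeps the sketch short; a real version would probably need to walk the whole subtree and score blocks by text density rather than raw length.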