<p>There are two steps to what Instapaper does:</p> <ol> <li>Find the main <em>content</em> block on the page (excluding headers, footers, menus, etc.)</li> <li><em>Extract</em> and <em>format</em> the text from this content block</li> </ol>
<p>To find the content block (typically some HTML block element, such as a div containing the key page text), Instapaper uses an algorithm much like the one used by <a href="http://lab.arc90.com/2009/03/02/readability/" rel="nofollow noreferrer">readability</a>. You can look at the <a href="http://code.google.com/p/arc90labs-readability/downloads/detail?name=readability.js&amp;can=2&amp;q=" rel="nofollow noreferrer">source of readability.js</a> to see what's going on, but at its core it tries to find the area on the page with the highest text/link ratio. Some other simple scoring metrics also go into the heuristics (e.g. off the top of my head, things like the ratio of text to commas, the number of paragraph elements, etc.).</p>
<p>Once you have identified the root node element containing the relevant content, you'll need to format it. If you want, you can just pull that node out of the source document and insert it into yours, but in reality you'll probably want to remove the existing styles and apply your own for a standard look and feel. If you want nice text-only output, you can use Jericho's <a href="http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html" rel="nofollow noreferrer">Renderer</a>.</p>
<p><strong>update1</strong>: I should also mention something else Instapaper does, which is to follow the 'pagination' links (<em>the "next" or "1", "2", "3" links</em>) of the article to their conclusion, so that a piece that spans many pages in the original is rendered as a single document.</p>
<p><strong>update2</strong>: I recently came across this <a href="http://tomazkovacic.com/blog/2011/03/02/extracting-article-text-from-html-documents/" rel="nofollow noreferrer"><strong>comparison of text extraction algorithms</strong></a>.</p>
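<p>To make the text/link ratio idea concrete, here is a minimal Python sketch using only the standard library. It is not the actual code of readability.js or Instapaper: the names (<code>BlockScorer</code>, <code>find_main_block</code>), the candidate tags, the 25-character threshold and the point weights are all illustrative assumptions, and it assumes the article text sits in <code>&lt;p&gt;</code> elements. Each paragraph earns points for its length and commas, those points are credited to its parent and (at half weight) its grandparent, and every candidate's total is scaled down by the share of its text that sits inside links.</p>

```python
from html.parser import HTMLParser

# Assumed set of block elements that may hold an article (illustrative only).
CANDIDATE_TAGS = {"div", "article", "section", "td"}
# Void elements never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.text_len = 0         # characters of text anywhere below this node
        self.link_len = 0         # characters of that text inside <a> elements
        self.commas = 0
        self.content_score = 0.0  # points credited by child paragraphs


class BlockScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.nodes = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, dict(attrs))
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, plus anything left
        # open inside it (tolerates missing end tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                while len(self.stack) > i:
                    self.pop_node()
                break

    def handle_data(self, data):
        text = data.strip()
        open_tags = [n.tag for n in self.stack]
        if not text or "script" in open_tags or "style" in open_tags:
            return
        in_link = "a" in open_tags
        for node in self.stack:
            node.text_len += len(text)
            node.commas += text.count(",")
            if in_link:
                node.link_len += len(text)

    def pop_node(self):
        node = self.stack.pop()
        # Paragraphs with enough text credit their parent and grandparent.
        if node.tag == "p" and node.text_len >= 25:
            points = 1 + node.commas + min(node.text_len // 100, 3)
            if self.stack:
                self.stack[-1].content_score += points
            if len(self.stack) > 1:
                self.stack[-2].content_score += points / 2


def final_score(node):
    """Paragraph points scaled down by link density (link text / all text)."""
    if node.text_len == 0:
        return 0.0
    return node.content_score * (1 - node.link_len / node.text_len)


def find_main_block(html):
    """Return the best-scoring candidate Node, or None if there is none."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    while parser.stack:  # flush elements that were never closed
        parser.pop_node()
    candidates = [n for n in parser.nodes
                  if n.tag in CANDIDATE_TAGS and n.text_len > 0]
    return max(candidates, key=final_score) if candidates else None


SAMPLE_HTML = """
<html><body><div id="page">
  <div id="nav"><a href="/">Home</a> <a href="/about">About us</a></div>
  <div id="article">
    <p>Extraction starts by locating the main content block, skipping
       headers, footers, and menus.</p>
    <p>Once that block is found, its text is pulled out, cleaned up,
       and restyled for a consistent look and feel.</p>
  </div>
  <div id="footer"><p><a href="/terms">Terms and conditions of use</a></p></div>
</div></body></html>
"""

best = find_main_block(SAMPLE_HTML)
print(best.attrs.get("id"))
```

<p>Crediting paragraph points only to the parent and grandparent is what stops an outer wrapper div from winning simply because it contains everything. Scaling by link density is what pushes navigation bars and link-only footers towards zero.</p>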