What is the state of the art in HTML content extraction?
    <p>There's a lot of scholarly work on HTML content extraction, e.g., Gupta &amp; Kaiser (2005) <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.357" rel="nofollow noreferrer">Extracting Content from Accessible Web Pages</a>, and some signs of interest here, e.g., <a href="https://stackoverflow.com/questions/435547/html-downloading-and-text-extraction">one</a>, <a href="https://stackoverflow.com/questions/1386107/text-extraction-from-html-java">two</a>, and <a href="https://stackoverflow.com/questions/1696914/extracting-pure-content-text-from-html-pages-by-excluding-navigation-and-chrome">three</a>, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?</p> <p>Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.</p> <p><strong>Postscript the first</strong>: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature and a number of existing implementations, and analyses how successful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.</p> <p><strong>Postscript the second</strong>: To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.</p>
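To make the two subquestions concrete, here is a minimal sketch of one naive cruft-from-meat heuristic: split the page into blocks, then keep blocks that have enough words and little link text. It is only an illustration, not the state of the art and not the method of any paper or answer cited above. The names `BlockCollector` and `extract_main_text`, the tag sets, and the thresholds `min_words` and `max_link_density` are all assumptions made up for this example. The standard library's lenient `html.parser` stands in for the clean-up step (i); Beautiful Soup could fill that role instead.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real pages may need different ones.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # list of (text, link_chars) tuples
        self._text = []        # text fragments of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            # A new block starts, even if the previous one was never closed.
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Return blocks that look like content: wordy and not mostly links."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        density = min(1.0, link_chars / len(text))
        if len(text.split()) >= min_words and density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_main_text(page_source)` on a page's HTML returns a list of the surviving text blocks. A heuristic like this drops navigation lists and link-heavy promotional blocks, but it will also drop short genuine content such as captions or one-line paragraphs, which is part of why the cruft-from-meat problem remains open.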