Is there a way to use Readability and Python to extract just text, not HTML?
<p>I need to extract pure text from an arbitrary web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those:</p> <ol> <li>an early <a href="https://github.com/gfxmonk/python-readability" rel="nofollow noreferrer">version by gfxmonk</a>, based on BeautifulSoup</li> <li>a <a href="http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/" rel="nofollow noreferrer">version by minvolai</a>, based on gfxmonk's, except that it uses lxml instead of BeautifulSoup, making it faster (according to minvolai, see the project page) at the cost of a dependency on lxml</li> <li>a <a href="https://github.com/buriy/python-readability" rel="nofollow noreferrer">version by Yuri Baburov aka buriy</a>. Like minvolai's, it depends on lxml. It also depends on <a href="https://github.com/dcramer/chardet.git" rel="nofollow noreferrer">chardet</a> to detect the encoding.</li> </ol> <p>I use Yuri's version, as it is the most recent and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.</p> <p>The advice in <a href="https://stackoverflow.com/questions/4589323/is-there-a-way-to-use-readability-text-extraction-algorithm-and-a-custom-algor">this Stack Overflow question about link extraction</a> is to use BeautifulSoup. I will, if there is no other choice, but BeautifulSoup would be yet another dependency, as I use the lxml-based version.</p> <p>My questions:</p> <ul> <li>Is there a way to get pure text from the Python Readability version that I use without forking the code?</li> <li>Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, a regex, or something else?</li> <li>If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification even desirable enough (to enough people) to make the extension official?</li> </ul>
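To illustrate the second question, here is a minimal sketch of stripping tags from an HTML fragment such as the one Readability returns. It is not part of any Readability port: the names `TextExtractor` and `html_to_text` are hypothetical, and it assumes Python 3 and only the standard library's `html.parser`, so it adds no dependency. With lxml already installed, `lxml.html.fromstring(fragment).text_content()` is a commonly used alternative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment as one plain string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse all whitespace runs.
    # Trade-off: inline markup inside a word ("wor<b>ld</b>") gains a space.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    fragment = "<div><p>Hello <b>world</b></p><p>Bye &amp; thanks</p></div>"
    print(html_to_text(fragment))
```

Joining text nodes with a space keeps adjacent paragraphs from running together; if exact spacing inside words matters, a tree-based approach such as lxml's `text_content()` may be a better fit.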