  1. Parse HTML to Plain Text
    <p>I'm trying to use the MLStripper class, which I found recommended in several posts, to strip the HTML from an email and get plain text. The strip_tags function fails while parsing because of the "@" sign. I guess this class is not robust enough to parse only valid HTML tags. Any recommendations on how to fix the code below so it handles the "@", or on another library that removes HTML from text? I also need to remove entities like &amp;.</p>
    <p><strong>Python:</strong></p>
<pre><code>from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

    def strip_tags(self, html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

ML = MLStripper()
test = ML.strip_tags("&lt;div&gt;&lt;br&gt;On Sep 27, 2012, at 4:11 PM, Mark Smith &lt;marksmith@gmail.com&gt; wrote&lt;/br&gt;&lt;/div&gt;")
print test
</code></pre>
    <p><strong>Error:</strong></p>
<pre><code>Traceback (most recent call last):
  File "IMAPReader.py", line 49, in &lt;module&gt;
    strippedText = ML.strip_tags("&lt;marksmith@gmail.com&gt;")
  File "IMAPReader.py", line 22, in strip_tags
    s.feed(html)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 9
</code></pre>
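One possible direction, offered only as a minimal sketch: it assumes moving to Python 3 is an option, which the question does not say. Python 3's `html.parser.HTMLParser` parses tolerantly and no longer raises `HTMLParseError`, and with `convert_charrefs=True` it decodes entities such as `&amp;` before calling `handle_data`. The names `TextExtractor` and `strip_tags` below are illustrative, not taken from any library. An angle-bracketed address like the one in the question will most likely be treated as an unknown tag and dropped rather than kept as text, so this avoids the exception but may not preserve the address.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return the text content of an HTML fragment (assumes Python 3)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.get_text()


if __name__ == "__main__":
    print(strip_tags("<p>Tom &amp; Jerry</p>"))  # Tom & Jerry
```

If the address itself has to survive, the input would need separate handling before parsing (for example, escaping angle brackets that wrap an email address); that step is not shown here.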