<p>You need to unescape the HTML entities and URL-unquote the text.<br>
In Python 2, the standard library has <a href="http://docs.python.org/library/htmlparser.html?highlight=htmlparser#HTMLParser" rel="nofollow"><code>HTMLParser</code></a> and <a href="http://docs.python.org/library/urllib2.html?highlight=urllib2#urllib2" rel="nofollow"><code>urllib2</code></a> to help with those tasks.</p>
<pre><code>import HTMLParser, urllib2

markup = '''&lt;a href="mailto:lad%20at%20maestro%20dot%20com"&gt;
&lt;em&gt;ada&amp;#x40;graphics.maestro.com&lt;/em&gt;
&lt;em&gt;mel&amp;#x40;graphics.maestro.com&lt;/em&gt;'''

# URL-unquote first (%20 becomes a space), then unescape entities (&amp;#x40; becomes @)
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)
</code></pre>
<p>Result:</p>
<pre><code>&lt;a href="mailto:lad at maestro dot com"&gt;
&lt;em&gt;ada@graphics.maestro.com&lt;/em&gt;
&lt;em&gt;mel@graphics.maestro.com&lt;/em&gt;
</code></pre>
<hr>
<p>Edit:<br>
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.<br>
The sample file you uploaded has its charset set to <code>cp-1252</code>, so let's try decoding from that to Unicode:</p>
<pre><code>import codecs
import HTMLParser, urllib2

# filename is the path to your input page
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))

with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
</code></pre>
<hr>
<p>Edit2:<br>
If you don't care about the non-ASCII characters, you can simplify a bit by dropping them on input:</p>
<pre><code>with open(filename) as fin:
    decoded = fin.read().decode('ascii', 'ignore')
...
</code></pre>
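<p>The snippets above target Python 2. On Python 3 the same two steps are available as <code>html.unescape</code> (3.4+) and <code>urllib.parse.unquote</code>. What follows is only a minimal sketch under that assumption: the helper names <code>clean_markup</code> and <code>clean_file</code> are hypothetical, and the <code>cp1252</code> default simply mirrors the charset mentioned above.</p>

```python
import html
from urllib.parse import unquote

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''


def clean_markup(text):
    """URL-unquote first, then unescape HTML entities (hypothetical helper)."""
    return html.unescape(unquote(text))


def clean_file(src_path, dst_path, encoding="cp1252"):
    """Decode on input, clean, encode on output (hypothetical helper)."""
    with open(src_path, encoding=encoding) as fin:
        decoded = fin.read()
    # Characters that the target encoding cannot represent are written
    # back as numeric character references instead of raising an error.
    with open(dst_path, "w", encoding=encoding,
              errors="xmlcharrefreplace") as fout:
        fout.write(clean_markup(decoded))


result = clean_markup(markup)
for line in result.split("\n"):
    print(line)
```

<p>As in the original, the order matters: unquoting runs before unescaping, so percent-escapes and entities are each decoded exactly once.</p>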
 
