StackOverflow2013

<p>You need to unescape HTML entities, and URL-unquote.<br>
The standard library has <a href="http://docs.python.org/library/htmlparser.html?highlight=htmlparser#HTMLParser" rel="nofollow"><code>HTMLParser</code></a> and <a href="http://docs.python.org/library/urllib2.html?highlight=urllib2#urllib2" rel="nofollow"><code>urllib2</code></a> to help with those tasks.</p>

<pre><code>import HTMLParser, urllib2

markup = '''&lt;a href="mailto:lad%20at%20maestro%20dot%20com"&gt;
&lt;em&gt;ada&amp;#x40;graphics.maestro.com&lt;/em&gt;
&lt;em&gt;mel&amp;#x40;graphics.maestro.com&lt;/em&gt;'''

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)
</code></pre>

<p>Result:</p>

<pre><code>&lt;a href="mailto:lad at maestro dot com"&gt;
&lt;em&gt;ada@graphics.maestro.com&lt;/em&gt;
&lt;em&gt;mel@graphics.maestro.com&lt;/em&gt;
</code></pre>

<hr>

<p>Edit:<br>
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.<br>
The sample file you uploaded has charset set to <code>cp-1252</code>, so let's try decoding from that to Unicode:</p>

<pre><code>import codecs

with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))

with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
</code></pre>

<hr>

<p>Edit2:<br>
If you don't care about the non-ASCII characters you can simplify a bit:</p>

<pre><code>with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
</code></pre>
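<p>The snippets in this answer use the Python 2 modules <code>HTMLParser</code> and <code>urllib2</code>. For readers on Python 3, the same two steps (URL-unquote, then unescape HTML entities) can probably be sketched as below. The helper names <code>decode_markup</code> and <code>convert_file</code> are hypothetical and not part of the original answer; the <code>cp1252</code> default is taken from the sample file mentioned above.</p>

```python
import html
from urllib.parse import unquote


def decode_markup(markup: str) -> str:
    """URL-unquote percent escapes, then unescape HTML entities.

    Hypothetical helper: same order of operations as the answer's one-liner.
    """
    return html.unescape(unquote(markup))


def convert_file(src: str, dst: str, encoding: str = "cp1252") -> None:
    """Decode on input, convert, and encode on output (hypothetical helper).

    Assumes the whole file fits in memory and that every unescaped
    character can be represented in the target encoding; otherwise the
    write raises UnicodeEncodeError.
    """
    with open(src, encoding=encoding) as fin:
        decoded = fin.read()
    with open(dst, "w", encoding=encoding) as fout:
        fout.write(decode_markup(decoded))


markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = decode_markup(markup)
print(result)
```

<p>In Python 3, <code>open()</code> accepts an <code>encoding</code> argument directly, so <code>codecs.open</code> should not be needed for this case.</p>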
