<p>First of all, those aren't Microsoft Word entities; they <strong>are</strong> UTF-8 byte sequences, and you're converting them to HTML entities.</p>
<p>The Pythonic way to write something like:</p>
<pre><code>chr(0xe2) . chr(0x80) . chr(0x98)
</code></pre>
<p>would be:</p>
<pre><code>'\xe2\x80\x98'
</code></pre>
<p>But Python already has built-in functionality for the type of conversion you want to do:</p>
<pre><code>def defang(string):
    # Decode the UTF-8 bytes, then re-encode as ASCII; every non-ASCII
    # character becomes a numeric character reference.
    return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
</code></pre>
<p>This will replace the UTF-8 byte sequences in a string for characters like <code>‘</code> with numeric entities like <code>&amp;#8216;</code>.</p>
<p>If you want to replace those numeric entities with named ones where possible:</p>
<pre><code>import re
from htmlentitydefs import codepoint2name

def convert_match_to_named(match):
    # Swap a numeric entity for its named form when one exists.
    num = int(match.group(1))
    if num in codepoint2name:
        return "&amp;%s;" % codepoint2name[num]
    else:
        return match.group(0)

def defang_named(string):
    return re.sub(r'&amp;#(\d+);', convert_match_to_named, defang(string))
</code></pre>
<p>And use it like so:</p>
<pre><code>&gt;&gt;&gt; defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
'&amp;ldquo;Hello, world!&amp;rdquo;'
</code></pre>
<hr>
<p>To complete the answer, the equivalent code to your example to process a file would look something like this:</p>
<pre><code># in Python, it's common to operate a line at a time on a file instead of
# reading the entire thing into memory
my_file = open("test100.html")
for line in my_file:
    # trailing comma: the line already ends with its own newline
    print defang_named(line),
my_file.close()
</code></pre>
<p>Note that this answer is targeted at Python 2.5; the Unicode situation is dramatically different for Python 3+.</p>
<p>I also agree with bobince's comment below: if you can just keep the text in UTF-8 format and send it with the correct content-type and charset, do that; if you need it to be in ASCII, then stick with the numeric entities, since there's really no need to use the named ones.</p>
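<p>Since the code above targets Python 2.5, here is a rough sketch of how the same idea might look on Python 3. It is not part of the original answer. It assumes the text is already a <code>str</code> (or UTF-8 <code>bytes</code> that you decode first), and the helper name <code>defang_utf8_bytes</code> is made up for illustration. The main differences are that <code>htmlentitydefs</code> became <code>html.entities</code>, and that <code>encode</code> returns <code>bytes</code>, which must be decoded back to <code>str</code>.</p>

```python
import re
from html.entities import codepoint2name


def defang(text):
    """Replace every non-ASCII character in a str with a numeric entity."""
    # encode() yields ASCII bytes; decode them to get a str back.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def _to_named(match):
    """Swap a numeric entity for its named form when one exists."""
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    return match.group(0)


def defang_named(text):
    """Like defang(), but prefer named entities such as &ldquo;."""
    return re.sub(r"&#(\d+);", _to_named, defang(text))


def defang_utf8_bytes(data):
    """Hypothetical helper: accept raw UTF-8 bytes instead of a str."""
    return defang_named(data.decode("utf-8"))
```

<p>When reading a file on Python 3, opening it with <code>open("test100.html", encoding="utf-8")</code> gives you <code>str</code> lines directly, so each line can be passed straight to <code>defang_named</code>.</p>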
 
