<p>It's more likely that the <code>\uFEFF</code> character is part of the content read from the file. I doubt it was inserted by the tokeniser. <code>\uFEFF</code> at the beginning of a file is a <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark" rel="noreferrer">Byte Order Mark</a> (BOM). If it appears anywhere else, it is treated as a <a href="http://en.wikipedia.org/wiki/Zero-width_non-breaking_space" rel="noreferrer">zero-width no-break space</a>, a use that Unicode has since deprecated.</p>
<p>Was the file written by Microsoft Notepad? From <a href="http://docs.python.org/library/codecs.html#encodings-and-unicode" rel="noreferrer">the codecs module docs</a>:</p>
<blockquote> <p>To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.</p> </blockquote>
<p>Try reading your file using <a href="http://docs.python.org/library/codecs.html#codecs.open" rel="noreferrer"><code>codecs.open()</code></a> instead. Note the <code>"utf-8-sig"</code> encoding, which consumes the BOM if one is present.</p>
<pre><code>import codecs
import nltk

# Raw string, so that "\t" in the path is not read as a tab character.
f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()
a = nltk.word_tokenize(text)
</code></pre>
<p>Experiment:</p>
<pre><code>&gt;&gt;&gt; open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
&gt;&gt;&gt; import codecs
&gt;&gt;&gt; codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
&gt;&gt;&gt;
</code></pre>
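<p>For readers on Python 3, the same experiment can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name <code>read_text_without_bom</code> and the temporary demo file are made up for the example, and the built-in <code>open()</code> with an <code>encoding</code> argument stands in for <code>codecs.open()</code>.</p>

```python
import codecs
import os
import tempfile


def read_text_without_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present.

    Hypothetical helper for illustration only.
    """
    # "utf-8-sig" also decodes plain UTF-8; it only strips a BOM at the start.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


# Build a small demo file that starts with the UTF-8 BOM bytes (EF BB BF),
# similar to what Notepad would have written.
demo_path = os.path.join(tempfile.mkdtemp(), "x.txt")
with open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "müsli".encode("utf-8"))

# Decoding with plain "utf-8" keeps the BOM as a leading U+FEFF character.
with open(demo_path, "r", encoding="utf-8") as f:
    raw = f.read()

# Decoding with "utf-8-sig" consumes it.
clean = read_text_without_bom(demo_path)

print(repr(raw))
print(repr(clean))
```

<p>The resulting <code>clean</code> string could then be passed to <code>nltk.word_tokenize()</code> as in the answer above, so the first token should no longer carry the stray character.</p>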
 
