StackOverflow2013

<p><code>repr()</code> is your friend (except on Python 3.x; use <code>ascii()</code> instead).</p>
<pre><code>prompt>\python26\python -c "print repr(open('report.csv','rb').read()[:300])"
'\xff\xfeW\x00e\x00b\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00I\x00n\x00t\x00e
\x00r\x00e\x00s\x00t\x00:\x00 \x00f\x00o\x00o\x00b\x00a\x00r\x00\r\x00\n\x00W\x0
[snip]
x001\x007\x00\t\x004\x004\x00\r\x00\n\x002\x000\x00'
</code></pre>
<p>Sure looks like a UTF-16LE BOM (U+FEFF) in the first two bytes to me.</p>
<p>Notepad.* are NOT your friends. UTF-16 should <em>not</em> be referred to as "UCS-2" or "Unicode".</p>
<p>The following should help with what to do next:</p>
<pre><code>>>> import codecs
>>> lines = list(codecs.open('report.csv', 'r', encoding='UTF-16'))
>>> import pprint
>>> pprint.pprint(lines[:8])
[u'Web Search Interest: foobar\r\n',
 u'Worldwide; 2004 - present\r\n',
 u'\r\n',
 u'Interest over time\r\n',
 u'Week\tfoobar\r\n',
 u'2004-01-04 - 2004-01-10\t44\r\n',
 u'2004-01-11 - 2004-01-17\t44\r\n',
 u'2004-01-18 - 2004-01-24\t37\r\n']
>>>
</code></pre>
<p><strong>Update:</strong> Why your output file looks like gobbledegook.</p>
<p>Firstly, you are looking at the files with something (Notepad.* maybe) that knows that the files are allegedly encoded in UTF-16LE, and displays them accordingly. So your input file looks fine.</p>
<p>However, your script is reading the input file as raw bytes. It then writes the output file as raw bytes in text mode (<code>'w'</code>) as opposed to binary mode (<code>'wb'</code>). Because you are on Windows, every <code>\n</code> will be replaced by <code>\r\n</code>. This adds one byte (HALF of a UTF-16 character) to every line, so every SECOND line will be bassackwards, aka UTF-16BE: the letter A, which is <code>\x41\x00</code> in UTF-16LE, will lose its trailing <code>\x00</code> and pick up a leading byte (probably <code>\x00</code>) from the character to the left. Read as UTF-16LE, <code>\x00\x41</code> is U+4100, a CJK ("Asian") character.</p>
<p>Suggested reading: the <a href="http://www.amk.ca/python/howto/unicode" rel="nofollow noreferrer">Python Unicode HOWTO</a> and <a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow noreferrer">this piece by Joel</a>.</p>
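<p>To make the byte-shift explanation above concrete, here is a minimal sketch. It assumes Python 3 (the session above is Python 2.6), and the helper names <code>simulate_text_mode_write</code> and <code>convert_utf16_to_utf8</code> are hypothetical, invented for illustration; the asker's actual script is not shown here. The first helper imitates what Windows text mode does to raw bytes, and the second shows one way to avoid the problem: decode the file as text and re-encode it, rather than copying raw bytes.</p>

```python
import os
import tempfile

BOM_LE = b"\xff\xfe"  # U+FEFF as it appears at the start of a UTF-16LE file


def simulate_text_mode_write(raw: bytes) -> bytes:
    """Imitate Windows text mode on raw bytes: every LF byte gains a CR byte."""
    return raw.replace(b"\n", b"\r\n")


def convert_utf16_to_utf8(src: str, dst: str) -> None:
    """Decode src as UTF-16 (the BOM selects the byte order) and write UTF-8.

    newline='' on both sides disables newline translation, so the CR LF
    pairs already in the text are copied through unchanged.
    """
    with open(src, "r", encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


if __name__ == "__main__":
    # Two identical lines, encoded the way the report file is.
    raw = BOM_LE + "A\r\nA\r\n".encode("utf-16-le")
    damaged = simulate_text_mode_write(raw)
    print(ascii(raw.decode("utf-16")))      # intact text
    print(ascii(damaged.decode("utf-16")))  # second line is misaligned

    # Round-trip a small UTF-16 file through the converter.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "report.csv")
    dst = os.path.join(workdir, "report-utf8.csv")
    with open(src, "wb") as f:
        f.write(BOM_LE + "Week\tfoobar\r\n".encode("utf-16-le"))
    convert_utf16_to_utf8(src, dst)
    with open(dst, "rb") as f:
        print(ascii(f.read()))
```

<p>In the damaged bytes, the extra <code>\r</code> after the first line shifts everything that follows by one byte, so the second A is decoded from <code>\x00\x41</code> instead of <code>\x41\x00</code>. Writing in binary mode (<code>'wb'</code>) would also avoid the inserted bytes, but decoding first is what lets you actually work with the text.</p>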
