Some observations and questions:

(1) ASCII is a subset of UTF-8, in the sense that any file that decodes successfully as ASCII will also decode successfully as UTF-8. So you can cross ASCII off your list.

(2) Are the two terms in findreplace ever going to include non-ASCII characters? Note that an answer of "yes" would indicate that the goal of writing the output file in the same character set as the input may be difficult or impossible to achieve.

(3) Why not write ALL output files in the SAME handles-all-Unicode-characters encoding, e.g. UTF-8?

(4) Do the UTF-8 files have a BOM?

(5) What other character sets do you reasonably expect to need to handle?

(6) Which of the four possibilities, (UTF-16LE / UTF-16BE) x (BOM / no BOM), are you calling "UTF-16"? Note that I'm deliberately not trying to infer anything from the presence of 'utf-16' in your code. (A byte-level sketch at the end of this answer shows what each of the four looks like.)

(7) Note that chardet doesn't detect UTF-16LE or UTF-16BE without a BOM, and it has other blind spots with non-*x and older charsets. (See the illustration at the end of this answer.)

Update: Here are some code snippets that you can use to determine what "ANSI" is, and to try decoding using a restricted list of encodings. Note: this presumes a Windows environment.

    # Determine what "ANSI" means on this machine.
    import locale
    ansi = locale.getdefaultlocale()[1]  # produces 'cp1252' on my Windows box

    with open("input_file_path", "rb") as f:
        data = f.read()

    if data.startswith(b"\xEF\xBB\xBF"):                # UTF-8 "BOM"
        encodings = ["utf-8-sig"]
    elif data.startswith((b"\xFF\xFE", b"\xFE\xFF")):   # UTF-16 BOMs
        encodings = ["utf16"]
    else:
        encodings = ["utf8", ansi, "utf-16le"]

    # ascii is a subset of both "ANSI" and UTF-8, so you don't need it.
    # ISO-8859-1 aka latin1 defines all 256 bytes as valid codepoints, so it
    # will decode ANYTHING; if you feel that you must include it, put it LAST.
    # It is possible that a utf-16le file may be decoded without exception by
    # the "ansi" codec, and vice versa. Checking that your input text makes
    # sense, always a very good idea, is especially important when you are
    # guessing encodings.

    for enc in encodings:
        try:
            udata = data.decode(enc)
            break
        except UnicodeDecodeError:
            pass
    else:
        raise Exception("unknown encoding")

    # udata is your file contents as a str (unicode) object.
    # When writing the output file, use "utf-8-sig" as the encoding if you
    # want a BOM at the start.
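Following up the last comment in the snippet: a minimal sketch of writing the result back out with a UTF-8 BOM, assuming Python 3's built-in open. "output_file_path" is a placeholder name, and udata is the decoded contents from above.

    # "output_file_path" is a hypothetical name; the "utf-8-sig" codec writes
    # a UTF-8 BOM at the start of the file, then the encoded text.
    with open("output_file_path", "w", encoding="utf-8-sig") as out:
        out.write(udata)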
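To make point (6) concrete, here is a small standard-library-only sketch, using an arbitrary two-character sample, that prints the bytes produced by each of the four UTF-16 possibilities. The b"\xFF\xFE" / b"\xFE\xFF" prefixes are exactly what the BOM-sniffing branch above looks for.

    text = "Az"
    bom = "\ufeff"  # U+FEFF, serialised at the start of a stream as the BOM
    for label, payload in [
        ("UTF-16LE, no BOM", text.encode("utf-16le")),
        ("UTF-16BE, no BOM", text.encode("utf-16be")),
        ("UTF-16LE + BOM", (bom + text).encode("utf-16le")),
        ("UTF-16BE + BOM", (bom + text).encode("utf-16be")),
    ]:
        print(label, payload)
    # UTF-16LE, no BOM b'A\x00z\x00'
    # UTF-16BE, no BOM b'\x00A\x00z'
    # UTF-16LE + BOM   b'\xff\xfeA\x00z\x00'
    # UTF-16BE + BOM   b'\xfe\xff\x00A\x00z'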
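And to illustrate point (7): a sketch assuming the third-party chardet package is installed. The exact dictionaries returned vary by chardet version, so treat the expectations in the comments as indicative rather than guaranteed; run it against your own data.

    import chardet

    text = "G'day, world!"
    # Python's "utf-16" codec prepends a BOM; "utf-16le" does not.
    print(chardet.detect(text.encode("utf-16")))    # BOM present: expect UTF-16 reported
    print(chardet.detect(text.encode("utf-16le")))  # no BOM: expect anything BUT UTF-16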