StackOverflow2013

<p>Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things, which have resolved my question:</p> <ol> <li><p>As discussed, on Windows <code>argv</code> is encoded using the current code page. However, you can retrieve the command line as UTF-16 using <code>GetCommandLineW</code>. Using <code>argv</code> is not recommended for modern Windows apps with Unicode support, because code pages are deprecated.</p></li> <li><p>On Unixes, <code>argv</code> has no fixed encoding:</p> <p>a) File names inserted by tab-completion/globbing will occur in <code>argv</code> <em>verbatim</em>, as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.</p> <p>b) Input entered directly by the user using their IME will occur in <code>argv</code> in the locale encoding. (Ubuntu seems to use the locale settings to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)</p></li> </ol> <p>This is annoying for languages such as Python, Haskell or <a href="https://stackoverflow.com/questions/27923366/how-does-the-jvm-determine-the-default-character-encoding-on-linux">Java</a>, which want to treat command line arguments as strings. They need to decide how to decode <code>argv</code> into whatever representation is used internally for a <code>String</code> (UTF-16 for Java; Python 3 and Haskell strings are sequences of Unicode code points). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.</p> <p>The solution to this problem adopted by Python 3 is a surrogate-escape scheme (<a href="http://www.python.org/dev/peps/pep-0383/" rel="nofollow noreferrer">http://www.python.org/dev/peps/pep-0383/</a>), which represents each undecodable byte in <code>argv</code> as a special Unicode code point (a lone surrogate in the range U+DC80 to U+DCFF). When that code point is encoded back to a byte stream, it just becomes the original byte again. This allows data from <code>argv</code> that is not valid in the current encoding (e.g. a filename named in something other than the current locale) to round-trip through the native Python string type and back to bytes with no loss of information.</p> <p>As you can see, the situation is pretty messy :-)</p>
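<p>To make the PEP 383 round trip concrete, here is a minimal sketch. It assumes a UTF-8 locale and uses a made-up Latin-1 filename (<code>café.txt</code>) as the undecodable input; it calls the <code>surrogateescape</code> error handler directly rather than going through <code>sys.argv</code>, so it only illustrates the mechanism.</p>

```python
# Hypothetical argv bytes: "café.txt" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding with the (assumed) UTF-8 locale encoding raises an exception.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With the PEP 383 handler, the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the original byte.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)         # False
print(ascii(text))       # 'caf\udce9.txt'
print(roundtrip == raw)  # True
```

<p>On Unix-like systems, <code>os.fsdecode</code> and <code>os.fsencode</code> apply this same error handler with the filesystem encoding, so they are usually the more convenient way to convert between such strings and the underlying bytes.</p>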
