What is the encoding of argv?
<p>It's not clear to me what encodings are used where in C's <code>argv</code>. In particular, I'm interested in the following scenario:</p>
<ul>
<li>A user uses locale L1 to create a file whose name, <code>N</code>, contains non-ASCII characters</li>
<li>Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument</li>
</ul>
<p>What sequence of bytes does P see on the command line?</p>
<p>I have observed that on Linux, creating a filename in a UTF-8 locale and then tab-completing it in (e.g.) the <code>zh_TW.big5</code> locale seems to cause my program P to be fed UTF-8 rather than <code>Big5</code>. However, on OS X the same series of actions results in my program P getting a <code>Big5</code> encoded filename.</p>
<p>Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):</p>
<h1>Windows</h1>
<p>File names are stored on disk in some Unicode format. So Windows takes the name <code>N</code>, converts from L1 (the current code page) to a Unicode version of <code>N</code> we will call <code>N1</code>, and stores <code>N1</code> on disk.</p>
<p>What I then <em>assume</em> happens is that when tab-completing later on, the name <code>N1</code> is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name <code>N</code>, but this won't be true if <code>N</code> contained characters unrepresentable in L2. We call the new name <code>N2</code>.</p>
<p>When the user actually presses Enter to run P with that argument, the name <code>N2</code> is converted back into Unicode, yielding <code>N1</code> again.
This <code>N1</code> is now available to the program in UTF-16 (formerly UCS-2) format via <code>GetCommandLineW</code>/<code>wmain</code>/<code>_tmain</code>, but users of <code>GetCommandLine</code>/<code>main</code> will see the name <code>N2</code> in the current locale (code page).</p>
<h1>OS X</h1>
<p>The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.</p>
<p>With a Unicode terminal, I <em>think</em> what happens is that the terminal builds the command line in a Unicode buffer. So when you tab-complete, it copies the file name as a Unicode file name to that buffer.</p>
<p>When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via <code>argv</code>, and the program can decode <code>argv</code> with the current locale into Unicode for display.</p>
<h1>Linux</h1>
<p>On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as <em>byte strings</em>, not in Unicode. So if you create a file with name <code>N</code> in locale L1, that <code>N</code>, as a byte string, is what is stored on disk.</p>
<p>When I later run the terminal and try to tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file <strong>as a byte string</strong> is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.</p>
<p>When you run a program, I think that buffer is sent directly to <code>argv</code>. Now, what encoding does <code>argv</code> have? It looks like any characters you typed on the command line while in locale L2 will be in the L2 encoding, but <strong>the file name will be in the L1 encoding</strong>. So <code>argv</code> contains a mixture of two encodings!</p>
<h1>Question</h1>
<p>I'd really like it if someone could let me know what is going on here.
All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for <code>argv</code> to be encoded in the current code page (Windows) or the current locale (Linux / OS X), but that doesn't seem to be the case...</p>
<h1>Extras</h1>
<p>Here is a simple candidate program P that lets you observe encodings for yourself:</p>
<pre><code>#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    if (argc &lt; 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    /* Print each byte of the first argument as an unsigned value (0-255),
       so bytes above 127 do not show up as negative numbers. */
    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(unsigned char)*c);
    }

    printf("\nLength: %d\n", len);
    return 0;
}
</code></pre>
<p>You can use <code>locale -a</code> to see available locales, and use <code>export LC_ALL=my_encoding</code> to change your locale.</p>
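<p>To go one step beyond dumping raw bytes, P could also test whether an argument is even decodable in a given locale's encoding. The following is only a rough sketch under some assumptions: <code>decoded_length</code> is a hypothetical helper name (not part of any library), and it relies on <code>mbstowcs</code> accepting a <code>NULL</code> destination to count characters, which is POSIX behaviour rather than something guaranteed by plain ISO C. A failed decode would be consistent with the "mixture of two encodings" guess above, though a successful decode does not prove the bytes were written in that encoding.</p>

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (name and return codes are assumptions for this sketch).
 * Returns the number of characters in `bytes` when decoded with the encoding
 * of the locale `locale_name`, -1 if that locale is not available, or -2 if
 * the bytes are not a valid sequence in that locale's encoding.
 * Pass "" to use the locale from the environment (LC_ALL, LC_CTYPE, LANG). */
long decoded_length(const char *bytes, const char *locale_name)
{
    /* Remember the current LC_CTYPE setting so it can be restored later. */
    char saved[256] = "C";
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current != NULL && strlen(current) < sizeof saved) {
        strcpy(saved, current);
    }

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        return -1; /* locale not installed; the current locale is unchanged */
    }

    /* Assumption: POSIX semantics. With a NULL destination, mbstowcs only
     * counts the wide characters and returns (size_t)-1 when it meets an
     * invalid multibyte sequence. */
    size_t n = mbstowcs(NULL, bytes, 0);

    setlocale(LC_CTYPE, saved);
    return n == (size_t)-1 ? -2 : (long)n;
}
```

<p>For example, calling <code>decoded_length(argv[1], "")</code> from <code>main</code> would report whether the argument decodes under the locale set via <code>LC_ALL</code>, alongside the byte dump from the program above.</p>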
 
