Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Yes, it’s frustrating—sometimes <code>type</code> and other programs print gibberish, and sometimes they do not.</p> <p>First of all, Unicode characters will only display <a href="http://support.microsoft.com/kb/99795">if the current console font contains the characters</a>. So use a TrueType font like Lucida Console instead of the default Raster Font.</p> <p>But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.</p> <p>When programs use standard C-library I/O functions like <code>printf</code>, <strong>the program’s output encoding must match the console’s output encoding</strong>, or you will get gibberish. <code>chcp</code> shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by <code>chcp</code>.</p> <p>Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:</p> <ul> <li><p>A program can get the console’s current codepage using <code>chcp</code> or <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms683169(v=vs.85).aspx"><code>GetConsoleOutputCP</code></a>, and configure itself to output in that encoding, or</p></li> <li><p>You or a program can set the console’s current codepage using <code>chcp</code> or <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms686036(v=vs.85).aspx"><code>SetConsoleOutputCP</code></a> to match the default output encoding of the program.</p></li> </ul> <p>However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms687401(v=vs.85).aspx"><code>WriteConsoleW</code></a>. This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx"><code>MultiByteToWideChar</code></a>. Also, <code>WriteConsoleW</code> will not work if the program’s output is redirected; more fiddling is needed in that case.</p> <p><code>type</code> works some of the time because it checks the start of each file for a UTF-16LE <a href="https://en.wikipedia.org/wiki/Byte_order_mark">Byte Order Mark (BOM)</a>, i.e. the bytes <code>0xFF 0xFE</code>. If it finds such a mark, it displays the Unicode characters in the file using <code>WriteConsoleW</code> regardless of the current codepage. But when <code>type</code>ing any file without a UTF-16LE BOM, or for using non-ASCII characters with any command that doesn’t call <code>WriteConsoleW</code>—you will need to set the console codepage and program output encoding to match each other.</p> <hr> <p>How can we find this out?</p> <p>Here’s a test file containing Unicode characters:</p> <pre><code>ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 </code></pre> <p>Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to <code>stdout</code>.</p> <pre class="lang-java prettyprint-override"><code>import java.io.*; public class Foo { private static final String BOM = "\ufeff"; private static final String TEST_STRING = "ASCII abcde xyz\n" + "German äöü ÄÖÜ ß\n" + "Polish ąęźżńł\n" + "Russian абвгдеж эюя\n" + "CJK 你好\n"; public static void main(String[] args) throws Exception { String[] encodings = new String[] { "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" }; for (String encoding: encodings) { System.out.println("== " + encoding); for (boolean writeBom: new Boolean[] {false, true}) { System.out.println(writeBom ? "= bom" : "= no bom"); String output = (writeBom ? BOM : "") + TEST_STRING; byte[] bytes = output.getBytes(encoding); System.out.write(bytes); FileOutputStream out = new FileOutputStream("uc-test-" + encoding + (writeBom ? "-bom.txt" : "-nobom.txt")); out.write(bytes); out.close(); } } } } </code></pre> <p>The output in the default codepage? <strong>Total garbage!</strong></p> <pre><code>Z:\andrew\projects\sx\1259084&gt;chcp Active code page: 850 Z:\andrew\projects\sx\1259084&gt;java Foo == UTF-8 = no bom ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ = bom ´╗┐ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ == UTF-16LE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y = bom  ■A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y == UTF-16BE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} = bom ■  A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} == UTF-32LE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y = bom  ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y == UTF-32BE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} = bom ■  A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} </code></pre> <p>However, what if we <code>type</code> the files that got saved? They contain the exact same bytes that were printed to the console.</p> <pre><code>Z:\andrew\projects\sx\1259084&gt;type *.txt uc-test-UTF-16BE-bom.txt ■  A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} uc-test-UTF-16BE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} uc-test-UTF-16LE-bom.txt ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 uc-test-UTF-16LE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y uc-test-UTF-32BE-bom.txt ■  A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} uc-test-UTF-32BE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} uc-test-UTF-32LE-bom.txt A S C I I a b c d e x y z G e r m a n ä ö ü Ä Ö Ü ß P o l i s h ą ę ź ż ń ł R u s s i a n а б в г д е ж э ю я C J K 你 好 uc-test-UTF-32LE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y uc-test-UTF-8-bom.txt ´╗┐ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ uc-test-UTF-8-nobom.txt ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ </code></pre> <p>The <em>only</em> thing that works is UTF-16LE file, with a BOM, printed to the console via <code>type</code>.</p> <p>If we use anything other than <code>type</code> to print the file, we get garbage:</p> <pre><code>Z:\andrew\projects\sx\1259084&gt;copy uc-test-UTF-16LE-bom.txt CON  ■A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y 1 file(s) copied. </code></pre> <p>From the fact that <code>copy CON</code> does not display Unicode correctly, we can conclude that the <code>type</code> command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.</p> <p>We can see this by opening <code>cmd.exe</code> in a debugger when it goes to <code>type</code> out a file:</p> <p><img src="https://i.stack.imgur.com/XcX2H.png" alt="enter image description here"></p> <p>After <code>type</code> opens a file, it checks for a BOM of <code>0xFEFF</code>—i.e., the bytes <code>0xFF 0xFE</code> in little-endian—and if there is such a BOM, <code>type</code> sets an internal <code>fOutputUnicode</code> flag. This flag is checked later to decide whether to call <code>WriteConsoleW</code>.</p> <p>But that’s the only way to get <code>type</code> to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.</p> <p>You can emulate how <code>type</code> outputs Unicode to the console in your own programs like so:</p> <pre class="lang-c prettyprint-override"><code>#include &lt;stdio.h&gt; #define UNICODE #include &lt;windows.h&gt; static LPCSTR lpcsTest = "ASCII abcde xyz\n" "German äöü ÄÖÜ ß\n" "Polish ąęźżńł\n" "Russian абвгдеж эюя\n" "CJK 你好\n"; int main() { int n; wchar_t buf[1024]; HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE); n = MultiByteToWideChar(CP_UTF8, 0, lpcsTest, strlen(lpcsTest), buf, sizeof(buf)); WriteConsole(hConsole, buf, n, &amp;n, NULL); return 0; } </code></pre> <p>This program works for printing Unicode on the Windows console using the default codepage.</p> <hr> <p>For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:</p> <pre><code>Z:\andrew\projects\sx\1259084&gt;chcp 65001 Active code page: 65001 Z:\andrew\projects\sx\1259084&gt;java Foo == UTF-8 = no bom ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 ж эюя CJK 你好 你好 好 � = bom ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 еж эюя CJK 你好 你好 好 � == UTF-16LE = no bom A S C I I a b c d e x y z … </code></pre> <p>However, a C program that sets a Unicode UTF-8 codepage:</p> <pre class="lang-c prettyprint-override"><code>#include &lt;stdio.h&gt; #include &lt;windows.h&gt; int main() { int c, n; UINT oldCodePage; char buf[1024]; oldCodePage = GetConsoleOutputCP(); if (!SetConsoleOutputCP(65001)) { printf("error\n"); } freopen("uc-test-UTF-8-nobom.txt", "rb", stdin); n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin); fwrite(buf, sizeof(buf[0]), n, stdout); SetConsoleOutputCP(oldCodePage); return 0; } </code></pre> <p>does have correct output:</p> <pre><code>Z:\andrew\projects\sx\1259084&gt;.\test ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 </code></pre> <hr> <p>The moral of the story?</p> <ul> <li><code>type</code> can print UTF-16LE files with a BOM regardless of your current codepage</li> <li>Win32 programs can be programmed to output Unicode to the console, using <code>WriteConsoleW</code>.</li> <li>Other programs which set the codepage and adjust their output encoding accordingly can print Unicode on the console regardless of what the codepage was when the program started</li> <li>For everything else you will have to mess around with <code>chcp</code>, and will probably still get weird output.</li> </ul>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload