  1. File.listFiles() mangles Unicode names with JDK 6 (Unicode Normalization issues)
    <p>I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the <code>File.listFiles()</code> and related methods seem to return file names in a different encoding than the rest of the system.</p>
<p>Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in comparing file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.</p>
<p>Here is a program to demonstrate. It creates a file with a Unicode name, then prints out <strong>URL-encoded</strong> versions of the file names obtained from the directly-created File and from the same file when listed under its parent directory (you should run this code in an empty directory). The results show the different encoding returned by the <code>File.listFiles()</code> method.</p>
<pre><code>String fileName = "Trîcky Nåme";
File file = new File(fileName);
file.createNewFile();
System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

// Get parent (current) dir and list file contents
File parentDir = file.getAbsoluteFile().getParentFile();
File[] children = parentDir.listFiles();
for (File child: children) {
    System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
}
</code></pre>
<p>Here's what I get when I run this test code on my systems. Note the <code>%CC</code> versus <code>%C3</code> character representations.</p>
<p>OS X Snow Leopard:</p>
<pre><code>File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)
</code></pre>
<p>Kubuntu Linux (running in a VM on the same OS X system):</p>
<pre><code>File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
</code></pre>
<p>I have tried various hacks to get the strings to agree, including setting the <code>file.encoding</code> system property and various <code>LC_CTYPE</code> and <code>LANG</code> environment variables. Nothing helps, nor do I want to resort to such hacks.</p>
<p>Unlike <a href="https://stackoverflow.com/questions/2423781/chinese-encoding-issue-while-listing-files">this (somewhat related?) question</a>, I am able to read data from the listed files despite the odd names.</p>
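<p>For reference, the two URL-encoded outputs above look like the decomposed and precomposed Unicode forms of the same characters: <code>%CC%82</code> is the UTF-8 encoding of the combining circumflex U+0302, while <code>%C3%AE</code> is the precomposed <code>î</code> (U+00EE). A minimal sketch of comparing names after normalizing both sides with <code>java.text.Normalizer</code> (available since Java 6) might look like the following. It assumes the only difference between the strings is the normalization form; the class <code>NameNormalizer</code> and its methods are hypothetical names made up for this illustration.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the JDK.
class NameNormalizer {

    /** Returns the name in Unicode Normalization Form C (precomposed). */
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    /** Compares two file names after normalizing both to NFC. */
    static boolean sameName(String a, String b) {
        return toNfc(a).equals(toNfc(b));
    }

    /** URL-encodes as UTF-8, as in the test program above. */
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "i" + combining circumflex, "a" + combining ring above
        String decomposed = "Tri\u0302cky Na\u030Ame";
        // precomposed i-circumflex and a-ring
        String composed = "Tr\u00EEcky N\u00E5me";

        System.out.println("Raw equals:        " + decomposed.equals(composed));
        System.out.println("Normalized equals: " + sameName(decomposed, composed));
        System.out.println("Decomposed: " + urlEncode(decomposed));
        System.out.println("After NFC:  " + urlEncode(toNfc(decomposed)));
    }
}
```

<p>Normalizing both the local and the remote name to the same form before comparing sidesteps the question of which form either side happens to return. NFC is used here only as one reasonable choice; NFD would work equally well as long as both sides use it.</p>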
 
