How can I identify different encodings without the use of a BOM?
    <p>I have a file watcher that grabs content from a growing file encoded in UTF-16LE. The first chunk of data written to it includes the BOM, which I was using to distinguish the encoding from UTF-8 (which most of my incoming files are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that, since it's a growing file, not every chunk of data has the BOM in it.</p>
    <p>Here's my question: without prepending the BOM bytes to each set of data I have (<strong>because I don't have control over the source</strong>), can I just look for the null bytes (\000) that are inherent in UTF-16 and use them as my identifier instead of the BOM? Will this cause me headaches down the road?</p>
    <p>My architecture involves a Ruby web application that logs the received data to a temporary file, which my parser, written in Java, then picks up.</p>
    <p>Right now my identification/re-encoding code looks like this:</p>
    <pre><code>// guess encoding if utf-16 then
// convert to UTF-8 first
try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) &amp;&amp; (contents[1] == (byte)0xFE) ) {
        String asString = new String(contents, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF8");
        FileOutputStream fos = new FileOutputStream(args[args.length-1]);
        fos.write(newBytes);
        fos.close();
    }

    fis.close();
} catch(Exception e) {
    e.printStackTrace();
}
</code></pre>
    <p><strong>UPDATE</strong></p>
    <p>I want to support characters such as euro signs, em dashes, and the like. I modified the above code to look like this, and it seems to pass all my tests for those characters:</p>
    <pre><code>// guess encoding if utf-16 then
// convert to UTF-8 first
try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) &amp;&amp; (contents[1] == (byte)0xFE) ) {
        found = 3;
        real = contents;

    // no BOM detected but still could be UTF-16
    } else {

        for(int cnt=0; cnt&lt;10; cnt++) {
            if(contents[cnt] == (byte)0x00) { found++; }

            real = new byte[contents.length+2];
            real[0] = (byte)0xFF;
            real[1] = (byte)0xFE;

            // tack on BOM and copy over new array
            for(int ib=2; ib &lt; real.length; ib++) {
                real[ib] = contents[ib-2];
            }
        }
    }

    if(found &gt;= 2) {
        String asString = new String(real, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF8");
        FileOutputStream fos = new FileOutputStream(args[args.length-1]);
        fos.write(newBytes);
        fos.close();
    }

    fis.close();
} catch(Exception e) {
    e.printStackTrace();
}
</code></pre>
    <p>What do you all think?</p>
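    <p>For comparison, here is a minimal sketch of the null-byte idea as a standalone helper. It is only a heuristic, not a definitive detector: the class name <code>Utf16Sniffer</code>, the methods <code>guessCharset</code> and <code>toUtf8</code>, the 1024-byte sample size, and the 40%/10% thresholds are all assumptions made for illustration. It also assumes each chunk starts on a two-byte boundary and that the text is mostly Latin, since characters such as the euro sign (U+20AC, bytes AC 20 in UTF-16LE) contain no null byte at all.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original code.
class Utf16Sniffer {

    // Guesses the charset of a chunk: BOM first, then null-byte positions.
    static Charset guessCharset(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        }

        // Look at an even number of bytes, at most 1024 (arbitrary sample size).
        int sample = Math.min(data.length, 1024) & ~1;
        int zerosEven = 0;
        int zerosOdd = 0;
        for (int i = 0; i < sample; i++) {
            if (data[i] == 0) {
                if ((i & 1) == 0) zerosEven++; else zerosOdd++;
            }
        }

        int pairs = sample / 2;
        if (pairs > 0) {
            // Mostly-Latin UTF-16LE: high byte (odd index) is usually zero.
            // Thresholds (at least 40% on one side, under 10% on the other)
            // are assumptions chosen for this sketch.
            if (zerosOdd * 10 >= pairs * 4 && zerosEven * 10 < pairs) {
                return StandardCharsets.UTF_16LE;
            }
            // Mostly-Latin UTF-16BE: high byte sits at the even index.
            if (zerosEven * 10 >= pairs * 4 && zerosOdd * 10 < pairs) {
                return StandardCharsets.UTF_16BE;
            }
        }
        // Fall back to UTF-8, the encoding most files are said to use.
        return StandardCharsets.UTF_8;
    }

    // Decodes with the guessed charset and re-encodes as UTF-8 without a BOM.
    static byte[] toUtf8(byte[] data) {
        String text = new String(data, guessCharset(data));
        // The UTF-16LE/BE and UTF-8 decoders keep a leading BOM as U+FEFF.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

    <p>Calling <code>Utf16Sniffer.toUtf8(contents)</code> on each chunk would replace the manual BOM check and the "tack on a BOM" copy. Comparing null bytes at even and odd offsets, rather than counting them anywhere in the first ten bytes, is what lets the sketch tell little-endian from big-endian data, but a chunk that begins in the middle of a two-byte code unit would still be misread.</p>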