StackOverflow2013

plurals

POHow can I identify different encodings without the use of a BOM?
primarykey
Id
1344452
data
AcceptedAnswerId
0
AnswerCount
3
ClosedDate
CommentCount
8
CommunityOwnedDate
CreationDate
2009-08-28T00:31:57.043
FavoriteCount
0
LastActivityDate
2009-08-28T20:33:47.123
LastEditDate
2009-08-28T20:33:47.123
LastEditorUserId
39110
OwnerUserId
39110
ParentId
0
PostTypeId
1
Score
0
ViewCount
1070
LastEditorDisplayName
text
Body
I have a file watcher that grabs content from a growing file encoded in UTF-16LE. The first chunk of data written to it includes the BOM, and I was using that to tell it apart from UTF-8 (which MOST of my incoming files are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that, since it's a growing file, not every chunk of data has the BOM in it.

Here's my question: without prepending the BOM bytes to each set of data I have (because I don't have control over the source), can I just look for the null bytes that are inherent in UTF-16 (\000) and use that as my identifier instead of the BOM? Will this cause me headaches down the road?

My architecture involves a Ruby web application logging the received data to a temporary file, where my parser, written in Java, picks it up.

Right now my identification/re-encoding code looks like this:

<pre><code>
// guess encoding if utf-16 then
// convert to UTF-8 first
try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
        String asString = new String(contents, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF8");
        FileOutputStream fos = new FileOutputStream(args[args.length-1]);
        fos.write(newBytes);
        fos.close();
    }

    fis.close();
} catch(Exception e) {
    e.printStackTrace();
}
</code></pre>

UPDATE

I want to support things like euro signs, em-dashes, and other such characters. I modified the above code to look like this, and it seems to pass all my tests for those characters:

<pre><code>
// guess encoding if utf-16 then
// convert to UTF-8 first
try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
        found = 3;
        real = contents;

    // no BOM detected but still could be UTF-16
    } else {

        for(int cnt=0; cnt<10; cnt++) {
            if(contents[cnt] == (byte)0x00) { found++; };

            real = new byte[contents.length+2];
            real[0] = (byte)0xFF;
            real[1] = (byte)0xFE;

            // tack on BOM and copy over new array
            for(int ib=2; ib < real.length; ib++) {
                real[ib] = contents[ib-2];
            }
        }

    }

    if(found >= 2) {
        String asString = new String(real, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF8");
        FileOutputStream fos = new FileOutputStream(args[args.length-1]);
        fos.write(newBytes);
        fos.close();
    }

    fis.close();
} catch(Exception e) {
    e.printStackTrace();
}
</code></pre>

What do you all think?
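For comparison, here is a minimal sketch of the null-byte idea described above, written so that no BOM has to be prepended. It is only an illustration under stated assumptions: the class name `Utf16Sniffer`, both method names, the 64-byte sample size, and the "more than half" threshold are all hypothetical choices, not anything from the original code. It assumes the text is mostly Latin-range characters, so that the high byte of most UTF-16LE code units is 0x00; data that is mostly non-Latin text would not be detected this way.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Sniffer {

    // Hypothetical heuristic: in UTF-16LE, Latin-range characters have 0x00
    // as the second byte of each 2-byte unit, so count zeros at odd indexes.
    public static boolean looksLikeUtf16Le(byte[] data) {
        // Sample at most 64 bytes, rounded down to whole 2-byte units.
        int limit = Math.min(data.length, 64) & ~1;
        if (limit < 4) {
            return false; // too little data to guess from
        }
        int zeroHighBytes = 0;
        for (int i = 1; i < limit; i += 2) {
            if (data[i] == 0) {
                zeroHighBytes++;
            }
        }
        int units = limit / 2;
        // Assumed threshold: more than half of the sampled units.
        return zeroHighBytes * 2 > units;
    }

    // Returns the data re-encoded as UTF-8 if it has a UTF-16LE BOM or
    // passes the heuristic; otherwise returns the input unchanged.
    public static byte[] toUtf8(byte[] data) {
        boolean hasBom = data.length >= 2
                && data[0] == (byte) 0xFF
                && data[1] == (byte) 0xFE;
        if (!hasBom && !looksLikeUtf16Le(data)) {
            return data;
        }
        // The explicit UTF-16LE charset does not consume a BOM, so skip it by hand.
        int offset = hasBom ? 2 : 0;
        String text = new String(data, offset, data.length - offset,
                StandardCharsets.UTF_16LE);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Decoding with an explicit `UTF_16LE` charset avoids copying the data into a new array just to tack on a BOM. Characters such as the euro sign have a non-zero high byte, which is why the sketch checks a proportion of the sample rather than requiring every high byte to be zero.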
Tags
<java><utf-8><utf-16><byte-order-mark>
Title
How can I identify different encodings without the use of a BOM?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USeyberg
UserOwnerUserId
1. USeyberg
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. This table or related slice is empty.
CommentsPostId
