
Is there a standard technique for packing binary data into a UTF-16 string?
(In .NET) I have arbitrary binary data stored in a `byte[]` (an image, for example). Now, I need to store that data in a `string` (a "Comment" field of a legacy API). Is there a standard technique for **packing** this binary data into a `string`? By "packing" I mean that for any reasonably large and random data set, `bytes.Length/2` is about the same as `packed.Length`, because two bytes are more-or-less a single character.

The two "obvious" answers don't meet all the criteria:

```csharp
string base64 = System.Convert.ToBase64String(bytes);
```

doesn't make very efficient use of the `string`, since it only uses 64 characters out of the roughly 60,000 available (my storage is a `System.String`). Going with

```csharp
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
```

makes better use of the `string`, but it won't work for data that contains invalid Unicode characters (say, mismatched surrogate pairs). [This MSDN article](http://msdn.microsoft.com/en-us/library/ms172827.aspx) shows this exact (poor) technique.

Let's look at a simple example:

```csharp
byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00 };
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);
```

In this case `bytes` and `utf16_bytes` are the same, because the original `bytes` were a UTF-16 string. Doing the same procedure with Base64 encoding gives a 16-member `base64_bytes` array.

Now, repeat the procedure with invalid UTF-16 data:

```csharp
byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };
```

You'll find that `utf16_bytes` do not match the original data (a complete round-trip check is sketched below).

I've written code that uses U+FFFD as an escape before invalid Unicode characters (a rough sketch of the idea also appears below); it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. Not to mention, I don't like catching the `DecoderFallbackException` as the way of detecting invalid characters.

I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow [Shawn Steele's advice](http://blogs.msdn.com/shawnste/archive/2005/09/26/474105.aspx) and pass around `byte[]`.

---

~~I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one that came close to suggesting a "standard technique".~~

---

**Edit:** [base16k](http://www.unicode.org/mail-arch/unicode-ml/y2004-m05/1671.html) [looks](http://sites.google.com/site/markusicu/unicode/base16k) even better. Jim Beveridge has an [implementation](http://qualapps.blogspot.com/2011/11/base64-for-unicode-utf16.html).
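For concreteness, here is a minimal, self-contained sketch of the round-trip check described above (the class and method names are just for illustration). On current .NET, `Encoding.Unicode` uses a replacement fallback by default, so the lone high surrogate in the second input comes back as U+FFFD and the bytes no longer match:

```csharp
using System;
using System.Linq;
using System.Text;

class RoundTripDemo
{
    // Decode the bytes as UTF-16LE, re-encode, and compare with the input.
    static bool RoundTrips(byte[] bytes)
    {
        string utf16 = Encoding.Unicode.GetString(bytes);
        byte[] utf16_bytes = Encoding.Unicode.GetBytes(utf16);
        return utf16_bytes.SequenceEqual(bytes);
    }

    static void Main()
    {
        // Valid UTF-16LE ("A1"): survives the round trip, prints True.
        Console.WriteLine(RoundTrips(new byte[] { 0x41, 0x00, 0x31, 0x00 }));

        // 'A' followed by a lone high surrogate (0xD800): the default
        // replacement fallback substitutes U+FFFD, so this prints False.
        Console.WriteLine(RoundTrips(new byte[] { 0x41, 0x00, 0x00, 0xD8 }));
    }
}
```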
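And here is a rough sketch of the U+FFFD-escape idea mentioned in the question. It is an illustration of the approach, not the original code: every 2 bytes become one `char`, and any code unit that would be a lone surrogate, or U+FFFD itself, is shifted into an ordinary BMP range and prefixed with U+FFFD, so the packed string contains no unpaired surrogates. For brevity it assumes an even byte count.

```csharp
using System.Collections.Generic;
using System.Text;

static class SurrogateEscapePacker
{
    // U+FFFD marks an escape; an escaped code unit is stored shifted down by
    // 0xD000, so the packed string never contains lone surrogates or a bare U+FFFD.
    const char Escape = '\uFFFD';
    const int Shift = 0xD000;

    public static string Pack(byte[] bytes)
    {
        // Sketch only: assumes bytes.Length is even. A real implementation
        // would also record a padding indicator for odd-length input.
        var sb = new StringBuilder(bytes.Length / 2 + 1);
        for (int i = 0; i < bytes.Length; i += 2)
        {
            char c = (char)(bytes[i] | (bytes[i + 1] << 8));    // little-endian pair
            if (char.IsSurrogate(c) || c == Escape)
            {
                sb.Append(Escape);
                // Surrogates map to 0x0800..0x0FFF, U+FFFD maps to 0x2FFD.
                sb.Append((char)(c - Shift));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static byte[] Unpack(string packed)
    {
        var bytes = new List<byte>(packed.Length * 2);
        for (int i = 0; i < packed.Length; i++)
        {
            char c = packed[i];
            if (c == Escape)
            {
                c = (char)(packed[++i] + Shift);    // undo the shift
            }
            bytes.Add((byte)(c & 0xFF));
            bytes.Add((byte)(c >> 8));
        }
        return bytes.ToArray();
    }
}
```

The expansion is at most 2x for pathological input and negligible for typical data, which keeps `packed.Length` close to `bytes.Length/2` as required.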
 
