Float32 to Float16
<p>Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?</p>
<p>(s = sign, e = exponent, and m = mantissa)</p>
<p>If 32-bit float is 1s7e24m<br>
And 16-bit float is 1s5e10m</p>
<p>Then is it as simple as doing?</p>
<pre><code>int fltInt32;
short fltInt16;
memcpy( &amp;fltInt32, &amp;flt, sizeof( float ) );

fltInt16 = (fltInt32 &amp; 0x00FFFFFF) &gt;&gt; 14;
fltInt16 |= ((fltInt32 &amp; 0x7f000000) &gt;&gt; 26) &lt;&lt; 10;
fltInt16 |= ((fltInt32 &amp; 0x80000000) &gt;&gt; 16);
</code></pre>
<p>I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?</p>
<p>Edit: I can see I've got my exponent shift wrong ... so would THIS be better?</p>
<pre><code>fltInt16 = (fltInt32 &amp; 0x007FFFFF) &gt;&gt; 13;
fltInt16 |= (fltInt32 &amp; 0x7c000000) &gt;&gt; 13;
fltInt16 |= (fltInt32 &amp; 0x80000000) &gt;&gt; 16;
</code></pre>
<p>I'm hoping this is correct. Apologies if I'm missing something obvious that has already been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)</p>
<p>Edit 2: Oops. Buggered it again. I want to lose the top 3 bits, not the lower! So how about this:</p>
<pre><code>fltInt16 = (fltInt32 &amp; 0x007FFFFF) &gt;&gt; 13;
fltInt16 |= (fltInt32 &amp; 0x0f800000) &gt;&gt; 13;
fltInt16 |= (fltInt32 &amp; 0x80000000) &gt;&gt; 16;
</code></pre>
<p><strong>Final code should be</strong>:</p>
<pre><code>fltInt16 = ((fltInt32 &amp; 0x7fffffff) &gt;&gt; 13) - (0x38000000 &gt;&gt; 13);
fltInt16 |= ((fltInt32 &amp; 0x80000000) &gt;&gt; 16);
</code></pre>
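For comparison, here is a minimal sketch of the same idea with the edge cases spelled out. It assumes IEEE 754 layouts, where a 32-bit float is actually 1s8e23m and a 16-bit float is 1s5e10m, so the exponent has to be re-biased from 127 to 15 (the `0x38000000` in the final code above is that bias difference, 112, shifted into place). The function name `float_to_half_bits` is hypothetical, and the simplifications are deliberate: the mantissa is truncated rather than rounded, values too small for a normal half are flushed to signed zero, and values too large become infinity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the binary16 bit pattern for a binary32 value.
 * Assumptions: IEEE 754 floats, truncation instead of rounding,
 * subnormal results flushed to signed zero, overflow mapped to infinity. */
static uint16_t float_to_half_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);            /* safe type pun */

    uint16_t sign     = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exponent = (bits >> 23) & 0xFFu;      /* 8-bit biased exponent */
    uint32_t mantissa = bits & 0x007FFFFFu;        /* 23-bit mantissa */

    if (exponent == 0xFFu) {
        /* Infinity or NaN: keep NaN as NaN by forcing a mantissa bit. */
        return (uint16_t)(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
    }

    /* Re-bias the exponent: remove the float bias (127), add the half bias (15). */
    int32_t half_exponent = (int32_t)exponent - 127 + 15;

    if (half_exponent >= 31) {
        return (uint16_t)(sign | 0x7C00u);         /* too large: infinity */
    }
    if (half_exponent <= 0) {
        return sign;                               /* too small: signed zero */
    }

    /* Normal range: drop the low 13 mantissa bits (truncation). */
    return (uint16_t)(sign | ((uint32_t)half_exponent << 10) | (mantissa >> 13));
}
```

Under these assumptions, `float_to_half_bits(1.0f)` yields `0x3C00` and `float_to_half_bits(-2.0f)` yields `0xC000`. For inputs in the normal half range the result matches the two-line final code above; the extra branches only matter for zero, very small, very large, infinite, and NaN inputs.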