<p>Floating point values (IEEE754 ones, anyway) basically have three components:</p>
<ul>
<li>a sign <code>s</code>;</li>
<li>a series of exponent bits <code>e</code>; and</li>
<li>a series of mantissa bits <code>m</code>.</li>
</ul>
<p>The precision dictates how many bits are available for the exponent and mantissa. Let's examine the value 0.1 for single-precision floating point:</p>
<pre><code>s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 01111011 10011001100110011001101
           ||||||||||||||||||||||+- 8388608
           |||||||||||||||||||||+-- 4194304
           ||||||||||||||||||||+--- 2097152
           |||||||||||||||||||+---- 1048576
           ||||||||||||||||||+----- 524288
           |||||||||||||||||+------ 262144
           ||||||||||||||||+------- 131072
           |||||||||||||||+-------- 65536
           ||||||||||||||+--------- 32768
           |||||||||||||+---------- 16384
           ||||||||||||+----------- 8192
           |||||||||||+------------ 4096
           ||||||||||+------------- 2048
           |||||||||+-------------- 1024
           ||||||||+--------------- 512
           |||||||+---------------- 256
           ||||||+----------------- 128
           |||||+------------------ 64
           ||||+------------------- 32
           |||+-------------------- 16
           ||+--------------------- 8
           |+---------------------- 4
           +----------------------- 2
</code></pre>
<p>The sign is positive, that's pretty easy.</p>
<p>The exponent bits sum to <code>64+32+16+8+2+1 = 123</code>; subtracting the bias of <code>127</code> gives <code>-4</code>, so the multiplier is 2<sup>-4</sup> or <code>1/16</code>. The bias lets the unsigned exponent field cover negative exponents, so that you can get really small numbers (like 10<sup>-30</sup>) as well as large ones.</p>
<p>The mantissa is chunky.
It consists of <code>1</code> (the implicit base) plus the value of every set bit, where bit <code>n</code> is worth 1/(2<sup>n</sup>) and <code>n</code> starts at <code>1</code> on the left and increases to the right: <code>{1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}</code>.</p>
<p>When you add all these up, you get <code>1.60000002384185791015625</code>.</p>
<p>When you multiply that by the 2<sup>-4</sup> multiplier, you get <code>0.100000001490116119384765625</code>, which is why they say you cannot represent <code>0.1</code> exactly as an IEEE754 float.</p>
<p>In terms of <em>converting</em> integers to floats, if the mantissa (including the implicit 1) has at least as many bits as the integer, you can just transfer the integer bit pattern over and select the correct exponent. There will be no loss of precision. For example, a double precision IEEE754 value (64 bits, 52 of them stored mantissa bits, 53 counting the implicit one) has no problem taking on a 32-bit integer.</p>
<p>If there are more bits in your integer than in the mantissa (such as a 32-bit integer and a 32-bit single precision float, which only has 23 stored mantissa bits, 24 counting the implicit one) then you need to scale the integer.</p>
<p>This involves dropping the least significant bits (rounding rather than simply truncating) so that the value fits into the mantissa bits. That involves loss of precision of course, but that's unavoidable.</p>
<hr>
<p>Let's have a look at a specific value, <code>123456789</code>.
The following program dumps the bits of each data type.</p>
<pre><code>#include &lt;stdio.h&gt;

/* Print the bytes of an object as bits, starting at the highest address.
   On a little-endian machine that shows the most significant byte first. */
static void dumpBits (const char *desc, const unsigned char *addr, size_t sz) {
    unsigned char mask;
    printf ("%s:\n ", desc);
    while (sz-- != 0) {
        putchar (' ');
        /* Walk the current byte from its top bit down to its bottom bit. */
        for (mask = 0x80; mask &gt; 0; mask &gt;&gt;= 1)
            if ((addr[sz] &amp; mask) == 0)
                putchar ('0');
            else
                putchar ('1');
    }
    putchar ('\n');
}

int main (void) {
    int intNum = 123456789;
    float fltNum = intNum;
    double dblNum = intNum;

    printf ("%d %f %f\n",intNum, fltNum, dblNum);
    dumpBits ("integer", (unsigned char *)(&amp;intNum), sizeof (int));
    dumpBits ("float", (unsigned char *)(&amp;fltNum), sizeof (float));
    dumpBits ("double", (unsigned char *)(&amp;dblNum), sizeof (double));

    return 0;
}
</code></pre>
<p>The output on my system is as follows:</p>
<pre><code>123456789 123456792.000000 123456789.000000
integer:
  00000111 01011011 11001101 00010101
float:
  01001100 11101011 01111001 10100011
double:
  01000001 10011101 01101111 00110100 01010100 00000000 00000000 00000000
</code></pre>
<p>And we'll look at these one at a time. First the integer, simple powers of two:</p>
<pre><code>00000111 01011011 11001101 00010101
     |||  | || || ||  || |    | | +-&gt;          1
     |||  | || || ||  || |    | +---&gt;          4
     |||  | || || ||  || |    +-----&gt;         16
     |||  | || || ||  || +----------&gt;        256
     |||  | || || ||  |+------------&gt;       1024
     |||  | || || ||  +-------------&gt;       2048
     |||  | || || |+----------------&gt;      16384
     |||  | || || +-----------------&gt;      32768
     |||  | || |+-------------------&gt;      65536
     |||  | || +--------------------&gt;     131072
     |||  | |+----------------------&gt;     524288
     |||  | +-----------------------&gt;    1048576
     |||  +-------------------------&gt;    4194304
     ||+----------------------------&gt;   16777216
     |+-----------------------------&gt;   33554432
     +------------------------------&gt;   67108864
                                      ==========
                                       123456789
</code></pre>
<p>Now let's look at the single precision float.
Notice that the mantissa bit pattern is a near-perfect match for the integer:</p>
<pre><code>mantissa:       11 01011011 11001101 00011    (spaced out).
integer:  00000111 01011011 11001101 00010101 (untouched).
</code></pre>
<p>There's an <em>implicit</em> <code>1</code> bit to the left of the mantissa and it's also been rounded at the other end, which is where that loss of precision comes from (the value changing from <code>123456789</code> to <code>123456792</code> as in the output from that program above).</p>
<p>Working out the values:</p>
<pre><code>s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10011001 11010110111100110100011
           || | || ||||  || |   |+- 8388608
           || | || ||||  || |   +-- 4194304
           || | || ||||  || +------ 262144
           || | || ||||  |+-------- 65536
           || | || ||||  +--------- 32768
           || | || |||+------------ 4096
           || | || ||+------------- 2048
           || | || |+-------------- 1024
           || | || +--------------- 512
           || | |+----------------- 128
           || | +------------------ 64
           || +-------------------- 16
           |+---------------------- 4
           +----------------------- 2
</code></pre>
<p>The sign is positive. The exponent bits sum to <code>128+16+8+1 = 153</code>; subtracting the bias of <code>127</code> gives <code>26</code>, so the multiplier is 2<sup>26</sup> or <code>67108864</code>.</p>
<p>The mantissa is <code>1</code> (the implicit base) plus, as explained above, the values of the set bits: <code>{1/2, 1/4, 1/16, 1/64, 1/128, 1/512, 1/1024, 1/2048, 1/4096, 1/32768, 1/65536, 1/262144, 1/4194304, 1/8388608}</code>.
When you add all these up, you get <code>1.83964955806732177734375</code>.</p>
<p>When you multiply that by the 2<sup>26</sup> multiplier, you get <code>123456792</code>, the same as the program output.</p>
<p>The double bit pattern is:</p>
<pre><code>s eeeeeeeeeee mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 10000011001 1101011011110011010001010100000000000000000000000000
</code></pre>
<p>I am <em>not</em> going to go through the process of figuring out the value of that beast :-) However, I <em>will</em> show the mantissa next to the integer format to show the common bit representation:</p>
<pre><code>mantissa:       11 01011011 11001101 00010101 000...000 (spaced out).
integer:  00000111 01011011 11001101 00010101           (untouched).
</code></pre>
<p>You can once again see the commonality with the implicit bit on the left and the vastly greater bit availability on the right, which is why there's no loss of precision in this case.</p>
<hr>
<p>In terms of converting between floats and doubles, that's also reasonably easy to understand.</p>
<p>You first have to check the <em>special</em> values such as NaN and the infinities. These are indicated by special exponent/mantissa combinations and it's probably easier to detect these up front and generate the equivalent in the new format.</p>
<p>Then in the case where you're going from double to float, you obviously have less of a range available to you since there are fewer bits in the exponent. If your double is outside the range of a float, you need to handle that.</p>
<p>Assuming it will fit, you then need to:</p>
<ul>
<li>rebase the exponent (the bias is different for the two types);</li>
<li>copy as many bits from the mantissa as will fit (rounding if necessary); and</li>
<li>pad out the rest of the target mantissa (if any) with zero bits.</li>
</ul>
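<p>The double-to-float steps above can be sketched in C roughly as follows. This is only an illustration, not how any particular compiler or library does it: the function name <code>narrow</code> is made up for this example, it assumes <code>float</code> and <code>double</code> are the IEEE754 32-bit and 64-bit formats, it rounds to nearest with ties going to the even mantissa, and for brevity it flushes results too small for a normal float to zero rather than producing denormals.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a library function): narrow a double to a float
   by working on the raw bits. Assumes IEEE754 binary64/binary32 layouts. */
static float narrow(double d) {
    uint64_t in;
    uint32_t out;
    float f;

    assert(sizeof(double) == sizeof in && sizeof(float) == sizeof out);
    memcpy(&in, &d, sizeof in);

    uint32_t sign = (uint32_t)(in >> 63) << 31;   /* sign moves to bit 31    */
    int      exp  = (int)((in >> 52) & 0x7FF);    /* 11 biased exponent bits */
    uint64_t man  = in & 0xFFFFFFFFFFFFFULL;      /* 52 stored mantissa bits */

    if (exp == 0x7FF) {
        /* Special values first: infinity (mantissa zero) or NaN.
           The NaN payload is not preserved, only the quiet bit is set. */
        out = sign | 0x7F800000u | (man != 0 ? 0x00400000u : 0u);
    } else {
        int      e    = exp - 1023 + 127;         /* rebase the exponent     */
        uint32_t m    = (uint32_t)(man >> 29);    /* keep the top 23 bits    */
        uint64_t rest = man & 0x1FFFFFFFULL;      /* the 29 bits that drop   */

        /* Round to nearest; on an exact tie, round to the even mantissa. */
        if (rest > 0x10000000ULL || (rest == 0x10000000ULL && (m & 1u))) {
            m++;
            if (m == 0x800000u) {                 /* mantissa overflowed     */
                m = 0;
                e++;
            }
        }

        if (e >= 255)
            out = sign | 0x7F800000u;             /* out of range: infinity  */
        else if (e <= 0)
            out = sign;                           /* simplification: zero    */
        else
            out = sign | ((uint32_t)e << 23) | m;
    }

    memcpy(&f, &out, sizeof f);
    return f;
}
```

<p>Fed the two values examined earlier, <code>narrow(0.1)</code> should produce the same float as <code>0.1f</code>, and <code>narrow(123456789.0)</code> should produce <code>123456792</code>, matching the rounding seen in the program output. Going the other way (float to double) needs no rounding: the exponent is rebased and the extra mantissa bits are simply zero.</p>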
 
