Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>As Alexandre C. explains, <a href="http://en.wikipedia.org/wiki/IEEE_754-2008" rel="nofollow">IEEE doubles</a> have a 53 bit mantissa (52 stored and top bit implied), and float has 24 bits (23 bits stored, and top bit implied).</p> <p>Edit: (Thanks for feedback, I hope this is clearer)</p> <p>When an integer is converted to a double <code>double f = (double)1024;</code>, the number is held with an appropriate exponent (1023+10), and the <em>same</em> bit pattern is effectively stored as the original integer (Actually IEEE binary floating point does not store the top bit. IEEE floating point numbers are 'normalised' to have the top bit = 1, by adjusting the exponent, then the top 1 is trimmed off because it is 'implied', which saves a bit of storage).</p> <p>A 32bit integer will require a double to hold its value <em>perfectly</em>, and an 8bit integer will be held <em>perfectly</em> in a float. There is no loss of information there. It can be converted back to an integer without loss. Loss happens with arithmetic, and fractional values.</p> <p>The integer is not mapped to +/-1 unless code does it. When code divides that 32bit integer, stored as a double, to map it to the range +/-1, then error will very likely be introduced.</p> <p>That mapping to +/-1 will loose some of the 53bit precision, but the error will only be in the lowest bits, well below the 32bits needed for the original integer. Subsequent operations might also lose precision. For example multiplying two numbers with a resulting range of more than 53 bits of precision will lose some bits (i.e. multiply two numbers with mantissa's more than 27 significant bits). </p> <p>An explanation of floating point which might be helpful is <a href="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html" rel="nofollow">"What Every Computer Scientist Should Know About Floating-Point Arithmetic"</a> It explains some of the counter-intuitive (to me) behaviour of floating point numbers. </p> <p>For example, the number 0.1 can <strong>not</strong> be held exactly in an IEEE binary floating point double.</p> <p>This program might help you see what is happening:</p> <pre><code>/* Demonstrate IEEE 'double' encoding on x86 * Show bit patterns and 'printf' output for double values * Show error representing 0.1, and accumulated error of adding 0.1 many times * G Bulmer 2012 */ #include &lt;stdio.h&gt; typedef struct { unsigned long long mantissa :52; unsigned exponent :11; unsigned sign :1; } double_bits; const unsigned exponent_offset = 1023; typedef union { double d; unsigned long long l; double_bits b; } Xlate; void print_xlate(Xlate val) { const long long IMPLIED = (1LL&lt;&lt;52); if (val.b.exponent == 0) { /* zero? */ printf("val: d: %19lf bits: %016llX [sign: %u exponent: zero=%u mantissa: %llX]\n", val.d, val.l, val.b.sign, val.b.exponent, val.b.mantissa); } else { printf("val: d: %19lf bits: %016llX [sign: %u exponent: 2^%4-d mantissa: %llX]\n", val.d, val.l, val.b.sign, ((int)val.b.exponent)-exponent_offset, (IMPLIED|val.b.mantissa)); } } double add_many(double d, int many) { double accum = 0.0; while (many-- &gt; 0) { /* only works for +d */ accum += d; } return accum; } int main (int argc, const char * argv[]) { Xlate val; val.b.sign = 0; val.b.exponent = exponent_offset+1; val.b.mantissa = 0; print_xlate(val); val.d = 1.0; print_xlate(val); val.d = 0.0; print_xlate(val); val.d = -1.0; print_xlate(val); val.d = 3.0; print_xlate(val); val.d = 7.0; print_xlate(val); val.d = (double)((1LL&lt;&lt;31)-1LL); print_xlate(val); val.d = 2147483647.0; print_xlate(val); val.d = 10000.0; print_xlate(val); val.d = 100000.0; print_xlate(val); val.d = 1000000.0; print_xlate(val); val.d = 0.1; print_xlate(val); val.d = add_many(0.1, 100000); print_xlate(val); val.d = add_many(0.1, 1000000); print_xlate(val); val.d = add_many(0.1, 10000000); print_xlate(val); val.d = add_many(0.1,10); print_xlate(val); val.d *= 2147483647.0; print_xlate(val); int i = val.d; printf("int i=truncate(d)=%d\n", i); int j = lround(val.d); printf("int i=lround(d)=%d\n", j); val.d = add_many(0.0001,1000)-0.1; print_xlate(val); return 0; } </code></pre> <p>The output is:</p> <pre><code>val: d: 2.000000 bits: 4000000000000000 [sign: 0 exponent: 2^1 mantissa: 10000000000000] val: d: 1.000000 bits: 3FF0000000000000 [sign: 0 exponent: 2^0 mantissa: 10000000000000] val: d: 0.000000 bits: 0000000000000000 [sign: 0 exponent: zero=0 mantissa: 0] val: d: -1.000000 bits: BFF0000000000000 [sign: 1 exponent: 2^0 mantissa: 10000000000000] val: d: 3.000000 bits: 4008000000000000 [sign: 0 exponent: 2^1 mantissa: 18000000000000] val: d: 7.000000 bits: 401C000000000000 [sign: 0 exponent: 2^2 mantissa: 1C000000000000] val: d: 2147483647.000000 bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFC00000] val: d: 2147483647.000000 bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFC00000] val: d: 10000.000000 bits: 40C3880000000000 [sign: 0 exponent: 2^13 mantissa: 13880000000000] val: d: 100000.000000 bits: 40F86A0000000000 [sign: 0 exponent: 2^16 mantissa: 186A0000000000] val: d: 1000000.000000 bits: 412E848000000000 [sign: 0 exponent: 2^19 mantissa: 1E848000000000] val: d: 0.100000 bits: 3FB999999999999A [sign: 0 exponent: 2^-4 mantissa: 1999999999999A] val: d: 10000.000000 bits: 40C388000000287A [sign: 0 exponent: 2^13 mantissa: 1388000000287A] val: d: 100000.000001 bits: 40F86A00000165CB [sign: 0 exponent: 2^16 mantissa: 186A00000165CB] val: d: 999999.999839 bits: 412E847FFFEAE4E9 [sign: 0 exponent: 2^19 mantissa: 1E847FFFEAE4E9] val: d: 1.000000 bits: 3FEFFFFFFFFFFFFF [sign: 0 exponent: 2^-1 mantissa: 1FFFFFFFFFFFFF] val: d: 2147483647.000000 bits: 41DFFFFFFFBFFFFF [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFBFFFFF] int i=truncate(d)=2147483646 int i=lround(d)=2147483647 val: d: 0.000000 bits: 3CE0800000000000 [sign: 0 exponent: 2^-49 mantissa: 10800000000000] </code></pre> <p>That shows a full 32-bit int is represented exactly, and 0.1 is not. It shows that printf does not print exactly the floating point number but rounds or truncates (a thing to be wary of). It also illustrates that the amount of error in that representation of 0.1 doesn't accumulate to a large enough value in 1,000,000 add operations to cause printf to print it. It shows that the original integer can be recovered by rounding, but not assignment because assignment truncates. It shows that the subtraction operation can 'amplify' error (all that is left after that subtraction is error), and hence arithmetic should be carefully analysed.</p> <p>To put this into the context of music, where the sample rate might be 96KHz. It would take more than 10 seconds of additions before the error had built up enough for it to cause the top 32bits to contain more than 1 bit in error. </p> <p>Further. Christopher “Monty” Montgomery who created Ogg and Vorbis argues that 24 bits should be more than enough for audio in an article on music, sampling rate and sample resolution <a href="http://people.xiph.org/~xiphmont/demo/neil-young.html" rel="nofollow">24/192 Music Downloads ...and why they make no sense</a></p> <p><strong>Summary</strong> <br/> double holds 32-bit integers perfectly. There are rational decimal numbers of the form N/M (where M and N can be represented by a 32bit integer) which can <strong>not</strong> be represented by a finite sequence of binary-fraction bits. So, when an integer is mapped to the range +/-1, and hence is converted to a rational number (N/M) some numbers can not be represented by the finite number of bits in a double's fractional part, so errors will creep in. </p> <p>Those errors are typically very small, in the lowest bits, hence well below the upper 32 bits. So they can be converted back and forth between integer and double using rounding, and the error of double representation will not cause the integer to be wrong. BUT, arithmetic can change error. Incorrectly constructed arithmetic can cause the errors to grow rapidly, and could grow to a magnitude where the original integer value has been corrupted.</p> <p><em>Other thoughts: If precision is critical, there are other ways you might use doubles. None of them are as convenient as mapping to +/-1. Everything I can think of would require the arithmetic operations to be tracked, which would best be done using C++ wrapper classes. This would dramatically slow calculation, so may be pointless.</em></p> <p><em>This is a very sneaky way of doing <a href="http://blog.sigfpe.com/2005/07/automatic-differentiation.html" rel="nofollow">'Automatic Diferentiation'</a> by wrapping arithmetic in classes which keep track of extra information. I think the ideas in there might inspire an approach. It might even help identify where precision is lost.</em> </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload