<p>Using <code>gcc</code> and applying a diff to the generated assembly yields only this difference:</p>
<pre><code>73c68,69
&lt; movss LCPI1_0(%rip), %xmm1
---
&gt; movabsq $0, %rcx
&gt; cvtsi2ssq %rcx, %xmm1
81d76
&lt; subss %xmm1, %xmm0
</code></pre>
<p>The <code>cvtsi2ssq</code> version is indeed 10 times slower.</p>
<p>Apparently, the <code>float</code> version uses an <a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers" rel="noreferrer">XMM</a> register loaded from memory, while the <code>int</code> version converts a real <code>int</code> value 0 to <code>float</code> using the <code>cvtsi2ssq</code> instruction, which takes a lot of time. Passing <code>-O3</code> to gcc doesn't help. (gcc version 4.2.1.)</p>
<p>(Using <code>double</code> instead of <code>float</code> doesn't matter, except that it changes the <code>cvtsi2ssq</code> into a <code>cvtsi2sdq</code>.)</p>
<p><strong>Update</strong> </p>
<p>Some extra tests show that the cause is not necessarily the <code>cvtsi2ssq</code> instruction. Once it is eliminated (by using <code>int ai=0;float a=ai;</code> and using <code>a</code> instead of <code>0</code>), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between <code>0</code> and <code>0.1f</code>. The turning point in the above code is approximately at <code>0.00000000000000000000000000000001</code>, where the loop suddenly takes 10 times as long.</p>
<p><strong>Update&lt;&lt;1</strong> </p>
<p>A small visualisation of this interesting phenomenon:</p>
<ul>
<li>Column 1: a float, divided by 2 for every iteration</li>
<li>Column 2: the binary representation of this float</li>
<li>Column 3: the time taken to sum this float 1e7 times</li>
</ul>
<p>You can clearly see the exponent (in the last 9 bits, which also hold the sign bit) change to its lowest value when denormalization sets in.
At that point, simple addition becomes 20 times slower.</p>
<pre><code>0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
</code></pre>
<p>An equivalent discussion about ARM can be found in Stack&nbsp;Overflow question <em><a href="https://stackoverflow.com/questions/9350810/denormalized-floating-point-in-objective-c/9350820#9350820">Denormalized floating point in Objective-C?</a></em>.</p>
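<p>The three columns of the table can be reproduced with a few helper functions. What follows is only a minimal sketch, not the code that produced the numbers above: the function names are hypothetical, it assumes a 32-bit IEEE 754 <code>float</code>, and it assumes the floating-point unit is not running in a flush-to-zero mode. The bits are written least significant first, which appears to be the order used in the table.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Column 2: write the 32 bits of x into out, least significant bit first.
   (Hypothetical helper; assumes float is a 32-bit IEEE 754 single.) */
void float_bits_lsb_first(float x, char out[33]) {
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &x, sizeof u);              /* well-defined way to read the bits */
    for (int i = 0; i < 32; i++) {
        out[i] = ((u >> i) & 1u) ? '1' : '0';
    }
    out[32] = '\0';
}

/* A float is denormalized (subnormal) when its 8-bit exponent field is
   all zeros while its 23-bit mantissa is not. */
int is_subnormal(float x) {
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &x, sizeof u);
    uint32_t exponent = (u >> 23) & 0xFFu;
    uint32_t mantissa = u & 0x7FFFFFu;
    return exponent == 0 && mantissa != 0;
}

/* Column 1: halve start repeatedly and count the halvings needed before
   the value turns subnormal (or zero). The cap guards against inputs
   such as infinity that never shrink. */
int halvings_until_subnormal(float start) {
    volatile float x = start;              /* volatile: keep true float precision */
    int n = 0;
    while (n < 300 && x != 0.0f && !is_subnormal(x)) {
        x = x / 2.0f;
        n++;
    }
    return n;
}

/* Column 3: CPU time in milliseconds spent adding x to an accumulator
   n times. The volatile accumulator keeps the loop from being optimized
   away; the result depends entirely on the machine. */
double time_sum_ms(float x, long n) {
    volatile float acc = 0.0f;
    clock_t t0 = clock();
    for (long i = 0; i < n; i++) {
        acc += x;
    }
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

<p>Under these assumptions, <code>halvings_until_subnormal(1e-34f)</code> should return 14, which agrees with the table: the 15th row is the first slow one. Comparing <code>time_sum_ms(1e-37f, 10000000)</code> with <code>time_sum_ms(1e-39f, 10000000)</code> is one way to check whether a given CPU and compiler show the same slowdown; the ratio is not guaranteed to match the figures above.</p>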
 
