StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Using Intels icc Version 11.1 on a 2.4GHz Intel Core 2 Duo I get</p> <pre><code>Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000 addmul: 0.105 s, 9.525 Gflops, res=0.000000 Macintosh:~ mackie$ icc -v Version 11.1 </code></pre> <p>That is very close to the ideal 9.6 Gflops.</p> <p>EDIT:</p> <p>Oops, looking at the assembly code it seems that icc not only vectorized the multiplication, but also pulled the additions out of the loop. Forcing a stricter fp semantics the code is no longer vectorized: </p> <pre><code>Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc -fp-model precise && ./addmul 1000 addmul: 0.516 s, 1.938 Gflops, res=1.326463 </code></pre> <p>EDIT2:</p> <p>As requested:</p> <pre><code>Macintosh:~ mackie$ clang -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000 addmul: 0.209 s, 4.786 Gflops, res=1.326463 Macintosh:~ mackie$ clang -v Apple clang version 3.0 (tags/Apple/clang-211.10.1) (based on LLVM 3.0svn) Target: x86_64-apple-darwin11.2.0 Thread model: posix </code></pre> <p>The inner loop of clang's code looks like this:</p> <pre><code> .align 4, 0x90 LBB2_4: ## =>This Inner Loop Header: Depth=1 addsd %xmm2, %xmm3 addsd %xmm2, %xmm14 addsd %xmm2, %xmm5 addsd %xmm2, %xmm1 addsd %xmm2, %xmm4 mulsd %xmm2, %xmm0 mulsd %xmm2, %xmm6 mulsd %xmm2, %xmm7 mulsd %xmm2, %xmm11 mulsd %xmm2, %xmm13 incl %eax cmpl %r14d, %eax jl LBB2_4 </code></pre> <p>EDIT3:</p> <p>Finally, two suggestions: First, if you like this type of benchmarking, consider using the <code>rdtsc</code> instruction istead of <code>gettimeofday(2)</code>. It is much more accurate and delivers the time in cycles, which is usually what you are interested in anyway. For gcc and friends you can define it like this:</p> <pre><code>#include <stdint.h> static __inline__ uint64_t rdtsc(void) { uint64_t rval; __asm__ volatile ("rdtsc" : "=A" (rval)); return rval; } </code></pre> <p>Second, you should run your benchmark program several times and use the <em>best performance only</em>. In modern operating systems many things happen in parallel, the cpu may be in a low frequency power saving mode, etc. Running the program repeatedly gives you a result that is closer to the ideal case.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload