<p><strong>Short Answer:</strong> It's a compiler hiccup: the x64 optimizer fails to remove a redundant copy.</p>
<hr>
<p><strong>Long Answer:</strong></p>
<p>The x86 version is very slow if SSE2 is disabled, but I'm able to reproduce the results with SSE2 enabled in x86.</p>
<p>If you dive into the assembly of the innermost loop, the x64 version has two extra memory copies at the end.</p>
<p><strong>x86:</strong></p>
<pre><code>$LL71@main:
    movsd   xmm2, QWORD PTR [eax-8]
    movsd   xmm0, QWORD PTR [eax-16]
    movsd   xmm3, QWORD PTR [eax]
    movapd  xmm1, xmm0
    mulsd   xmm0, QWORD PTR __real@3fa60418a0000000
    movapd  xmm7, xmm2
    mulsd   xmm2, QWORD PTR __real@3f95810620000000
    mulsd   xmm7, xmm5
    mulsd   xmm1, xmm4
    addsd   xmm1, xmm7
    movapd  xmm7, xmm3
    mulsd   xmm3, QWORD PTR __real@3fdcccccc0000000
    mulsd   xmm7, xmm6
    add     eax, 24                 ; 00000018H
    addsd   xmm1, xmm7
    addsd   xmm0, xmm2
    movq    QWORD PTR [ecx], xmm1
    addsd   xmm0, xmm3
    movq    QWORD PTR [ecx+8], xmm0
    lea     edx, DWORD PTR [eax-16]
    add     ecx, 16                 ; 00000010H
    cmp     edx, esi
    jne     SHORT $LL71@main
</code></pre>
<p><strong>x64:</strong></p>
<pre><code>$LL175@main:
    movsdx  xmm3, QWORD PTR [rdx-8]
    movsdx  xmm5, QWORD PTR [rdx-16]
    movsdx  xmm4, QWORD PTR [rdx]
    movapd  xmm2, xmm3
    mulsd   xmm2, xmm6
    movapd  xmm0, xmm5
    mulsd   xmm0, xmm7
    addsd   xmm2, xmm0
    movapd  xmm1, xmm4
    mulsd   xmm1, xmm8
    addsd   xmm2, xmm1
    movsdx  QWORD PTR r$109492[rsp], xmm2
    mulsd   xmm5, xmm9
    mulsd   xmm3, xmm10
    addsd   xmm5, xmm3
    mulsd   xmm4, xmm11
    addsd   xmm5, xmm4
    movsdx  QWORD PTR r$109492[rsp+8], xmm5
    mov     rcx, QWORD PTR r$109492[rsp]
    mov     QWORD PTR [rax], rcx
    mov     rcx, QWORD PTR r$109492[rsp+8]
    mov     QWORD PTR [rax+8], rcx
    add     rax, 16
    add     rdx, 24
    lea     rcx, QWORD PTR [rdx-16]
    cmp     rcx, rbx
    jne     SHORT $LL175@main
</code></pre>
<p>The x64 version has several more (unexplained) moves at the end of the loop. The results are first stored to a stack slot (<code>r$109492[rsp]</code>) and then copied from there to the destination, which looks like some sort of memory-to-memory data copy.</p>
<h1>EDIT:</h1>
<p>It turns out that the x64 optimizer isn't able to optimize out the following copy:</p>
<pre><code>(*i2) = r;
</code></pre>
<p>This is why the inner loop has two extra memory copies. If you change the loop to this:</p>
<pre><code>std::for_each(m.begin(), m.end(),
    [&amp;](const Vector&amp; v)
{
    i2-&gt;x = Dot(axisX, v);
    i2-&gt;y = Dot(axisY, v);
    ++i2;
});
</code></pre>
<p>the copies are eliminated, and the x64 version is just as fast as the x86 version:</p>
<pre><code>x86: 0.0249423
x64: 0.0249348
</code></pre>
<p><strong>Lesson Learned:</strong> Compilers aren't perfect.</p>
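<p>For readers without the original question at hand, here is a minimal sketch of the two loop shapes being compared. The <code>Vector</code> and <code>Vector2</code> structs, the <code>Dot</code> body, and both function names are assumptions made for illustration (the assembly only suggests three doubles in and two doubles out); only the loop bodies mirror the snippets in the EDIT. Both variants should compute the same values; whether the temporary is actually elided depends on the compiler and its version.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the question's types (not shown in this answer).
struct Vector  { double x, y, z; };
struct Vector2 { double x, y; };

inline double Dot(const Vector& a, const Vector& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Variant 1: fill a temporary r, then copy it to the output with (*i2) = r.
// This is the copy the x64 optimizer reportedly failed to remove.
std::vector<Vector2> ProjectViaTemporary(const std::vector<Vector>& m,
                                         const Vector& axisX,
                                         const Vector& axisY) {
    std::vector<Vector2> out(m.size());
    auto i2 = out.begin();
    std::for_each(m.begin(), m.end(), [&](const Vector& v) {
        Vector2 r;
        r.x = Dot(axisX, v);
        r.y = Dot(axisY, v);
        (*i2) = r;
        ++i2;
    });
    assert(i2 == out.end());
    return out;
}

// Variant 2: write each component straight into the output element,
// as in the rewritten loop from the EDIT.
std::vector<Vector2> ProjectDirect(const std::vector<Vector>& m,
                                   const Vector& axisX,
                                   const Vector& axisY) {
    std::vector<Vector2> out(m.size());
    auto i2 = out.begin();
    std::for_each(m.begin(), m.end(), [&](const Vector& v) {
        i2->x = Dot(axisX, v);
        i2->y = Dot(axisY, v);
        ++i2;
    });
    assert(i2 == out.end());
    return out;
}
```

<p>The two functions are semantically identical, so any timing difference between them comes from code generation rather than from the algorithm.</p>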