Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?

Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling of a 64-bit C program.

In particular, I'd like to find the fastest way to test whether a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment.

I am using a PC with an Intel Ivy Bridge Core i7 3770 processor, 32 GB of memory, and the free version of the Microsoft VS2010 C compiler.

My first attempt was:

```c
const char* bytevecM; // 4 GB block of memory, 64-byte aligned
size_t* psz;          // size_t is 64 bits
// ...
// "m7 & 0xffffff80" selects the 128-byte block to test for all zeros
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0]  == 0 && psz[1]  == 0 && psz[2]  == 0 && psz[3]  == 0
 && psz[4]  == 0 && psz[5]  == 0 && psz[6]  == 0 && psz[7]  == 0
 && psz[8]  == 0 && psz[9]  == 0 && psz[10] == 0 && psz[11] == 0
 && psz[12] == 0 && psz[13] == 0 && psz[14] == 0 && psz[15] == 0)
    continue;
// ...
```

VTune profiling of the corresponding assembly follows:

```
cmp qword ptr [rax], 0x0        0.171s
jnz 0x14000222                  42.426s
cmp qword ptr [rax+0x8], 0x0    0.498s
jnz 0x14000222                  0.358s
cmp qword ptr [rax+0x10], 0x0   0.124s
jnz 0x14000222                  0.031s
cmp qword ptr [rax+0x18], 0x0   0.171s
jnz 0x14000222                  0.031s
cmp qword ptr [rax+0x20], 0x0   0.233s
jnz 0x14000222                  0.560s
cmp qword ptr [rax+0x28], 0x0   0.498s
jnz 0x14000222                  0.358s
cmp qword ptr [rax+0x30], 0x0   0.140s
jnz 0x14000222
cmp qword ptr [rax+0x38], 0x0   0.124s
jnz 0x14000222
cmp qword ptr [rax+0x40], 0x0   0.156s
jnz 0x14000222                  2.550s
cmp qword ptr [rax+0x48], 0x0   0.109s
jnz 0x14000222                  0.124s
cmp qword ptr [rax+0x50], 0x0   0.078s
jnz 0x14000222                  0.016s
cmp qword ptr [rax+0x58], 0x0   0.078s
jnz 0x14000222                  0.062s
cmp qword ptr [rax+0x60], 0x0   0.093s
jnz 0x14000222                  0.467s
cmp qword ptr [rax+0x68], 0x0   0.047s
jnz 0x14000222                  0.016s
cmp qword ptr [rax+0x70], 0x0   0.109s
jnz 0x14000222                  0.047s
cmp qword ptr [rax+0x78], 0x0   0.093s
jnz 0x14000222                  0.016s
```
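For comparison only (a minimal sketch, not one of the attempts profiled in this post): the same scalar test can be written without the sixteen early-out branches by OR-ing all sixteen qwords together and comparing once. It reuses `bytevecM` and `m7` from the code above; whether it is actually faster depends on how often a non-zero qword appears early in the block.

```c
// Sketch: accumulate the sixteen 64-bit words with OR, then test once.
// Trades sixteen early-out branches for a single final branch.
const size_t* p = (const size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
size_t acc = 0;
for (int i = 0; i < 16; ++i)   // 16 x 8 bytes = 128 bytes
    acc |= p[i];
if (acc == 0) continue;
// ...
```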
I was able to improve on that first attempt via Intel intrinsics:

```c
const char* bytevecM;                      // 4 GB block of memory
__m128i* psz;                              // __m128i is 128 bits
__m128i one = _mm_set1_epi32(0xffffffff);  // all bits one
// ...
psz = (__m128i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm_testz_si128(psz[0], one) && _mm_testz_si128(psz[1], one)
 && _mm_testz_si128(psz[2], one) && _mm_testz_si128(psz[3], one)
 && _mm_testz_si128(psz[4], one) && _mm_testz_si128(psz[5], one)
 && _mm_testz_si128(psz[6], one) && _mm_testz_si128(psz[7], one))
    continue;
// ...
```

VTune profiling of the corresponding assembly follows:

```
movdqa xmm0, xmmword ptr [rax]        0.218s
ptest xmm0, xmm2                      35.425s
jnz 0x14000ddd                        0.700s
movdqa xmm0, xmmword ptr [rax+0x10]   0.124s
ptest xmm0, xmm2                      0.078s
jnz 0x14000ddd                        0.218s
movdqa xmm0, xmmword ptr [rax+0x20]   0.155s
ptest xmm0, xmm2                      0.498s
jnz 0x14000ddd                        0.296s
movdqa xmm0, xmmword ptr [rax+0x30]   0.187s
ptest xmm0, xmm2                      0.031s
jnz 0x14000ddd
movdqa xmm0, xmmword ptr [rax+0x40]   0.093s
ptest xmm0, xmm2                      2.162s
jnz 0x14000ddd                        0.280s
movdqa xmm0, xmmword ptr [rax+0x50]   0.109s
ptest xmm0, xmm2                      0.031s
jnz 0x14000ddd                        0.124s
movdqa xmm0, xmmword ptr [rax+0x60]   0.109s
ptest xmm0, xmm2                      0.404s
jnz 0x14000ddd                        0.124s
movdqa xmm0, xmmword ptr [rax+0x70]   0.093s
ptest xmm0, xmm2                      0.078s
jnz 0x14000ddd                        0.016s
```

As you can see, there are fewer assembly instructions, and this version also proved faster in timing tests.

Since I am quite weak in the area of Intel SSE/AVX instructions, I welcome advice on how they might be better employed to speed up this code.

Though I scoured the hundreds of available intrinsics, I may have missed the ideal ones. In particular, I was unable to employ _mm_cmpeq_epi64() effectively; I looked for a "not equal" version of this intrinsic (which seems better suited to this problem) but came up dry. Though the code below "works":

```c
if (_mm_testz_si128(
        _mm_andnot_si128(
            _mm_cmpeq_epi64(psz[7],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[6],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[5],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[4],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[3],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[2],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[1],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[0], zero),
            one)), one)), one)), one)), one)), one)), one)),
            one),
        one))
    continue;
```

it is borderline unreadable and (unsurprisingly) proved to be far slower than the two versions given above. I feel sure there must be a more elegant way to employ _mm_cmpeq_epi64() and welcome advice on how that might be achieved.

In addition to using intrinsics from C, raw Intel assembly language solutions to this problem are also welcome.
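One direction that might be worth trying, given as a minimal sketch rather than something benchmarked here: since the block is usually all zero, the eight 16-byte vectors can be OR-ed together first so the common case needs only a single PTEST and branch. The helper name `is_zero_128` is made up for illustration; `psz` is assumed to point at the 64-byte-aligned 128-byte block as in the code above.

```c
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128

// Sketch: OR the eight 16-byte vectors, then do a single PTEST.
static int is_zero_128(const __m128i* psz)
{
    __m128i acc01 = _mm_or_si128(psz[0], psz[1]);
    __m128i acc23 = _mm_or_si128(psz[2], psz[3]);
    __m128i acc45 = _mm_or_si128(psz[4], psz[5]);
    __m128i acc67 = _mm_or_si128(psz[6], psz[7]);
    __m128i acc   = _mm_or_si128(_mm_or_si128(acc01, acc23),
                                 _mm_or_si128(acc45, acc67));
    return _mm_testz_si128(acc, acc);  // 1 when every byte of the block is zero
}
```

The caller would then just write `if (is_zero_128(psz)) continue;`. On the _mm_cmpeq_epi64() question: SSE has no packed integer "not equal" compare, so the usual idiom is to compare for equality against zero and invert the sense of the result afterwards, e.g. via _mm_movemask_epi8(). A minimal sketch for a single 16-byte vector (again, the function name is illustrative only):

```c
// Sketch: _mm_cmpeq_epi64 sets a 64-bit lane to all-ones when it equals zero,
// so "both lanes are zero" appears as a 0xFFFF byte mask.
static int is_zero_16(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Whether either sketch actually beats the PTEST chain profiled above would need to be confirmed with VTune on the same workload.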