Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?

Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling of a 64-bit C program.

In particular, I'd like to find the fastest way to test whether a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment.

I am using a PC with an Intel Ivy Bridge Core i7 3770 processor, 32 GB of memory, and the free version of the Microsoft VS2010 C compiler.

My first attempt was:

```c
const char* bytevecM; // 4 GB block of memory, 64-byte aligned
size_t* psz;          // size_t is 64 bits
// ...
// "m7 & 0xffffff80" selects the 128-byte block to test for all zeros
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0]  == 0 && psz[1]  == 0 && psz[2]  == 0 && psz[3]  == 0
 && psz[4]  == 0 && psz[5]  == 0 && psz[6]  == 0 && psz[7]  == 0
 && psz[8]  == 0 && psz[9]  == 0 && psz[10] == 0 && psz[11] == 0
 && psz[12] == 0 && psz[13] == 0 && psz[14] == 0 && psz[15] == 0)
    continue;
// ...
```

VTune profiling of the corresponding assembly follows:

```
cmp qword ptr [rax], 0x0        0.171s
jnz 0x14000222                  42.426s
cmp qword ptr [rax+0x8], 0x0    0.498s
jnz 0x14000222                  0.358s
cmp qword ptr [rax+0x10], 0x0   0.124s
jnz 0x14000222                  0.031s
cmp qword ptr [rax+0x18], 0x0   0.171s
jnz 0x14000222                  0.031s
cmp qword ptr [rax+0x20], 0x0   0.233s
jnz 0x14000222                  0.560s
cmp qword ptr [rax+0x28], 0x0   0.498s
jnz 0x14000222                  0.358s
cmp qword ptr [rax+0x30], 0x0   0.140s
jnz 0x14000222
cmp qword ptr [rax+0x38], 0x0   0.124s
jnz 0x14000222
cmp qword ptr [rax+0x40], 0x0   0.156s
jnz 0x14000222                  2.550s
cmp qword ptr [rax+0x48], 0x0   0.109s
jnz 0x14000222                  0.124s
cmp qword ptr [rax+0x50], 0x0   0.078s
jnz 0x14000222                  0.016s
cmp qword ptr [rax+0x58], 0x0   0.078s
jnz 0x14000222                  0.062s
cmp qword ptr [rax+0x60], 0x0   0.093s
jnz 0x14000222                  0.467s
cmp qword ptr [rax+0x68], 0x0   0.047s
jnz 0x14000222                  0.016s
cmp qword ptr [rax+0x70], 0x0   0.109s
jnz 0x14000222                  0.047s
cmp qword ptr [rax+0x78], 0x0   0.093s
jnz 0x14000222                  0.016s
```
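For comparison only (a minimal sketch, not one of the attempts profiled in this post): the same scalar test can be written without the sixteen early-out branches by OR-ing all sixteen qwords together and comparing once. It reuses `bytevecM` and `m7` from the code above; whether it is actually faster depends on how often a non-zero qword appears early in the block.

```c
// Sketch: accumulate the sixteen 64-bit words with OR, then test once.
// Trades sixteen early-out branches for a single final branch.
const size_t* p = (const size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
size_t acc = 0;
for (int i = 0; i < 16; ++i)   // 16 x 8 bytes = 128 bytes
    acc |= p[i];
if (acc == 0) continue;
// ...
```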
I was able to improve on that first attempt via Intel intrinsics:

```c
const char* bytevecM;                      // 4 GB block of memory
__m128i* psz;                              // __m128i is 128 bits
__m128i one = _mm_set1_epi32(0xffffffff);  // all bits one
// ...
psz = (__m128i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm_testz_si128(psz[0], one) && _mm_testz_si128(psz[1], one)
 && _mm_testz_si128(psz[2], one) && _mm_testz_si128(psz[3], one)
 && _mm_testz_si128(psz[4], one) && _mm_testz_si128(psz[5], one)
 && _mm_testz_si128(psz[6], one) && _mm_testz_si128(psz[7], one))
    continue;
// ...
```

VTune profiling of the corresponding assembly follows:

```
movdqa xmm0, xmmword ptr [rax]        0.218s
ptest xmm0, xmm2                      35.425s
jnz 0x14000ddd                        0.700s
movdqa xmm0, xmmword ptr [rax+0x10]   0.124s
ptest xmm0, xmm2                      0.078s
jnz 0x14000ddd                        0.218s
movdqa xmm0, xmmword ptr [rax+0x20]   0.155s
ptest xmm0, xmm2                      0.498s
jnz 0x14000ddd                        0.296s
movdqa xmm0, xmmword ptr [rax+0x30]   0.187s
ptest xmm0, xmm2                      0.031s
jnz 0x14000ddd
movdqa xmm0, xmmword ptr [rax+0x40]   0.093s
ptest xmm0, xmm2                      2.162s
jnz 0x14000ddd                        0.280s
movdqa xmm0, xmmword ptr [rax+0x50]   0.109s
ptest xmm0, xmm2                      0.031s
jnz 0x14000ddd                        0.124s
movdqa xmm0, xmmword ptr [rax+0x60]   0.109s
ptest xmm0, xmm2                      0.404s
jnz 0x14000ddd                        0.124s
movdqa xmm0, xmmword ptr [rax+0x70]   0.093s
ptest xmm0, xmm2                      0.078s
jnz 0x14000ddd                        0.016s
```

As you can see, there are fewer assembly instructions, and this version also proved faster in timing tests.

Since I am quite weak in the area of Intel SSE/AVX instructions, I welcome advice on how they might be better employed to speed up this code.

Though I scoured the hundreds of available intrinsics, I may have missed the ideal ones. In particular, I was unable to employ _mm_cmpeq_epi64() effectively; I looked for a "not equal" version of this intrinsic (which seems better suited to this problem) but came up dry. Though the code below "works":

```c
if (_mm_testz_si128(
        _mm_andnot_si128(
            _mm_cmpeq_epi64(psz[7],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[6],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[5],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[4],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[3],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[2],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[1],
            _mm_andnot_si128(_mm_cmpeq_epi64(psz[0], zero),
            one)), one)), one)), one)), one)), one)), one)),
            one),
        one))
    continue;
```

it is borderline unreadable and (unsurprisingly) proved to be far slower than the two versions given above. I feel sure there must be a more elegant way to employ _mm_cmpeq_epi64() and welcome advice on how that might be achieved.

In addition to using intrinsics from C, raw Intel assembly language solutions to this problem are also welcome.
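One direction that might be worth trying, given as a minimal sketch rather than something benchmarked here: since the block is usually all zero, the eight 16-byte vectors can be OR-ed together first so the common case needs only a single PTEST and branch. The helper name `is_zero_128` is made up for illustration; `psz` is assumed to point at the 64-byte-aligned 128-byte block as in the code above.

```c
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128

// Sketch: OR the eight 16-byte vectors, then do a single PTEST.
static int is_zero_128(const __m128i* psz)
{
    __m128i acc01 = _mm_or_si128(psz[0], psz[1]);
    __m128i acc23 = _mm_or_si128(psz[2], psz[3]);
    __m128i acc45 = _mm_or_si128(psz[4], psz[5]);
    __m128i acc67 = _mm_or_si128(psz[6], psz[7]);
    __m128i acc   = _mm_or_si128(_mm_or_si128(acc01, acc23),
                                 _mm_or_si128(acc45, acc67));
    return _mm_testz_si128(acc, acc);  // 1 when every byte of the block is zero
}
```

The caller would then just write `if (is_zero_128(psz)) continue;`. On the _mm_cmpeq_epi64() question: SSE has no packed integer "not equal" compare, so the usual idiom is to compare for equality against zero and invert the sense of the result afterwards, e.g. via _mm_movemask_epi8(). A minimal sketch for a single 16-byte vector (again, the function name is illustrative only):

```c
// Sketch: _mm_cmpeq_epi64 sets a 64-bit lane to all-ones when it equals zero,
// so "both lanes are zero" appears as a 0xFFFF byte mask.
static int is_zero_16(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Whether either sketch actually beats the PTEST chain profiled above would need to be confirmed with VTune on the same workload.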