<p>I see three possibilities here.</p>
<p>First, your data might not suit wide comparisons. If there's a high chance that <code>(*tptr &amp; *xptr) != *tptr</code> within the first few blocks, the plain C++ version will almost certainly be faster. In that case, your SSE version runs through more code and data to accomplish the same thing.</p>
<p>Second, your SSE code may be incorrect; it's not totally clear from what you posted. If <code>no_blocks_</code> is identical between the two samples, then <code>start + i</code> is probably indexing in 128-bit elements rather than the 32-bit elements of the first sample, stepping 16 bytes at a time instead of 4.</p>
<p>Third, SSE <em>really</em> likes it when instructions can be pipelined, and this loop is so short that you might not be getting that benefit. You can reduce branching significantly by processing more than one SSE block per iteration.</p>
<p>Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the <code>block != xblock</code> branch entirely by keeping the state outside of the loop and only testing it at the end. In total, this moves things from 1.3 branches per <code>int</code> to 0.25.</p>
<pre><code>#include &lt;emmintrin.h&gt;  // SSE2 intrinsics

// Assumes a and b are 16-byte aligned and count is a multiple of 8.
// Returns true when every bit of a is also set in b and a differs from b.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    // Accumulated a ^ b for each stream; nonzero means "not identical".
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));
        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));  // all ones where (a &amp; b) == a

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));
        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        // Pack both masks into one register so a single branch covers all 8 ints.
        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    // movemask only reads the top bit of each byte, so compare the accumulated
    // difference against zero first, then check whether any lane was nonzero.
    __m128i diff = _mm_or_si128(eq1, eq2);
    return _mm_movemask_epi8(_mm_cmpeq_epi32(diff, _mm_setzero_si128())) != 0xFFFF;
}
</code></pre>
<p>If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.</p>
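<p>Since the SSE version above is untested, it may help to compare it against a plain scalar loop on the same inputs. The following is only a minimal sketch: the name <code>proper_subset_scalar</code> is hypothetical, and it assumes the intended result is "every bit of <code>a</code> is also set in <code>b</code>, and the two arrays are not identical", which is what the SSE code appears to compute.</p>

```cpp
#include <cstddef>

// Hypothetical scalar reference (not from the original post).
// Returns true when a is a subset of b bit-for-bit and a != b somewhere.
bool proper_subset_scalar(unsigned const *a, unsigned const *b, std::size_t count)
{
    unsigned diff = 0;                  // accumulates a[i] ^ b[i] across the array
    for (std::size_t i = 0; i != count; ++i) {
        if ((a[i] & b[i]) != a[i])      // a has a bit that b lacks: not a subset
            return false;
        diff |= a[i] ^ b[i];            // record any difference without branching
    }
    return diff != 0;                   // subset holds; true only if not identical
}
```

<p>Feeding both functions the same aligned buffers, including cases where the arrays differ only in a low bit, is a cheap way to catch mask-handling mistakes in the vector code.</p>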
 
