Optimizing SSE code
<p>I would like some help optimizing the most computationally intensive function of my program. Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.</p>
<p>The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.</p>
<p>First up is the basic variant. It checks to see if blocks_ is a subset of x.blocks_.</p>
<pre><code>//Check for self comparison
if (this == &amp;x)
    return false;

//A subset is equal to or smaller.
if (no_bits_ &gt; x.no_bits_)
    return false;

int i;
bool equal = false;

//Pointers should not change.
const unsigned int *tptr = blocks_;
const unsigned int *xptr = x.blocks_;

for (i = 0; i &lt; no_blocks_; i++, tptr++, xptr++) {
    if ((*tptr &amp; *xptr) != *tptr)
        return false;
    if (*tptr != *xptr)
        equal = true;
}

return equal;
</code></pre>
<p>Then comes the SSE variant, which alas does not perform according to my expectations. Both of these snippets should look for the same things.</p>
<pre><code>//starting pointers.
const __m128i* start = (__m128i*)&amp;blocks_;
const __m128i* xstart = (__m128i*)&amp;x.blocks_;

__m128i block;
__m128i xblock;

//Unsigned ints are 32 bits, meaning 4 can fit in a register.
for (i = 0; i &lt; no_blocks_; i+=4) {
    block = _mm_load_si128(start + i);
    xblock = _mm_load_si128(xstart + i);

    //Equivalent to (block &amp; xblock) != block
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
        return false;

    //Equivalent to block != xblock
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
        equal = true;
}

return equal;
</code></pre>
<p>Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?</p>
<p>I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.</p>
<p>Thank you for any help rendered.</p>
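For comparison, here is a minimal, self-contained sketch of the same proper-subset test written as free functions. It is not the class from the question: the names `is_proper_subset_scalar` and `is_proper_subset_sse2` and the parameters `a`, `b`, and `n` are hypothetical stand-ins for `blocks_`, `x.blocks_`, and `no_blocks_`. It rests on a few assumptions. The blocks are taken to be contiguous 32-bit unsigned integers reached through a pointer, so the cast is applied to the pointer value rather than to its address. Unaligned loads are used because the alignment of the storage is not stated. The pointer is advanced in element units, since adding `i` to a `__m128i*` while also stepping `i` by 4 would move 64 bytes per iteration instead of 16. The sketch also handles the `n % 4` tail with the scalar loop, and falls back to scalar code entirely when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Scalar reference: true when every bit set in a[] is also set in b[]
// and the arrays differ in at least one block (a proper subset).
bool is_proper_subset_scalar(const std::uint32_t* a, const std::uint32_t* b,
                             std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    bool differs = false;
    for (std::size_t i = 0; i < n; ++i) {
        if ((a[i] & b[i]) != a[i]) return false;  // a has a bit that b lacks
        if (a[i] != b[i]) differs = true;
    }
    return differs;
}

// SSE2 variant (hypothetical name): four 32-bit blocks per iteration,
// then a scalar loop for the remaining n % 4 blocks.
bool is_proper_subset_sse2(const std::uint32_t* a, const std::uint32_t* b,
                           std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    bool differs = false;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        // Index in uint32_t units first, then cast; unaligned loads because
        // 16-byte alignment of the data is an unverified assumption.
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));

        // _mm_andnot_si128(x, y) computes (~x) & y, so this is a & ~b:
        // the bits of a that are missing from b. It must be all zero.
        const __m128i outside = _mm_andnot_si128(vb, va);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(outside, zero)) != 0xFFFF)
            return false;

        // Any lane where a != b makes the subset a proper one.
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(va, vb)) != 0xFFFF)
            differs = true;
    }
#endif
    // Tail (or the whole array when SSE2 is unavailable).
    for (; i < n; ++i) {
        if ((a[i] & b[i]) != a[i]) return false;
        if (a[i] != b[i]) differs = true;
    }
    return differs;
}
```

Checking the two functions against each other on the same inputs is a cheap way to confirm that a vectorized version computes the same answer before timing it. Whether the vector loop is actually faster than the scalar one depends on the compiler, the optimization flags, and how early the loop typically exits, so it is worth measuring rather than assuming.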