Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>You can exploit some minor parallelism for large register lengths and small sums. For instance, adding up 16 values of 1 byte (which happen to fit into one sse register) requires only log<sub>2</sub>16 additions and an equal number of shifts.<br> Not much, but faster than 15 depended additions and the additional memory accesses.</p> <pre><code>__m128i x = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3); x = _mm_add_epi8(x, _mm_srli_si128(x, 1)); x = _mm_add_epi8(x, _mm_srli_si128(x, 2)); x = _mm_add_epi8(x, _mm_srli_si128(x, 4)); x = _mm_add_epi8(x, _mm_srli_si128(x, 8)); // x == 3, 4, 11, 11, 15, 16, 22, 25, 28, 29, 36, 36, 40, 41, 47, 50 </code></pre> <p>If you have longer sums, the dependencies could be hidden by exploiting instruction level parallelism and taking advantage of instruction reordering.</p> <p>Edit: something like</p> <pre><code>__m128i x0 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3); __m128i x1 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3); __m128i x2 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3); __m128i x3 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3); __m128i mask = _mm_set_epi8(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0); x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 1)); x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 1)); x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 1)); x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 1)); x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 2)); x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 2)); x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 2)); x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 2)); x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 4)); x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 4)); x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 4)); x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 4)); x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 8)); x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 8)); x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 8)); x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 8)); x1 = _mm_add_epi8(_mm_shuffle_epi8(x0, mask), x1); x2 = _mm_add_epi8(_mm_shuffle_epi8(x1, mask), x2); x3 = _mm_add_epi8(_mm_shuffle_epi8(x2, mask), x3); </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload