Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Yes, it's possible to use MMX/SSE's <code>pmullw</code> instruction (intrinsic function: <code>_mm_mullo_pi16</code>) to do per-element shifts.</p> <p>The basic idea is to extract alternating 7-bit elements with an AND instruction and perform the <code>pmullw</code> to shift the elements into place. This will accomplish the task for half of the elements, so the process will need to be repeated with a couple of extra shifts.</p> <pre><code>#include &lt;stdio.h&gt; #include &lt;stdint.h&gt; #include &lt;mmintrin.h&gt; __m64 f(__m64 input) { static const __m64 mask = (__m64) 0xfe03f80fe03f80UL; static const __m64 multiplier = (__m64) 0x0080002000080002UL; __m64 t0 = _mm_and_si64 (input, mask); __m64 t1 = _mm_and_si64 (_mm_srli_si64 (input, 7), mask); t0 = _mm_mullo_pi16 (t0, multiplier); t1 = _mm_mullo_pi16 (t1, multiplier); __m64 res = _mm_or_si64 (t0, _mm_slli_si64 (t1, 8)); /* set most significant bits, except for in most significant byte */ return _mm_or_si64 (res, (__m64) 0x0080808080808080UL); } int main(int argc, char *argv[]) { int i; typedef union { __m64 m64; unsigned char _8x8[8]; } type_t; /* 0x7f7e7c7870608080 = {127, 63, 31, 15, 7, 3, 2, 1, 0} */ type_t res0 = { .m64 = f((__m64) 0x7f7e7c7870608080UL) }; for (i = 0; i &lt; 8; i++) { printf("%3u ", res0._8x8[i]); } puts(""); return 0; } </code></pre> <p>The <code>mask</code> extracts alternating 7-bit elements. The <code>multiplier</code> is a constant which allows us to specify per-element shifts. It's derived from looking at the masked input:</p> <pre><code>00000000|dcbaZYX0|000000PO|NMLKJ000|0000BAzy|xwv00000|00nmlkji|h0000000 </code></pre> <p>and realizing that</p> <pre><code>00000000|dcbaZYX0 needs to be shifted by 7 (or multiplied by 2^7, 128, 0x0080) 000000PO|NMLKJ000 needs to be shifted by 5 (or multiplied by 2^5, 32, 0x0020) 0000BAzy|xwv00000 needs to be shifted by 3 (or multiplied by 2^3, 8, 0x0008) 00nmlkji|h0000000 needs to be shifted by 1 (or multiplied by 2^1, 2, 0x0002) </code></pre> <p>This function writes 8-bytes at a time (instead of the 9-bytes your 9 7-bit elements would unpack to), so you'll have to advance the source pointer by only 7-bytes after each iteration. Because of this, a conversion to SSE2 is a bit more complicated.</p> <p>I don't think it's possible to use a different mask and multiplier for <code>t1</code> in order to avoid the shifts, since <code>t1</code>'s elements will cross 16-bit boundaries, which will prevent <code>pmullw</code> from working. But, it still might be possible to optimize somehow.</p> <p>I haven't benchmarked this, but I suspect it's significantly faster than your scalar version. If you benchmark it, please post the results. I'd be very interested to see them.</p> <p>All in all, the algorithm comes out to be 2 shifts, 2 ors, 2 ands, and two multiplies (and a few moves) to generate 8-bytes.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload