<p>After the comparison, you have an array of 8 booleans, each represented by <code>0xff</code> or <code>0x00</code>. SIMD comparisons (on any architecture) produce those values so that they are useful for a bit-mask operation (and/or bit-select in NEON's case): you can turn the result into an arbitrary value quickly, without a multiply.</p>
<p>So rather than reducing them to <code>1</code> or <code>0</code> and shifting them about, you'll find it easier to mask them with the constant <code>0x8040201008040201</code>. Each lane then contains only the bit corresponding to its position in the final result (lane 0 keeps <code>0x01</code>, lane 7 keeps <code>0x80</code>). You can pre-load the constant into another register (I'll use <code>d3</code>).</p>
<pre><code>VAND d0, d2, d3
</code></pre>
<p>Then, to combine the results, you can use <code>VPADD</code> (instead of <code>OR</code>), which combines adjacent pairs of lanes: <code>d0[0] = d0[0] + d0[1]</code>, <code>d0[1] = d0[2] + d0[3]</code>, and so on. Since the bit patterns do not overlap, there is no carry, and addition works just as well as OR. Also, because each source register produces only half as many output lanes as it has input lanes, the second half of the output has to be filled with junk. I've used a second copy of <code>d0</code> for that.</p>
<p>You'll need to do the add three times to get all columns combined.</p>
<pre><code>VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
</code></pre>
<p>The result is now in <code>d0[0]</code>.</p>
<p>As you can see, <code>d0</code> has room for seven more results, and some lanes of the <code>VPADD</code> operations have been working with junk data. It would be better if you could fetch more data at once and feed that additional work in as you go, so that none of the arithmetic is wasted.</p>
<hr>
<p><strong>EDIT</strong></p>
<p>Suppose the loop is unrolled four times, with the comparison results in <code>d4</code>, <code>d5</code>, <code>d6</code>, and <code>d7</code>. The constant mentioned earlier should be loaded into, say, both <code>d30</code> and <code>d31</code>, and then some <code>q</code> register arithmetic can be used:</p>
<pre><code>VAND q0, q2, q15
VAND q1, q3, q15
VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0
</code></pre>
<p>The final result is in the byte lanes <code>d0[0..3]</code>, or simply the 32-bit value in <code>d0[0]</code>.</p>
<p>There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.</p>
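<p>To check the lane arithmetic above without ARM hardware, here is a minimal sketch in portable C. It is a scalar model only, not NEON code: the helper names <code>vand8</code>, <code>vpadd8</code>, <code>movemask8</code> and <code>movemask8x4</code> are hypothetical, and each <code>uint8_t[8]</code> array stands in for one <code>d</code> register. It assumes every compare lane is exactly <code>0x00</code> or <code>0xff</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of 0x8040201008040201, lane 0 first (assumed little-endian lane order). */
static const uint8_t LANE_BITS[8] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};

/* Scalar model of VAND on a d register: lane-wise bitwise AND. */
static void vand8(uint8_t out[8], const uint8_t a[8], const uint8_t b[8]) {
    for (int i = 0; i < 8; i++) out[i] = a[i] & b[i];
}

/* Scalar model of VPADD.u8 Dd, Dn, Dm: pair sums of n fill lanes 0..3,
   pair sums of m fill lanes 4..7. */
static void vpadd8(uint8_t out[8], const uint8_t n[8], const uint8_t m[8]) {
    uint8_t tmp[8]; /* temporary, so that out may alias n or m */
    for (int i = 0; i < 4; i++) {
        tmp[i]     = (uint8_t)(n[2 * i] + n[2 * i + 1]);
        tmp[i + 4] = (uint8_t)(m[2 * i] + m[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++) out[i] = tmp[i];
}

/* Single-register version: one VAND, then three VPADDs. Bit i of the
   returned byte is set when compare lane i was 0xff. */
uint8_t movemask8(const uint8_t cmp[8]) {
    uint8_t d0[8];
    for (int i = 0; i < 8; i++) assert(cmp[i] == 0x00 || cmp[i] == 0xff);
    vand8(d0, cmp, LANE_BITS);
    vpadd8(d0, d0, d0);
    vpadd8(d0, d0, d0);
    vpadd8(d0, d0, d0);
    return d0[0];
}

/* Four-times-unrolled version: c0..c3 play the roles of d4..d7. */
uint32_t movemask8x4(const uint8_t c0[8], const uint8_t c1[8],
                     const uint8_t c2[8], const uint8_t c3[8]) {
    uint8_t d0[8], d1[8], d2[8], d3[8];
    vand8(d0, c0, LANE_BITS);   /* VAND q0, q2, q15 (low half)  */
    vand8(d1, c1, LANE_BITS);   /* VAND q0, q2, q15 (high half) */
    vand8(d2, c2, LANE_BITS);   /* VAND q1, q3, q15 (low half)  */
    vand8(d3, c3, LANE_BITS);   /* VAND q1, q3, q15 (high half) */
    vpadd8(d0, d0, d1);
    vpadd8(d2, d2, d3);
    vpadd8(d0, d0, d2);
    vpadd8(d0, d0, d0);
    /* Byte lanes 0..3 read as one little-endian 32-bit lane. */
    return (uint32_t)d0[0] | (uint32_t)d0[1] << 8 |
           (uint32_t)d0[2] << 16 | (uint32_t)d0[3] << 24;
}
```

<p>For example, compare results of <code>0xff</code> in lanes 0, 2, 3 and 7 collapse to <code>0x8d</code> in <code>movemask8</code>. In the unrolled model, no <code>VPADD</code> lane is spent on junk until the last step, which duplicates the four result bytes into the upper half.</p>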