<p>First, you can speed up the original code a little by hoisting the multiply out of the inner loop and getting rid of the branch:</p>
<pre><code>int whiteCount = 0;
for (int q = i; q &lt; i + windowHeight; q++)
{
    const bool * const row = &amp;imageData[q * W];
    for (int w = j; w &lt; j + windowWidth; w++)
    {
        whiteCount += row[w];
    }
}
</code></pre>
<p>(This assumes that <code>imageData[]</code> is truly binary, i.e. each element can only ever be 0 or 1.)</p>
<p>Here is a simple NEON implementation:</p>
<pre><code>#include &lt;arm_neon.h&gt;

// ...

int q, w;
int whiteCount = 0;
uint32x4_t v_count = vdupq_n_u32(0);

for (q = i; q &lt; i + windowHeight; q++)
{
    const bool * const row = &amp;imageData[q * W];
    uint16x8_t vrow_count = vdupq_n_u16(0);

    for (w = j; w &lt;= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8((const uint8_t *)&amp;row[w]); // load 16 x 8 bit pixels
        vrow_count = vpadalq_u8(vrow_count, v);  // accumulate 16 bit row counts
    }
    for ( ; w &lt; j + windowWidth; ++w)            // scalar clean up loop
    {
        whiteCount += row[w];
    }
    v_count = vpadalq_u16(v_count, vrow_count);  // update 32 bit image counts
}                                                // from 16 bit row counts

// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
</code></pre>
<p>(This assumes that <code>imageData[]</code> is truly binary, <code>imageWidth &lt;= 2^19</code>, and <code>sizeof(bool) == 1</code>.)</p>
<hr>
<p>Updated version for <code>unsigned char</code> data, with values of 255 for white and 0 for black:</p>
<pre><code>#include &lt;arm_neon.h&gt;

// ...

int q, w;
int whiteCount = 0;
const uint8x16_t v_mask = vdupq_n_u8(1);
uint32x4_t v_count = vdupq_n_u32(0);

for (q = i; q &lt; i + windowHeight; q++)
{
    const uint8_t * const row = &amp;imageData[q * W];
    uint16x8_t vrow_count = vdupq_n_u16(0);

    for (w = j; w &lt;= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&amp;row[w]);        // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                 // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v);  // accumulate 16 bit row counts
    }
    for ( ; w &lt; j + windowWidth; ++w)            // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count);  // update 32 bit image counts
}                                                // from 16 bit row counts

// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
</code></pre>
<p>(This assumes that <code>imageData[]</code> has values of 255 for white and 0 for black, and <code>imageWidth &lt;= 2^19</code>.)</p>
<hr>
<p><em>Note that all the above code is untested and may need some further work.</em></p>
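Since the NEON code above is untested, a plain scalar reference can be useful for cross-checking its result on the same window. The following is only a sketch under stated assumptions: the function name `count_white_ref` and its parameters (`stride`, `x0`, `y0`, `win_w`, `win_h`) are hypothetical and do not appear in the original code, and the image is assumed to hold only the values 0 and 255, as in the updated version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference count of white pixels in a rectangular window.
 * Hypothetical helper: names and signature are illustrative only.
 * Assumes every pixel is either 0 (black) or 255 (white). */
static int count_white_ref(const uint8_t *image, size_t stride,
                           size_t x0, size_t y0,
                           size_t win_w, size_t win_h)
{
    assert(image != NULL);

    int count = 0;
    for (size_t y = y0; y < y0 + win_h; y++) {
        /* Compute the row pointer once per row, outside the inner loop. */
        const uint8_t *row = &image[y * stride];
        for (size_t x = x0; x < x0 + win_w; x++) {
            /* Same masking idea as the SIMD loop: 255 & 1 == 1, 0 & 1 == 0. */
            count += row[x] & 1;
        }
    }
    return count;
}
```

Comparing this count against the NEON total for a few window sizes, including widths that are not a multiple of 16, exercises both the SIMD loop and the scalar clean-up loop.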