Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>There is no much to optimize in the original pseudo-code. I see several small improvements here:</p> <ul> <li>loop may perform one iteration less (i.e. size-1),</li> <li>if 'use' is zero, you may break the loop early,</li> <li><code>use = (m == 0) &amp; (ml == 1)</code> probably may be simplified to <code>use = ~m &amp; ml</code>,</li> <li>if <code>~</code> is counted as separate operation, it would be better to use the inverted form : <code>use = m | ~ml</code>, <code>d = ~use .* dl + use .* d</code>, <code>use_r = [1 use(1:end-1)]</code>, <code>d = d .*use_r</code></li> </ul> <p>But it is possible to invent better algorithms. And the choice of algorithm depends on CPU resources used:</p> <ul> <li>Load-Store Unit, i.e. apply algorithm directly to memory words. Nothing can be done here until chipmakers add highly parallel SCATTER instruction to their instruction sets.</li> <li>SSE registers, i.e. algorithms working on entire 16 bytes of the registers. Algorithms like the proposed pseudo-code cannot help here because we already have various shuffle/permute instructions which make the work better. Using various compare instructions with PMOVMSKB, grouping the result by 4 bits and applying various shuffle instructions under switch/case (as described by LastCoder) is the best we can do.</li> <li>SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB.</li> <li>Integer registers, i.e. GPR registers or working simultaneously on several DWORD/QWORD parts of SSE/AVX registers (which allows to perform several independent compactions). The proposed pseudo-code applied to integer registers allows to compact binary subsets of any length (from 2 to 20 bits). Here is my algorithm, which is likely to perform better.</li> </ul> <p>C++, 64 bit, subset width = 8:</p> <pre><code>typedef unsigned long long ull; const ull h = 0x8080808080808080; const ull l = 0x0101010101010101; const ull end = 0xffffffffffffffff; // uncompacted bytes ull x = 0x0100802300887700; // set hi bit for zero bytes (see D.Knuth, volume 4) ull m = h &amp; ~(x | ((x|h) - l)); // bitmask for nonzero bytes m = ~(m | (m - (m&gt;&gt;7))); // tail zero bytes need no special treatment m |= (m - 1); while (m != end) { ull tailm = m ^ (m + 1); // bytes to be processed ull tailx = x &amp; tailm; // get the bytes tailm |= (tailm &lt;&lt; 8); // shift 1 byte at a time m |= tailm; // all processed bytes are masked x = (x ^ tailx) | (tailx &lt;&lt; 8); // actual byte shift } </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload