StackOverflow2013

<p>The likely reason that 4b is faster than 4a is that it parallelizes better. From 4a:</p>
<pre><code>mov bl, al;
and bl, $01;          // data dep (bl)
mov [edx], bl;        // data dep (bl)
shr al, $01;
mov bl, al;           // data dep (al)
and bl, $01;          // data dep (bl)
mov [edx + $01], bl;  // data dep (bl)
</code></pre>
<p>Instructions marked "data dep" cannot begin executing until the previous instruction has finished; the register in parentheses is the one that causes the dependency. Modern CPUs can start an instruction before the previous one has completed, as long as there is no dependency between them. The way you've ordered these operations prevents that.</p>
<p>In 4b, you have fewer data dependencies:</p>
<pre><code>mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx], bl;
mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx + $01], bl;
</code></pre>
<p>With this ordering, fewer instructions depend on the instruction immediately before them, so there is more opportunity for parallelism.</p>
<p>I can't guarantee that this is the reason for the speed difference, but it is a likely candidate. Unfortunately, answers as absolute as the ones you are looking for are hard to come by: modern processors have branch predictors, multi-level caches, hardware prefetchers, and all sorts of other complexities that make it difficult to isolate the reasons for performance differences. The best you can do is read a lot, perform experiments, and get familiar with the tools for taking good measurements.</p>
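<p>As a rough illustration of the same idea outside of assembly, here is a minimal C sketch. It is not your routine: it sums an array of doubles once through a single accumulator (one long dependency chain) and once through two independent accumulators (two shorter chains the CPU may overlap). The names <code>sum_single_chain</code>, <code>sum_two_chains</code> and <code>time_sum</code> are hypothetical, and whether the second version is actually faster depends on the CPU, the compiler and its flags, so treat it as something to measure rather than a guaranteed result.</p>

```c
#include <stddef.h>
#include <time.h>

/* One accumulator: every add needs the result of the previous add,
   so the adds form a single serial dependency chain. */
double sum_single_chain(const double *values, size_t count) {
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];              /* data dep (total) */
    }
    return total;
}

/* Two accumulators: the add into total_even does not depend on the
   add into total_odd, so the two chains can overlap in the pipeline. */
double sum_two_chains(const double *values, size_t count) {
    double total_even = 0.0;
    double total_odd = 0.0;
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        total_even += values[i];         /* data dep (total_even) */
        total_odd += values[i + 1];      /* data dep (total_odd) only */
    }
    if (i < count) {
        total_even += values[i];         /* leftover element for odd counts */
    }
    return total_even + total_odd;
}

typedef double (*sum_fn)(const double *, size_t);

/* Hypothetical timing helper: returns the CPU seconds spent on `repeats`
   calls of `fn` and stores the accumulated result in *result_out.
   The volatile function pointer keeps the compiler from collapsing the
   repeated calls into one. */
double time_sum(sum_fn fn, const double *values, size_t count,
                int repeats, double *result_out) {
    volatile sum_fn target = fn;
    double result = 0.0;
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r) {
        result += target(values, count);
    }
    clock_t end = clock();
    *result_out = result;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

<p>To experiment, call <code>time_sum</code> with each function on the same large array and enough repeats to run for a noticeable fraction of a second, then compare the two timings over several runs. Note that splitting a floating-point sum changes the order of the additions, so the two results can differ in the last bits for general input.</p>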
