<p>The most important thing you already have right. (If you were using Fortran, which stores arrays in column-major order, you'd be doing it exactly wrong, but that's another story.) What you have right is that your inner loop runs along the direction in which memory addresses are closest together. A single memory fetch (beyond the cache) pulls in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j, and a similar number from i±1,j and from i,j±1. So you basically have five disjoint sections of memory active. For small values of LENGTH these collapse to only one or three sections. Caches are built so that the active set can hold more than this many disjoint sections of memory.</p>
<p>I hope process() is small and inlined. Otherwise the memory-access cost discussed here may well be insignificant next to the call. It will also affect whether your code fits in the instruction cache.</p>
<p>Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ) and then take *(p++) inside the loop.</p>
<pre><code>input[i+1][j][k];
</code></pre>
<p>is asking the compiler to generate three adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.</p>
<pre><code>*inputIplusOneJK++
</code></pre>
<p>is asking for one add and a memory reference.</p>
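<p>To make the five-pointer suggestion concrete, here is a minimal sketch of both forms side by side. The value of LENGTH, the function names, and the plain summation standing in for process() are all assumptions for illustration; the original loop body is not shown in the question. Both functions add the same seven values in the same order, so they return identical results. A modern optimizing compiler may well generate similar code for both, so it is worth measuring before committing to either.</p>

```c
#include <assert.h>

/* Hypothetical size: the question's real LENGTH is not given here. */
enum { LENGTH = 8 };

/* Indexed form: every access recomputes the full 3-D offset. */
double stencil_row_indexed(double input[LENGTH][LENGTH][LENGTH], int i, int j)
{
    assert(i >= 1 && i < LENGTH - 1 && j >= 1 && j < LENGTH - 1);
    double total = 0.0;
    for (int k = 1; k < LENGTH - 1; ++k) {
        /* Summation is a stand-in for the question's process(). */
        total += input[i][j][k] + input[i][j][k + 1] + input[i][j][k - 1]
               + input[i + 1][j][k] + input[i - 1][j][k]
               + input[i][j + 1][k] + input[i][j - 1][k];
    }
    return total;
}

/* Pointer form: five pointers, one per disjoint section of memory.
   The centre pointer also serves the k+1 and k-1 neighbours. */
double stencil_row_pointers(double input[LENGTH][LENGTH][LENGTH], int i, int j)
{
    assert(i >= 1 && i < LENGTH - 1 && j >= 1 && j < LENGTH - 1);
    const double *center = &input[i][j][1];
    const double *iPlus  = &input[i + 1][j][1];
    const double *iMinus = &input[i - 1][j][1];
    const double *jPlus  = &input[i][j + 1][1];
    const double *jMinus = &input[i][j - 1][1];
    double total = 0.0;
    for (int k = 1; k < LENGTH - 1; ++k) {
        total += center[0] + center[1] + center[-1]
               + *iPlus++ + *iMinus++ + *jPlus++ + *jMinus++;
        ++center;  /* advance along k, the contiguous direction */
    }
    return total;
}
```

<p>Both functions skip the boundary planes, so callers are expected to pass interior values of i and j only.</p>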
 
