
<p>I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, so you would almost always see L1 or L2 hit latencies.</p>
<p>More details can be found here - <a href="http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers" rel="noreferrer">http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers</a></p>
<p>For your benchmark you should either disable them (through the BIOS or any other means), or at least make your steps longer by replacing <code>j*16</code> (* 4 bytes per int = 64B, one cache line - a classic unit stride for the stream detector) with <code>j*64</code> (4 cache lines). The reason is that the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code jumps over 2 lines, but becomes mostly useless with longer jumps (3 isn't a good choice because of your modulo; you need a divisor of step_size).</p>
<p>Update the question with the new results and we can figure out if there's anything else here.</p>
<hr>
<p><em>EDIT1</em>: Ok, I ran the fixed code and got - </p>
<pre><code>1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817
</code></pre>
<p>It makes much more sense if you ignore a few entries - you jump after 32k (L1 size), but instead of jumping after 256k (L2 size), we get too good a result for 384, and jump only at 512k. The last jump is at 8M (my LLC size), but the 9216 entry is broken again.</p>
<p>This allows us to spot the next error - ANDing with the size mask only makes sense when the size is a power of 2. Otherwise you don't wrap around, but instead repeat some of the last addresses again (which produces optimistic results, since they are still fresh in the cache).</p>
<p>Try replacing the <code>... &amp; size_mask</code> with <code>% (steps[i]/sizeof(int))</code>. The modulo is more expensive, but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size).</p>
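<p>To make the wrap-around problem concrete, here is a minimal sketch in C. It is not your benchmark: the helper names (<code>wrap_mask</code>, <code>wrap_mod</code>, <code>next_index</code>, <code>count_distinct</code>) are made up for illustration, and it assumes the mask is computed as "element count minus one" and that the index advances as <code>j * stride</code>. It only counts how many distinct array slots each wrapping scheme touches.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Wrap with a bitmask: only equivalent to i % n when n is a power of two. */
static size_t wrap_mask(size_t i, size_t n) { return i & (n - 1); }

/* Wrap with modulo: correct for any n > 0, at the cost of a division. */
static size_t wrap_mod(size_t i, size_t n) { assert(n > 0); return i % n; }

/* Running index that is reset to zero once it reaches the buffer size
   (the cheaper alternative to the modulo mentioned above). */
static size_t next_index(size_t idx, size_t stride, size_t n)
{
    idx += stride;
    if (idx >= n)
        idx = 0;
    return idx;
}

/* Count the distinct slots of an n-element buffer touched by `iters`
   accesses at index wrap(j * stride). Hypothetical helper, not part of
   the original benchmark. Returns 0 if the scratch allocation fails. */
static size_t count_distinct(size_t n, size_t stride, size_t iters, int use_mask)
{
    unsigned char *seen = calloc(n, 1);
    size_t distinct = 0;

    if (seen == NULL)
        return 0;
    for (size_t j = 0; j < iters; j++) {
        size_t idx = use_mask ? wrap_mask(j * stride, n)
                              : wrap_mod(j * stride, n);
        if (!seen[idx]) {
            seen[idx] = 1;
            distinct++;
        }
    }
    free(seen);
    return distinct;
}
```

<p>Under these assumptions, a 384 KB buffer of 4-byte ints (98304 elements) walked with a stride of 16 gives <code>count_distinct(98304, 16, 24576, 1) == 4096</code> with the mask but <code>count_distinct(98304, 16, 24576, 0) == 6144</code> with the modulo. The masked walk touches only 4096 cache lines (256 KB worth) instead of 6144, which would be consistent with the 384 entry looking too good. For a power-of-2 size such as 65536 elements, both schemes touch the same number of slots.</p>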
 
