First: this sort of sensitivity to data placement is unfortunately common. Some of us have written code that tries multiple different layouts.

The usual culprits for performance losses such as this are:

* branch mispredictions
* cache effects
  * capacity misses (just plain too much data, e.g. 1 MB of data does not fit in a 32 KB cache)
  * cache conflicts (e.g. more than 4 addresses that are the same modulo 8K in a 4-way associative 32 KB cache)
* DRAM effects
  * DRAM page misses

I am having trouble parsing what you say: what is MAXSIZE? You say `7*4KB`... but you have 8 arrays, so I doubt that you are saying that `MAXSIZE=1024`. Are you saying that MAXSIZE is `7*1024` (times 4 B per float)?

Anyway: if MAXSIZE for each individual array is circa 28 KB, then you are near cache size for many systems. In this case I would suspect DRAM page effects - I would suspect that the better-performing arrangement puts the most-accessed array in a separate DRAM page.

You don't say which arrangement performs better, but I would guess:

```
float amparray[maxsize]; //these two make the most change
float timearray[maxsize]; //these two make the most change
```

Eyeballing your code, timearray seems to be the most accessed. If the performance is better with timearray second, and my guess about MAXSIZE is correct, then I would bet that it is DRAM page effects.

Quick explanation: DRAMs have the concepts of pages and banks, not to be confused with OS pages. Each DRAM chip, and hence each DIMM, has 4 or 8 internal banks. Each bank can have one open page. If you access data from the same page in the same bank, it is fastest. If you access data from the already-open page in a different bank, that is fast, but slower than same page, same bank. If you need a different page in the same bank, it is really slow. And if you have a write-back cache, the writebacks occur almost at random, so you can get really bad page behavior.

However, if I have guessed wrong about MAXSIZE, then it is probably a cache effect.

RED FLAG: you say "I didn't put in things like stride". *Strides* are notorious for making data behave poorly in the cache. Caches are typically set associative, meaning that they have what I call "resonance": addresses that are the same modulo the resonance of the cache will map to the same set. If you have more such addresses than the associativity, you will thrash.

Calculate the resonance as the cache size divided by the associativity. E.g. if you have a 32K 4-way associative cache, your resonance is 8K.
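To make that concrete, here is a minimal sketch of the set calculation. The cache parameters (32 KB, 4-way, 64-byte lines) are example numbers I picked, not necessarily your machine's:

```
#include <stdint.h>
#include <stdio.h>

/* Example cache parameters - substitute your actual hardware's. */
#define CACHE_SIZE    (32 * 1024)   /* 32 KB                  */
#define ASSOCIATIVITY 4             /* 4-way set associative  */
#define LINE_SIZE     64            /* 64-byte cache lines    */

#define RESONANCE (CACHE_SIZE / ASSOCIATIVITY)   /* 8 KB */
#define NUM_SETS  (RESONANCE / LINE_SIZE)        /* 128  */

/* Which set does this address land in? Addresses that are equal
 * modulo RESONANCE map to the same set. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr % RESONANCE) / LINE_SIZE);
}

int main(void)
{
    /* Two hypothetical addresses 8 KB apart: same set, so they
     * compete for the same ASSOCIATIVITY ways. */
    uintptr_t a = 0x100000;
    uintptr_t b = a + RESONANCE;
    printf("set(a) = %u, set(b) = %u\n", cache_set(a), cache_set(b));
    return 0;
}
```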
Anyway... if you are only accessing things on a stride, then array placement can matter. E.g. say that you have a stride of 16, i.e. you access elements 0, 16, 32, 48, etc. If MAXSIZE was `7*1024`, as I guessed above, then with

```
float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize]; //these two make the most change
float timearray[maxsize]; //these two make the most change
```

the following arrays will conflict - their strided access patterns will map to the same sets:

```
trigarray1, trigarray5
trigarray2, temparray
trigarray3, amparray
trigarray4, timearray
```

If you interchange amparray and timearray, then

```
trigarray3 will conflict with timearray
and trigarray4 with amparray
```

trigarray4 and timearray seem to be the most used, so I am guessing that if you have a stride like 0, 16, 32, 48, or indeed any stride beginning with 0, then those two arrays conflicting is your problem.

However, you might have different stride patterns: 0, 16, 32, 48, ... in one array, and 1, 17, 33, ... in the other. Then different pairs of arrays would conflict.

--

I don't have enough info to diagnose your problem here.

You might be able to do it yourself if you have access to good performance tools.

E.g. on Intel processors, you could record what I call a cache miss profile: record the (ideally physical) memory addresses of the misses, compute which sets they map to in the cache, and generate a histogram. If you see spikes, that's likely a problem. Similarly, you can generate DRAM page miss or bank miss profiles. I only mention Intel because I designed some of the hardware to enable this sort of performance measurement. The same sort of thing may be, or should be, available on ARM (if not, maybe I could get rich selling tools to do it... :-) ). A rough sketch of the set histogram is at the end of this answer.

If these are the problem, how can you fix it?

Well, by trying different placements, as you explain above. This can help with both strides (cache set conflicts) and DRAM page problems.

If strides are a problem, you might try making the array sizes a bit different - MAXSIZE + 4, MAXSIZE + 8, etc. This can effectively offset the strides. (It's common in supercomputer codes to see arrays of size 255 or 257, for the same reason of offsetting strided access patterns so as not to conflict.)
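For instance, the padding might look something like this. The pad of 16 floats (one 64-byte cache line) per array is just an illustrative choice, and MAXSIZE here stands in for whatever your real value is:

```
#define MAXSIZE (7 * 1024)   /* my guess at your value - substitute yours      */
#define PAD     16           /* 16 floats = 64 bytes = one example cache line  */

/* Give each array a slightly different size so that, laid out back
 * to back, their strided accesses no longer land on the same cache
 * sets. The extra elements are never used. */
float trigarray1[MAXSIZE + 0 * PAD];
float trigarray2[MAXSIZE + 1 * PAD];
float trigarray3[MAXSIZE + 2 * PAD];
float trigarray4[MAXSIZE + 3 * PAD];
float trigarray5[MAXSIZE + 4 * PAD];
float temparray [MAXSIZE + 5 * PAD];
float amparray  [MAXSIZE + 6 * PAD];
float timearray [MAXSIZE + 7 * PAD];
```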
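And here is the promised sketch of the set histogram. It assumes you already have a list of miss addresses from whatever sampling tool you use - I am not assuming any particular one - and it reuses the example cache parameters from the earlier sketch:

```
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE    (32 * 1024)                /* example parameters again */
#define ASSOCIATIVITY 4
#define LINE_SIZE     64
#define RESONANCE (CACHE_SIZE / ASSOCIATIVITY)   /* 8 KB */
#define NUM_SETS  (RESONANCE / LINE_SIZE)        /* 128  */

/* Histogram the cache sets hit by a list of (ideally physical) miss
 * addresses. Tall spikes mean many misses piling onto a few sets,
 * i.e. a likely conflict problem. */
void set_histogram(const uintptr_t *miss_addrs, size_t n)
{
    unsigned long counts[NUM_SETS] = { 0 };

    for (size_t i = 0; i < n; i++)
        counts[(miss_addrs[i] % RESONANCE) / LINE_SIZE]++;

    for (unsigned s = 0; s < NUM_SETS; s++)
        if (counts[s] != 0)
            printf("set %3u: %lu misses\n", s, counts[s]);
}

int main(void)
{
    /* Toy data: three "misses" that all land in the same set. */
    uintptr_t misses[] = { 0x10000, 0x10000 + RESONANCE, 0x10000 + 2 * RESONANCE };
    set_histogram(misses, sizeof misses / sizeof misses[0]);
    return 0;
}
```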