<p>Without knowing anything else about your problem or your current implementation, one (somewhat) easy way to improve performance (to some extent) is to manually prefetch the values that your "sum" function is going to operate on.</p>

<p>Ignoring architectural and compiler nuances for now, manual prefetching could look like this:</p>

<pre><code>SmallStruct values [value_count] = {/*whatever*/};
int indices [index_count] = {/*whatever*/};
...
SmallStruct v = values[indices[0]];
for (int i = 1; i &lt; index_count; ++i)
{
    SmallStruct v_next = values[indices[i]];
    DoSomethingWith (v); // Note the *v*
    v = v_next;          // You don't want to copy, but this is the simplest form
}
DoSomethingWith (v);     // Do the final item
</code></pre>

<p>The above is the simplest possible form of prefetching. You can unroll the loop a little to avoid the copying mentioned above, and you probably want to issue more than a single prefetch.</p>

<p>This optimization works because most (all?) modern architectures can have more than one memory request in flight, which means those requests overlap and the average waiting time for those (presumably uncached) requests is divided by their concurrency (which is a good thing!). So it wouldn't matter how many unused cache lines you have; <em>the important factor is the number of concurrent memory reads the memory system can sustain at any given time</em>.</p>

<p><strong>A Note on the Effect of Cache Lines</strong></p>

<p>The above (admittedly simplistic) code ignores two very important facts: the entire <code>SmallStruct</code> cannot be read in one memory access (from the CPU's point of view), which is a bad thing, and memory is always read in units of cache lines (64 or 128 bytes, these days) anyway, which is very good!</p>

<p>So, instead of trying to read the entire <code>values[indices[i]]</code> into <code>v_next</code>, we can read just one byte; assuming the <code>values</code> array is properly aligned, a significant amount of memory (one full cache line) will be loaded and at hand for eventual processing.</p>

<p>Two important points:</p>

<ol>
<li>If your <code>SmallStruct</code> is not in fact small and won't fit entirely in a cache line, rearrange its members to make sure that the parts required in <code>DoSomethingWith()</code> are contiguous, packed, and fit in one cache line. If they still don't fit, consider separating your algorithm into two or more passes, each operating on data that fits in one cache line.</li>
<li>If you just read one byte (or one word, or whatever) from the next value you'll be accessing, make sure the compiler doesn't optimize that read away!</li>
</ol>

<p><strong>Alternate Implementations</strong></p>

<p>The second point above can be expressed in code, like this:</p>

<pre><code>touch (&amp;values[indices[0]]);
for (int i = 0; i &lt; index_count; ++i)
{
    if (i + 1 &lt; index_count)
        touch (&amp;values[indices[i + 1]]);
    DoSomethingWith (values[indices[i]]);
}
</code></pre>

<p>The <code>touch()</code> function is semantically like this (although the real implementation would probably be more involved). The <code>volatile</code> cast keeps the compiler from optimizing the read away:</p>

<pre><code>void touch (void * p)
{
    char c = *(volatile char *)p;
    (void)c; // Suppress the unused-variable warning
}
</code></pre>

<p>To prefetch more than one value, you'd do something like this: (<em>Update</em>: I changed my code to (I believe) a better implementation.) Note that the warm-up loop is guarded against <code>index_count</code> being smaller than <code>PrefetchCount</code>:</p>

<pre><code>const int PrefetchCount = 3;

// Get the ball rolling...
for (int j = 0; j &lt; PrefetchCount &amp;&amp; j &lt; index_count; ++j)
    touch (&amp;values[indices[j]]);

for (int i = 0; i &lt; index_count; ++i)
{
    if (i + PrefetchCount &lt; index_count)
        touch (&amp;values[indices[i + PrefetchCount]]);
    DoSomethingWith (values[indices[i]]);
}
</code></pre>

<p>Again, note that all the implementations above are very simple and simplistic. Also, if you prefetch too much, you can blow your L1 cache and your performance with it.</p>

<p><strong>Doing the Actual Prefetch</strong></p>

<p>x86-64 CPUs have an instruction you can use to ask the CPU to prefetch a cache-line-worth of memory data into its caches. More precisely, with this instruction you give a <em>hint</em> to the CPU that a particular memory location is going to be used by your application, and the CPU will try to bring it into cache. If you do this soon enough, the data will be ready by the time you need it and your computations won't stall.</p>

<p>The instruction is <code>PREFETCH*</code>, and you can use compiler-specific intrinsics instead of resorting to assembly. The intrinsic is called <code>_mm_prefetch</code> for the Microsoft and Intel C++ compilers, and <code>__builtin_prefetch</code> on GCC. (If you end up using this, just remember that you want the lowest level of prefetching, i.e. <code>T0</code>.)</p>

<p>Note that these go into the implementation of the <code>touch</code> function I used above.</p>

<p>I know of no library that does this in a reusable way. Also, I have no familiarity with C# libraries to know whether these are available there or not.</p>