<p>EDIT: Why is a block-oriented approach faster? We take advantage of the CPU's data cache by making sure the entire block fits into the cache, whether we iterate over the block by row or by column.</p>
<p>For example, if you have a cache line of 32 bytes and an <code>int</code> is 4 bytes, you can fit an 8x8 <code>int</code> matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want.</p>
<p>If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal so that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of its layout (column by column for a row-major matrix), you will miss the cache on almost every element you visit.</p>
<p>A block-oriented approach ensures that you only take a cache miss for data you will visit before the CPU has to evict that cache line. In other words, a block-oriented approach tuned to the cache line size will ensure you don't thrash the cache.</p>
<p>So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:</p>
<pre><code>#include &lt;stdlib.h&gt;  /* abs */

/* block_size must divide 512 evenly, otherwise the loops read out of bounds. */
int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;

    // sum the blocks above the diagonal, pairing each with its mirror block
    for (block_i = 0; block_i &lt; 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j &lt; 512; block_j += block_size)
            for (i = 0; i &lt; block_size; i++)
                for (j = 0; j &lt; block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);

    // each mirrored pair counts twice, because |a - b| == |b - a|
    result += result;

    // sum the blocks on the diagonal
    for (int block_offset = 0; block_offset &lt; 512; block_offset += block_size)
    {
        for (i = 0; i &lt; block_size; ++i)
        {
            for (j = i + 1; j &lt; block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }

    return result;
}
</code></pre>
<p>You should experiment with various values for <code>block_size</code>. On my machine, <code>8</code> led to the biggest speedup (2.5x) compared to a <code>block_size</code> of 1 (and ~5x compared to the original iteration over the entire matrix). The <code>block_size</code> should ideally be <code>cache_line_size_in_bytes/sizeof(int)</code>.</p>
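When experimenting with block sizes, it may help to have a starting value and a slow reference result to compare against. The following is a minimal sketch under stated assumptions, not part of the original answer: the names `block_size_for_cache_line` and `sum_diagonal_difference_naive` are hypothetical, and the cache line size is a value you supply yourself (it is not queried from the hardware here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define N 512  /* matrix dimension used in the answer above */

/* Hypothetical helper: how many ints fit in one cache line.
 * cache_line_bytes is an assumption provided by the caller (e.g. 32 or 64). */
static int block_size_for_cache_line(size_t cache_line_bytes)
{
    assert(cache_line_bytes >= sizeof(int));
    return (int)(cache_line_bytes / sizeof(int));
}

/* Hypothetical naive reference: visits every (i, j) pair directly, so the
 * array[j][i] reads run against the row-major layout. It is slow, but it
 * gives the value a blocked implementation should reproduce. */
static int sum_diagonal_difference_naive(int array[N][N])
{
    int result = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            result += abs(array[i][j] - array[j][i]);
    return result;
}
```

A blocked routine called with `block_size_for_cache_line(32)` (or whatever line size your machine has) should return the same total as the naive reference for any input matrix, as long as the block size divides `N` evenly. Treat the computed value as a starting point and measure nearby sizes as well.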