Note that there are some explanatory texts on larger screens.

plurals
  1. POPerformance Optimization for Matrix Rotation
    primarykey
    data
    text
    <p>I'm now trapped by a performance optimization lab in the book "Computer System from a Programmer's Perspective" described as following:</p> <p>In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as: Transpose: interchange elements M(i,j) and M(j,i) Exchange rows: Row i is exchanged with row N-1-i</p> <p>A example for matrix rotation(N is 3 instead of 32 for simplicity):</p> <pre><code>------- ------- |1|2|3| |3|6|9| ------- ------- |4|5|6| after rotate is |2|5|8| ------- ------- |7|8|9| |1|4|7| ------- ------- </code></pre> <p>A naive implementation is:</p> <pre><code>#define RIDX(i,j,n) ((i)*(n)+(j)) void naive_rotate(int dim, pixel *src, pixel *dst) { int i, j; for (i = 0; i &lt; dim; i++) for (j = 0; j &lt; dim; j++) dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; } </code></pre> <p>I come up with an idea by inner-loop-unroll. The result is:</p> <pre><code>Code Version Speed Up original 1x unrolled by 2 1.33x unrolled by 4 1.33x unrolled by 8 1.55x unrolled by 16 1.67x unrolled by 32 1.61x </code></pre> <p>I also get a code snippet from pastebin.com that seems can solve this problem:</p> <pre><code>void rotate(int dim, pixel *src, pixel *dst) { int stride = 32; int count = dim &gt;&gt; 5; src += dim - 1; int a1 = count; do { int a2 = dim; do { int a3 = stride; do { *dst++ = *src; src += dim; } while(--a3); src -= dim * stride + 1; dst += dim - stride; } while(--a2); src += dim * (stride + 1); dst -= dim * dim - stride; } while(--a1); } </code></pre> <p>After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version.</p> <p>Here are the questions:</p> <ol> <li><p>In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length?</p></li> <li><p>What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version.</p></li> </ol> <p>EDIT:</p> <p>My test environment is Intel Centrino Duo architecture and the verion of gcc is 4.4</p> <p>Any advice will be highly appreciated!</p> <p>Kind regards!</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload