ZGEMM using SSE, not giving speedup
<p>I'm performing a matrix-matrix multiplication of complex doubles (ZGEMM) and thought using SSE would help, but it doesn't; in fact, it slows the code down. Could this be because the computation is memory bound?</p>
<p>Here's the pseudocode:</p>
<p>For multiplying two complex doubles I use the following, as proposed by Intel (assuming the real and imaginary parts are stored contiguously):</p>
<p>If M = (a+ib) and IN = (c+id):</p>
<pre><code>M1 = _mm_loaddup_pd(&amp;M[0]);      // M1 -&gt; |a|a|
I1 = _mm_load_pd(&amp;IN[0]);        // I1 -&gt; |d|c|
T1 = _mm_mul_pd(M1,I1);          // T1 -&gt; |a*d|a*c|
M1 = _mm_loaddup_pd(&amp;M[1]);      // M1 -&gt; |b|b|
I1 = _mm_shuffle_pd(I1,I1,1);    // I1 -&gt; |c|d|
I1 = _mm_mul_pd(M1,I1);          // I1 -&gt; |b*c|b*d|
T1 = _mm_addsub_pd(T1,I1);       // T1 -&gt; |ad+bc|ac-bd|
</code></pre>
<p>Thus T1 stores the result of the complex multiplication.</p>
<p>This is the matrix multiplication (<code>C[i][j] += A[i][k]*B[k][j]</code>):</p>
<pre><code>/* Assumes real and imaginary elements are stored contiguously */
/* Loop order ikj for better cache locality, with a 2*2 block matrix multiply */
for(i=0;i&lt;N;i+=2){
    for(k=0;k&lt;N;k+=2){
        /* Perform the _mm_loaddup_pd() part here for A[i][k], A[i][k+1],
           A[i+1][k], A[i+1][k+1]: since I'm blocking for a 2*2 matrix multiply,
           this loads duplicates of 8 double values into 8 SIMD registers */
        A00r = _mm_loaddup_pd(&amp;A[(i*N+k)*2+0]);
        A00i = _mm_loaddup_pd(&amp;A[(i*N+k)*2+1]);
        A01r = _mm_loaddup_pd(&amp;A[(i*N+k)*2+2]);
        A01i = _mm_loaddup_pd(&amp;A[(i*N+k)*2+3]);
        A10r = _mm_loaddup_pd(&amp;A[((i+1)*N+k)*2+0]);
        A10i = _mm_loaddup_pd(&amp;A[((i+1)*N+k)*2+1]);
        A11r = _mm_loaddup_pd(&amp;A[((i+1)*N+k)*2+2]);
        A11i = _mm_loaddup_pd(&amp;A[((i+1)*N+k)*2+3]);
        for(j=0;j&lt;N;j+=2){
            double out[8] = {0,0,0,0,0,0,0,0};
            op00=op01=op10=op11=_mm_setzero_pd();

            B00 = _mm_loadu_pd(&amp;B[(k*N+j)*2]);
            B00_r = _mm_shuffle_pd(B00,B00,1);
            B01 = _mm_loadu_pd(&amp;B[(k*N+j+1)*2]);
            B01_r = _mm_shuffle_pd(B01,B01,1);
            /* Perform A[i][k]*B[k][j], A[i][k]*B[k][j+1], A[i+1][k]*B[k][j],
               A[i+1][k]*B[k][j+1] and assign it to op00,op01,op10,op11
               respectively -&gt; takes 8 _mm_mul_pd() and 4 _mm_addsub_pd() */
            T1 = _mm_mul_pd(A00r,B00);
            T2 = _mm_mul_pd(A00i,B00_r);
            op00 = _mm_addsub_pd(T1,T2);
            T1 = _mm_mul_pd(A00r,B01);
            T2 = _mm_mul_pd(A00i,B01_r);
            op01 = _mm_addsub_pd(T1,T2);
            T1 = _mm_mul_pd(A10r,B00);
            T2 = _mm_mul_pd(A10i,B00_r);
            op10 = _mm_addsub_pd(T1,T2);
            T1 = _mm_mul_pd(A10r,B01);
            T2 = _mm_mul_pd(A10i,B01_r);
            op11 = _mm_addsub_pd(T1,T2);

            B10 = _mm_loadu_pd(&amp;B[((k+1)*N+j)*2]);
            B10_r = _mm_shuffle_pd(B10,B10,1);
            B11 = _mm_loadu_pd(&amp;B[((k+1)*N+j+1)*2]);
            B11_r = _mm_shuffle_pd(B11,B11,1);
            /* Perform A[i][k+1]*B[k+1][j], A[i][k+1]*B[k+1][j+1],
               A[i+1][k+1]*B[k+1][j], A[i+1][k+1]*B[k+1][j+1] and add it to
               op00,op01,op10,op11 respectively -&gt; takes 8 _mm_mul_pd(),
               4 _mm_add_pd(), 4 _mm_addsub_pd() */
            T1 = _mm_mul_pd(A01r,B10);
            T2 = _mm_mul_pd(A01i,B10_r);
            op00 = _mm_add_pd(op00,_mm_addsub_pd(T1,T2));
            T1 = _mm_mul_pd(A01r,B11);
            T2 = _mm_mul_pd(A01i,B11_r);
            op01 = _mm_add_pd(op01,_mm_addsub_pd(T1,T2));
            T1 = _mm_mul_pd(A11r,B10);
            T2 = _mm_mul_pd(A11i,B10_r);
            op10 = _mm_add_pd(op10,_mm_addsub_pd(T1,T2));
            T1 = _mm_mul_pd(A11r,B11);
            T2 = _mm_mul_pd(A11i,B11_r);
            op11 = _mm_add_pd(op11,_mm_addsub_pd(T1,T2));

            /* Store op00,op01,op10,op11 into out[0],out[2],out[4] and out[6] -&gt; 4 stores */
            _mm_storeu_pd(&amp;out[0],op00);
            _mm_storeu_pd(&amp;out[2],op01);
            _mm_storeu_pd(&amp;out[4],op10);
            _mm_storeu_pd(&amp;out[6],op11);

            /* Perform the following 8 operations */
            C[(i*N+j)*2+0] += out[0];
            C[(i*N+j)*2+1] += out[1];
            . . .
            C[((i+1)*N+j)*2+3] += out[7];
        }
    }
}
</code></pre>
<p>The L1 cache is 32 KB, so I used cache blocking too (a tile size of 16*16, which gives a working set of 12 KB (3*2^4*2^4*2^3*2 bytes)), but it didn't help much. I'm only getting about 50% of the theoretical peak performance. Any pointers on how I could improve this?</p>
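When tuning a kernel like this, it can help to isolate the complex multiply described above and compare it against a plain triple loop, so that any change to the blocked version can be checked for correctness. The following is a minimal sketch under stated assumptions, not the poster's code: the names `zmul` and `zgemm_ikj` are hypothetical, the matrices are assumed to be square, row-major, with interleaved real and imaginary parts, and the `#ifdef __SSE3__` guard with a scalar fallback is an assumption added so the file still builds when SSE3 is not enabled (for example, without `-msse3` on GCC or Clang).

```c
#include <stddef.h>

#ifdef __SSE3__
#include <pmmintrin.h> /* SSE3: _mm_loaddup_pd, _mm_addsub_pd */
#endif

/* out = m * in for complex doubles stored as {real, imag}.
 * zmul is a hypothetical helper name, not taken from the post. */
void zmul(const double m[2], const double in[2], double out[2])
{
#ifdef __SSE3__
    __m128d ar = _mm_loaddup_pd(&m[0]);        /* |a|a|             */
    __m128d ai = _mm_loaddup_pd(&m[1]);        /* |b|b|             */
    __m128d x  = _mm_loadu_pd(in);             /* |d|c| (high|low)  */
    __m128d xs = _mm_shuffle_pd(x, x, 1);      /* |c|d|             */
    __m128d t1 = _mm_mul_pd(ar, x);            /* |a*d|a*c|         */
    __m128d t2 = _mm_mul_pd(ai, xs);           /* |b*c|b*d|         */
    /* addsub subtracts in the low lane and adds in the high lane */
    _mm_storeu_pd(out, _mm_addsub_pd(t1, t2)); /* |ad+bc|ac-bd|     */
#else
    /* Scalar fallback (assumption: used when SSE3 is not enabled). */
    const double a = m[0], b = m[1], c = in[0], d = in[1];
    out[0] = a * c - b * d;
    out[1] = a * d + b * c;
#endif
}

/* Reference C += A * B for n*n complex matrices, loop order ikj.
 * Element (r, c) of a matrix X lives at X[(r*n + c)*2 + {0,1}]. */
void zgemm_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            for (size_t j = 0; j < n; ++j) {
                double p[2];
                zmul(&A[(i * n + k) * 2], &B[(k * n + j) * 2], p);
                C[(i * n + j) * 2 + 0] += p[0];
                C[(i * n + j) * 2 + 1] += p[1];
            }
        }
    }
}
```

Running the blocked kernel and `zgemm_ikj` on the same inputs and comparing the two `C` arrays element by element gives a quick correctness check; the reference loop is not meant to be fast.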