Nested for-loop debugging with matrix algorithm and constants
<p>This set of nested for loops works correctly for M=64 and N=64, but not when I set M=128 and N=64. I have another program that checks the matrix multiply for correct values. Intuitively it seems like it should still work, but it gives me the wrong answer.</p>
<pre><code>for(int m=64;m&lt;=M;m+=64){
  for(int n=64;n&lt;=N;n+=64){
    for(int i = m-64; i &lt; m; i+=16){
      float *A_column_start, *C_column_start;
      __m128 c_1, c_2, c_3, c_4, a_1, a_2, a_3, a_4, mul_1, mul_2, mul_3, mul_4, b_1;
      int j, k;
      for(j = m-64; j &lt; m; j++){
        //Load 16 contiguous column aligned elements from matrix C in
        //c_1-c_4 registers
        C_column_start = C+i+j*M;
        c_1 = _mm_loadu_ps(C_column_start);
        c_2 = _mm_loadu_ps(C_column_start+4);
        c_3 = _mm_loadu_ps(C_column_start+8);
        c_4 = _mm_loadu_ps(C_column_start+12);

        for (k=n-64; k &lt; n; k+=2){
          //Load 16 contiguous column aligned elements from matrix A to
          //the a_1-a_4 registers
          A_column_start = A+k*M;
          a_1 = _mm_loadu_ps(A_column_start+i);
          a_2 = _mm_loadu_ps(A_column_start+i+4);
          a_3 = _mm_loadu_ps(A_column_start+i+8);
          a_4 = _mm_loadu_ps(A_column_start+i+12);

          //Load a value to register b_1 to act as a "B" (or "A^T")
          //element to multiply against the A matrix
          b_1 = _mm_load1_ps(A_column_start+j);
          mul_1 = _mm_mul_ps(a_1, b_1);
          mul_2 = _mm_mul_ps(a_2, b_1);
          mul_3 = _mm_mul_ps(a_3, b_1);
          mul_4 = _mm_mul_ps(a_4, b_1);

          //Add together all values of the multiplied A and "B"
          //(or "A^T") matrix elements
          c_4 = _mm_add_ps(c_4, mul_4);
          c_3 = _mm_add_ps(c_3, mul_3);
          c_2 = _mm_add_ps(c_2, mul_2);
          c_1 = _mm_add_ps(c_1, mul_1);

          //Move over one column in A, and load the next 16 contiguous
          //column aligned elements from matrix A to the a_1-a_4 registers
          A_column_start+=M;
          a_1 = _mm_loadu_ps(A_column_start+i);
          a_2 = _mm_loadu_ps(A_column_start+i+4);
          a_3 = _mm_loadu_ps(A_column_start+i+8);
          a_4 = _mm_loadu_ps(A_column_start+i+12);

          //Load a value to register b_1 to act as a "B" (or "A^T")
          //element to multiply against the A matrix
          b_1 = _mm_load1_ps(A_column_start+j);
          mul_1 = _mm_mul_ps(a_1, b_1);
          mul_2 = _mm_mul_ps(a_2, b_1);
          mul_3 = _mm_mul_ps(a_3, b_1);
          mul_4 = _mm_mul_ps(a_4, b_1);

          //Add together all values of the multiplied A and "B"
          //(or "A^T") matrix elements
          c_4 = _mm_add_ps(c_4, mul_4);
          c_3 = _mm_add_ps(c_3, mul_3);
          c_2 = _mm_add_ps(c_2, mul_2);
          c_1 = _mm_add_ps(c_1, mul_1);
        }
        //Store the added up C values back to memory
        _mm_storeu_ps(C_column_start, c_1);
        _mm_storeu_ps(C_column_start+4, c_2);
        _mm_storeu_ps(C_column_start+8, c_3);
        _mm_storeu_ps(C_column_start+12, c_4);
      }
    }
  }
}
</code></pre>
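<p>A likely cause, assuming the loops are meant to compute C += A * A^T with A (M x N) and C (M x M) stored column-major, as the indexing C+i+j*M and A+k*M suggests: both i and j are drawn from the same 64-wide block m-64..m-1, so only the diagonal 64x64 blocks of C are ever updated. With M=64 that single block is the whole matrix, which is why that case passes; with M=128 the off-diagonal blocks (rows 0..63 by columns 64..127, and rows 64..127 by columns 0..63) are never written. The sketch below is illustrative only: the names ata_reference, ata_blocked and ata_max_diff are hypothetical, it requires an x86 compiler with SSE, and it drops the 2x unroll of the k loop for brevity. It gives j its own block loop and compares the result against a plain triple loop.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_loadu_ps, _mm_load1_ps, _mm_storeu_ps */

/* Hypothetical scalar reference for C += A * A^T. A is M x N, C is M x M, both
   column-major with leading dimension M (assumed from the question's indexing). */
void ata_reference(int M, int N, const float *A, float *C) {
    for (int k = 0; k < N; k++)
        for (int j = 0; j < M; j++)
            for (int i = 0; i < M; i++)
                C[i + j * M] += A[i + k * M] * A[j + k * M];
}

/* Blocked SSE kernel shaped like the question's loops, except that the column
   index j gets its own 64-wide block loop (jb) instead of reusing the row block m,
   so the off-diagonal 64x64 blocks of C are computed too.
   Assumes M and N are multiples of 64. */
void ata_blocked(int M, int N, const float *A, float *C) {
    assert(M % 64 == 0 && N % 64 == 0);
    for (int m = 64; m <= M; m += 64)          /* row block of C */
        for (int jb = 64; jb <= M; jb += 64)   /* column block of C */
            for (int n = 64; n <= N; n += 64)  /* block of k */
                for (int i = m - 64; i < m; i += 16)
                    for (int j = jb - 64; j < jb; j++) {
                        float *c = C + i + j * M; /* rows i..i+15 of column j of C */
                        __m128 c1 = _mm_loadu_ps(c);
                        __m128 c2 = _mm_loadu_ps(c + 4);
                        __m128 c3 = _mm_loadu_ps(c + 8);
                        __m128 c4 = _mm_loadu_ps(c + 12);
                        for (int k = n - 64; k < n; k++) {
                            const float *a = A + k * M;     /* column k of A */
                            __m128 b = _mm_load1_ps(a + j); /* broadcast A(j, k) */
                            c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_loadu_ps(a + i), b));
                            c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_loadu_ps(a + i + 4), b));
                            c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_loadu_ps(a + i + 8), b));
                            c4 = _mm_add_ps(c4, _mm_mul_ps(_mm_loadu_ps(a + i + 12), b));
                        }
                        _mm_storeu_ps(c, c1);
                        _mm_storeu_ps(c + 4, c2);
                        _mm_storeu_ps(c + 8, c3);
                        _mm_storeu_ps(c + 12, c4);
                    }
}

/* Runs both versions on the same input and returns the largest absolute
   difference. A holds small integers, so every product and partial sum is
   exact in float and the difference should be exactly 0. */
float ata_max_diff(int M, int N) {
    float *A = malloc(sizeof(float) * M * N);
    float *C_ref = calloc((size_t)M * M, sizeof(float));
    float *C_sse = calloc((size_t)M * M, sizeof(float));
    assert(A && C_ref && C_sse);
    for (int t = 0; t < M * N; t++)
        A[t] = (float)(t % 7 - 3); /* values in -3..3 */
    ata_reference(M, N, A, C_ref);
    ata_blocked(M, N, A, C_sse);
    float worst = 0.0f;
    for (int t = 0; t < M * M; t++) {
        float d = C_ref[t] - C_sse[t];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    free(A);
    free(C_ref);
    free(C_sse);
    return worst;
}
```

<p>With this change, ata_max_diff(128, 64) returns 0.0f. An equivalent and simpler fix, under the same assumptions, is to keep the original loop nest and let j run over 0..M-1 directly inside the i loop; the separate jb loop just keeps the 64x64 tiling intact.</p>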