<p>Taking into account the suggestions offered, I've arrived at this solution. I also fixed a bug in my original implementation that caused a problem with rectangular images.</p>
<p>The suggestion of rotating first by 90 degrees (using an affine transformation and threading that) and then rotating by a smaller angle proved to be slower, simply because it has to iterate over the matrix twice. Of course, many factors come into play here, and memory bandwidth most likely skews the results further. So, for the machine I'm testing and optimizing for, this solution proved to be the best I could offer.</p>
<p>Using 16x16 tiles:</p>
<pre><code>#include &lt;cmath&gt;
#include &lt;malloc.h&gt;      //- _aligned_malloc / _aligned_free (MSVC)
#include &lt;emmintrin.h&gt;   //- SSE/SSE2 intrinsics
#include &lt;tbb/blocked_range2d.h&gt;
#include &lt;tbb/parallel_for.h&gt;

//- fcomplex is the pixel type of the surrounding project; it is not defined here

class DoRotate
{
    const double sina;
    const double cosa;

    const double xHelper;
    const double yHelper;

    const int rowSpan;
    const int colSpan;

    mutable fcomplex *destData;
    const fcomplex *srcData;

    __m128 dupOffsetsX;
    __m128 dupOffsetsY;

public:
    void operator()( const tbb::blocked_range2d&lt;size_t, size_t&gt; &amp;r ) const
    {
        const size_t colBegin = r.cols().begin();
        const size_t colEnd   = r.cols().end();

        for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
        {
            float xOffset = static_cast&lt;float&gt;( -(row * sina) + xHelper + (cosa * colBegin) );
            float yOffset = static_cast&lt;float&gt;(  (row * cosa) + yHelper + (sina * colBegin) );
            const size_t lineOffset = row * colSpan;  //- all col values are offsets of this row

            //- compare with &lt; (not !=): a tile whose width is not a multiple of 4 must still terminate
            for ( size_t col = colBegin; col &lt; colEnd; col += 4, xOffset += (4 * cosa), yOffset += (4 * sina) )
            {
                const __m128 dupXOffset = _mm_load1_ps( &amp;xOffset );  //- duplicate the x offset 4 times into a 4 float field
                const __m128 dupYOffset = _mm_load1_ps( &amp;yOffset );  //- duplicate the y offset 4 times into a 4 float field

                //- source coordinates of 4 neighbouring destination pixels, truncated to int
                const __m128i srcXints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) );
                const __m128i srcYints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) );

                //- copy each lane whose source pixel lies inside the image; never write past the tile
                for ( int lane = 0; lane &lt; 4 &amp;&amp; col + lane &lt; colEnd; ++lane )
                {
                    const int srcX = srcXints.m128i_i32[lane];  //- m128i_i32 is MSVC-specific
                    const int srcY = srcYints.m128i_i32[lane];

                    if ( srcX &gt;= 0 &amp;&amp; srcX &lt; colSpan &amp;&amp; srcY &gt;= 0 &amp;&amp; srcY &lt; rowSpan )
                    {
                        destData[col + lane + lineOffset] = srcData[srcX + ( srcY * colSpan )];
                    }
                }
            }
        }
    }

    DoRotate( const double pass_sina, const double pass_cosa,
              const double pass_xHelper, const double pass_yHelper,
              const int pass_rowSpan, const int pass_colSpan,
              const float *pass_offsetsX, const float *pass_offsetsY,
              fcomplex *pass_destData, const fcomplex *pass_srcData )
        : sina(pass_sina), cosa(pass_cosa),
          xHelper(pass_xHelper), yHelper(pass_yHelper),
          rowSpan(pass_rowSpan), colSpan(pass_colSpan),
          destData(pass_destData), srcData(pass_srcData)
    {
        dupOffsetsX = _mm_load_ps(pass_offsetsX);  //- load the offset X array into one aligned 4 float field
        dupOffsetsY = _mm_load_ps(pass_offsetsY);  //- load the offset Y array into one aligned 4 float field
    }
};
</code></pre>
<p>and then to call the code:</p>
<pre><code>double sina = sin(radians);
double cosa = cos(radians);

double centerX = (colSpan) / 2;
double centerY = (rowSpan) / 2;

//- Adding .5 for rounding to avoid periodicity
const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5;
const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5;

//- per-lane offsets: lane k samples the source k destination pixels further along the row
float *offsetsX = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
offsetsX[0] = 0.0f;
offsetsX[1] = static_cast&lt;float&gt;( cosa );
offsetsX[2] = static_cast&lt;float&gt;( cosa * 2.0 );
offsetsX[3] = static_cast&lt;float&gt;( cosa * 3.0 );

float *offsetsY = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
offsetsY[0] = 0.0f;
offsetsY[1] = static_cast&lt;float&gt;( sina );
offsetsY[2] = static_cast&lt;float&gt;( sina * 2.0 );
offsetsY[3] = static_cast&lt;float&gt;( sina * 3.0 );

//- tiled approach. Works better, but not by much. A little more stays in cache
tbb::parallel_for( tbb::blocked_range2d&lt;size_t, size_t&gt;( 0, rowSpan, 16, 0, colSpan, 16 ),
                   DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan,
                             offsetsX, offsetsY,
                             (fcomplex *)pDestData, (fcomplex *)pSrcData ) );

_aligned_free( offsetsX );
_aligned_free( offsetsY );
</code></pre>
<p>I'm by no means certain this is the best answer, but it is the best I was able to come up with, so I figured I'd pass my solution on to the community.</p>
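<p>For checking the vectorized version against something simpler, the same inverse mapping can be written as a plain scalar loop. The following is only a minimal sketch, not part of the solution above: <code>rotateNearest</code> and its <code>fill</code> parameter are hypothetical names, the image is assumed to be a row-major <code>std::vector</code>, and it uses <code>std::floor</code> where the SSE code truncates toward zero, so results may differ by a pixel along the image border.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not from the original answer): rotates a
// row-major image about its center with nearest-neighbour sampling.
// Destination pixels whose source falls outside the image are set to `fill`.
template <typename T>
std::vector<T> rotateNearest(const std::vector<T>& src, int rowSpan, int colSpan,
                             double radians, T fill = T{})
{
    assert(src.size() == static_cast<std::size_t>(rowSpan) * static_cast<std::size_t>(colSpan));

    const double sina = std::sin(radians);
    const double cosa = std::cos(radians);

    // Integer division, matching the calling code in the answer.
    const double centerX = colSpan / 2;
    const double centerY = rowSpan / 2;

    // Same helper terms as the answer; the +0.5 turns floor() into rounding.
    const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + 0.5;
    const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + 0.5;

    std::vector<T> dest(src.size(), fill);

    for (int row = 0; row < rowSpan; ++row) {
        for (int col = 0; col < colSpan; ++col) {
            // Source coordinates for this destination pixel.
            // Assumption: floor() is used here; the SSE version truncates toward zero.
            const int srcX = static_cast<int>(std::floor((col * cosa) - (row * sina) + xHelper));
            const int srcY = static_cast<int>(std::floor((col * sina) + (row * cosa) + yHelper));

            if (srcX >= 0 && srcX < colSpan && srcY >= 0 && srcY < rowSpan) {
                dest[static_cast<std::size_t>(row) * colSpan + col] =
                    src[static_cast<std::size_t>(srcY) * colSpan + srcX];
            }
        }
    }
    return dest;
}
```

<p>Running this and the tiled SSE version on the same small input, including a non-square one, and comparing the interior pixels is one way to catch indexing mistakes such as the rectangular-image bug mentioned above.</p>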
 
