Note that there are some explanatory texts on larger screens.

plurals
  1. POOptimized Matrix Rotation - arbitrary angle about center of matrix
    primarykey
    data
    text
    <p>I'm trying to optimize the rotation of very large images, smallest is 4096x4096 or ~16 million pixels.</p> <p>Rotation is always about the center of image, and images are not necessarily always square but will always be a power of 2.</p> <p>I have access to MKL/TBB, where MKL is an optimized BLAS for my target platforms. I'm unfamiliar if this operation is in BLAS at all.</p> <p>My best attempts so far are around 17-25ms (very inconsistent for the same image size, which means I'm probably stomping all over cache) for 4096x4096 images. Matrices are 16 byte aligned.</p> <p>Now, the destination cannot be resized. So, clipping should and can occur. For instance, a square matrix rotated at 45 degrees will certainly clip at the corners and the value at that location should be zero.</p> <p>Right now, my best attempts use a tiled approach - no elegance is yet being put into tile sizes nor loop unrolling.</p> <p>Here is my algorithm as it stands using TBB - <a href="http://threadingbuildingblocks.org/" rel="nofollow noreferrer">http://threadingbuildingblocks.org/</a>:</p> <pre><code>//- cosa = cos of the angle //- sina = sin of angle //- for those unfamiliar with TBB, this is giving me blocks of rows or cols that //- are unique per thread void operator() ( const tbb::blocked_range2d&lt;size_t, size_t&gt; r ) const { double xOffset; double yOffset; int lineOffset; int srcX; int srcY; for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row ) { const size_t colBegin = r.cols().begin(); xOffset = -(row * sina) + xHelper + (cosa * colBegin); yOffset = (row * cosa) + yHelper + (sina * colBegin); lineOffset = ( row * rowSpan ); //- all col values are offsets of this row for( size_t col = colBegin; col != r.cols().end(); ++col, xOffset += cosa, yOffset += sina ) { srcX = xOffset; srcY = yOffset; if( srcX &gt;= 0 &amp;&amp; srcX &lt; colSpan &amp;&amp; srcY &gt;= 0 &amp;&amp; srcY &lt; rowSpan ) { destData[col + lineOffset] = srcData[srcX + ( srcY * rowSpan )]; } } } } </code></pre> <p>I make a call to this function as such:</p> <pre><code>double sina = sin(angle); double cosa = cos(angle); double centerX = (colSpan) / 2; double centerY = (rowSpan) / 2; //- Adding .5 for rounding const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5; const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5; tbb::parallel_for( tbb::blocked_range2d&lt;size_t, size_t&gt;( 0, rowSpan, 0, colSpan ), DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan, (fcomplex *)pDestData, (fcomplex *)pSrcData ) ); </code></pre> <p>fcomplex is just in house representation of complex numbers. It is defined as:</p> <pre><code>struct fcomplex { float real; float imag; }; </code></pre> <p>So, I want to do rotation of a matrix of complex values about it's center at an arbitrary angle for very large images as fast as possible.</p> <p>Update:</p> <p>Based on wonderful feedback, I've updated to this: Which is about a 40% increase. I'm wondering though if anything else can be done:</p> <pre><code>void operator() ( const tbb::blocked_range2d&lt;size_t, size_t&gt; r ) const { float xOffset; float yOffset; int lineOffset; __m128i srcXints; __m128i srcYints; __m128 dupXOffset; __m128 dupYOffset; for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row ) { const size_t colBegin = r.cols().begin(); xOffset = -(row * sina) + xHelper + (cosa * colBegin); yOffset = (row * cosa) + yHelper + (sina * colBegin); lineOffset = ( row * rowSpan ); //- all col values are offsets of this row for( size_t col = colBegin; col != r.cols().end(); col+=4, xOffset += dupOffsetsX.m128_f32[3], yOffset += dupOffsetsY.m128_f32[3] ) { dupXOffset = _mm_load1_ps(&amp;xOffset); //- duplicate the x offset 4 times into a 4 float field dupYOffset = _mm_load1_ps(&amp;yOffset); //- duplicate the y offset 4 times into a 4 float field srcXints = _mm_cvtps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) ); srcYints = _mm_cvtps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) ); if( srcXints.m128i_i32[0] &gt;= 0 &amp;&amp; srcXints.m128i_i32[0] &lt; colSpan &amp;&amp; srcYints.m128i_i32[0] &gt;= 0 &amp;&amp; srcYints.m128i_i32[0] &lt; rowSpan ) { destData[col + lineOffset] = srcData[srcXints.m128i_i32[0] + ( srcYints.m128i_i32[0] * rowSpan )]; } if( srcXints.m128i_i32[1] &gt;= 0 &amp;&amp; srcXints.m128i_i32[1] &lt; colSpan &amp;&amp; srcYints.m128i_i32[1] &gt;= 0 &amp;&amp; srcYints.m128i_i32[1] &lt; rowSpan ) { destData[col + 1 + lineOffset] = srcData[srcXints.m128i_i32[1] + ( srcYints.m128i_i32[1] * rowSpan )]; } if( srcXints.m128i_i32[2] &gt;= 0 &amp;&amp; srcXints.m128i_i32[2] &lt; colSpan &amp;&amp; srcYints.m128i_i32[2] &gt;= 0 &amp;&amp; srcYints.m128i_i32[2] &lt; rowSpan ) { destData[col + 2 + lineOffset] = srcData[srcXints.m128i_i32[2] + ( srcYints.m128i_i32[2] * rowSpan )]; } if( srcXints.m128i_i32[3] &gt;= 0 &amp;&amp; srcXints.m128i_i32[3] &lt; colSpan &amp;&amp; srcYints.m128i_i32[3] &gt;= 0 &amp;&amp; srcYints.m128i_i32[3] &lt; rowSpan ) { destData[col + 3 + lineOffset] = srcData[srcXints.m128i_i32[3] + ( srcYints.m128i_i32[3] * rowSpan )]; } } } } </code></pre> <p>Update 2: I put a solution below, taking into account suggestions I got as answers as well as fixing a bug when rotating rectangles.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload