Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>This paper <a href="http://arxiv.org/abs/0805.1598" rel="nofollow noreferrer">"A Simple In-Place Algorithm for In-Shuffle"</a> describes how to transpose matrix of 2*N and gives a hint how to do it for other cases, so 3*N may be also possible. <a href="https://stackoverflow.com/a/23529396/1009831">This answer to other question</a> shows that it is indeed possible.</p> <p>Or use an algorithm which writes each value to its transposed place, then does the same for the value from that place, and so on until cycle is connected. Flag processed values in a bit vector. And continue until this vector is all 1s.</p> <p>Both algorithms are not cache-friendly. Probably some clever use of PREFETCH instruction can improve this.</p> <p><strong>Edit:</strong></p> <p>C++, RGB to single planes, not optimized:</p> <pre><code>#include &lt;iostream&gt; #include &lt;bitset&gt; #include &lt;vector&gt; enum {N = 8}; void transpose(std::vector&lt;char&gt;&amp; a) { std::bitset&lt;3*N&gt; b; for (int i = 1; i &lt; 3*N; ++i) { if (b[i]) continue; int ptr = i; int next; char nextVal = a[i]; do { next = ptr/3 + N*(ptr%3); char prevVal = nextVal; nextVal = a[next]; a[next] = prevVal; ptr = next; b[ptr] = true; } while (ptr != i); } } int main() { std::vector&lt;char&gt; a(3*N); for (int i = 0; i != 3*N; ++i) a[i] = i; transpose(a); for (int i = 0; i != 3*N; ++i) std::cout &lt;&lt; (int)a[i] &lt;&lt; std::endl; return 0; } </code></pre> <p>My original intent is to use a bit vector of size WIDTH<em>HEIGHT, which gives overhead of WIDTH</em>HEIGHT/8. But it is always possible to sacrifice speed for space. The bit vector may be of size WIDTH or HEIGHT or any desirable value, even 0. The trick is to maintain a pointer to the cell, before which all values are transposed. The bit vector is for cells, starting from this pointer. After it is all 1s, It is moved to next position, then all the algorithm steps are performed except actual data movement. And the bit vector is ready to continue transposing. This variant is O(N^2) instead of O(N).</p> <p><strong>Edit2:</strong></p> <p>PREFITCH optimization is not difficult to implement: just calculate indexes, invoke PREFETCH, and put indexes to a queue (ringbuffer), then get indexes from the queue and move data.</p> <p><strong>Edit3:</strong></p> <p>The idea of other algorithm, which is O(1) size, O(N*log(N)) time, is cache-friendly and may be faster than "cycle" algorithm (for image sizes &lt; 1Gb):</p> <ul> <li>Split N*3 matrix to several 3*3 matrices of char and transpose them</li> <li>Split the result to 3*3 matrices of char[3] and transpose them</li> <li>Continue while matrices size is less than the array size</li> <li>Now we have up to 3*2*log3(N) ordered pieces. Join them.</li> <li>First join pieces of equal size. Very simple "cycles" of length 4 may be used.</li> <li>Join unequal-sized pieces with reverse(reverse(a), reverse(b))</li> </ul>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload