<p>I wrote four versions, all of which work by swapping bytes. I compiled them using gcc 4.2.1 with <code>-O3 -mssse3</code>, ran each 10 times over 32 MB of random data, and averaged the times.</p>
<p>The first version uses a C loop to convert each pixel separately, using the <code>OSSwapInt32</code> function (which compiles to a <code>bswap</code> instruction with <code>-O3</code>).</p>
<pre><code>void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x &lt; imageSize; x++) {
        /* Reverse the bytes A,R,G,B to B,G,R,A and store all 4 at a 3-byte stride;
           the trailing A byte is overwritten by the next pixel. The final store
           writes 1 byte past the last pixel, so dest needs 1 byte of slack. */
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
    }
}
</code></pre>
<p>The second version performs the same operation, but uses an inline assembly loop instead of a C loop.</p>
<pre><code>void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile (
        "0:\n\t"
        "movl (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl %%eax,(%0)\n\t"
        "addl $4,%1\n\t"
        "addl $3,%0\n\t"
        "decl %2\n\t"
        "jnz 0b"
        /* The loop modifies all three operands, so they are declared read-write. */
        : "+D" (dest), "+S" (orig), "+c" (imageSize)
        :
        : "flags", "eax", "memory"
    );
}
</code></pre>
<p>The third version is a modified version of <a href="https://stackoverflow.com/questions/6804101/fast-method-to-copy-memory-with-translation-argb-to-bgr-2-000-rep-bounty/6804399#6804399">just a poseur's answer</a>.
I converted the built-in functions to the GCC equivalents and used the <code>lddqu</code> built-in function so that the input argument doesn't need to be aligned.</p>
<pre><code>typedef uint8_t v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    /* pshufb control: from each group of 4 input bytes take bytes 3,2,1 (B,G,R);
       0xFF (high bit set) zeroes the last 4 output bytes. */
    v16qi mask = __builtin_ia32_lddqu((const char[]){3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0xFF});
    uint8_t *end = orig + imagesize * 4;
    /* 4 pixels per iteration: 16 bytes in, 12 bytes out. Each 16-byte store
       overlaps the next group, and the last one runs 4 bytes past the output. */
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}
</code></pre>
<p>Finally, the fourth version is the inline assembly equivalent of the third.</p>
<pre><code>void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0xFF};
    asm volatile (
        "lddqu (%3),%%xmm1\n\t"
        "0:\n\t"
        "lddqu (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add $16,%1\n\t"
        "add $12,%0\n\t"
        "sub $4,%2\n\t"
        "jnz 0b"
        /* The loop modifies dest, orig and imagesize, so they are read-write;
           the "memory" clobber covers the reads of mask and the pixel data. */
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "r" (mask)
        : "flags", "xmm0", "xmm1", "memory"
    );
}
</code></pre>
<p>On my 2010 MacBook Pro, 2.4 GHz i5, 4 GB RAM, these were the average times for each:</p>
<pre>
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds
</pre>
<p>As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector versions were only about 1.5 milliseconds faster on 32 MB of data, so falling back to the scalar version won't cost much if you want to support the earliest Intel Macs, which didn't support SSSE3.</p>
<p>Edit: liori asked for standard deviation information.
Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.</p>
<pre>
              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)
</pre>
<p>Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32 MB data set was randomly generated and run through each function. The runtime of each function in microseconds is listed below.</p>
<pre>
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1:   10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2:   10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3:   9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4:   9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
</pre>
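<p>For checking the output of the faster versions, a plain byte-by-byte conversion can serve as a reference. The sketch below is not part of the benchmark above and was not timed. The name <code>argb_to_bgr_ref</code> and its signature are hypothetical, and it assumes the input bytes are laid out as A,R,G,B and the output as B,G,R, which is the layout the byte swap in the versions above implies.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not from the benchmark above).
   Assumes each input pixel is 4 bytes in the order A,R,G,B and each
   output pixel is 3 bytes in the order B,G,R.
   Writes exactly pixels * 3 bytes, so dest needs no extra slack. */
void argb_to_bgr_ref(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    for (size_t i = 0; i < pixels; i++) {
        const uint8_t *src = orig + i * 4;  /* start of input pixel i  */
        uint8_t *dst = dest + i * 3;        /* start of output pixel i */
        dst[0] = src[3];  /* B */
        dst[1] = src[2];  /* G */
        dst[2] = src[1];  /* R */
    }
}
```

<p>One way to use it is to run this routine and one of the fast versions on the same input, then compare the first <code>pixels * 3</code> output bytes with <code>memcmp</code>. The fast versions' output buffers need a few spare bytes at the end, because their last store is wider than 3 bytes.</p>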
 
