StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:</p> <pre><code>#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; } </code></pre> <p>(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.</p> <p>Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like <code>d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]</code>). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)</p> <p>By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.</p> <p>Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:</p> <pre><code>SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5); SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(0,1); SWAP(1,2); SWAP(0,1); </code></pre> <p>and here is the insertion sort:</p> <pre><code>//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } } //Faster on x86, probably slower on ARM or similar: #define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; } static inline void sort6_insertion_sort_unrolled_v2(int * d){ int t; t = d[1]; ITER(0); t = d[2]; ITER(1); ITER(0); t = d[3]; ITER(2); ITER(1); ITER(0); t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0); t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0); </code></pre> <p>This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the <code>t = d[2]; ITER(1); ITER(0);</code> line in ARM assembly:</p> <pre><code> MOV r6, r2 CMP r6, r1 MOVLT r2, r1 MOVLT r1, r6 CMP r6, r0 MOVLT r1, r0 MOVLT r0, r6 </code></pre> <p>For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload