StackOverflow2013

<p>You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (especially once you get into subnormals). In the common cases, however, this should work:</p>
<pre><code>const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;

void unsafe_shl(double* d, int shift)
{
    unsigned long long* i = (unsigned long long*)d;
    // Normal number (exponent field neither all zeros nor all ones):
    // add the shift straight into the exponent field.
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
    } else if (*i) {
        // Subnormal, infinity, NaN, or -0.0: fall back to an ordinary multiply.
        *d *= (1 << shift);
    }
}
</code></pre>
<p>EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped to the minimum executed code:</p>
<pre><code>double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
    ds[i] = 1.2;

clock_t t = clock();

for (int j = 0; j != 1000000; j++)
    for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
        ds[i] *= 1 << 4;
#else
        ((unsigned int*)&ds[i])[1] += 4 << 20;
#endif

clock_t e = clock();

printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
</code></pre>
<p>The DOUBLE_SHIFT version completes in 1.6 seconds, with an inner loop of:</p>
<pre><code>movupd xmm0,xmmword ptr [ecx]
lea    ecx,[ecx+10h]
mulpd  xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
</code></pre>
<p>Versus 2.4 seconds otherwise, with an inner loop of:</p>
<pre><code>add dword ptr [ecx],400000h
lea ecx, [ecx+8]
</code></pre>
<p>Truly unexpected!</p>
<p>EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds, with an inner loop of:</p>
<pre><code>movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc   eax
</code></pre>
<p>VC10 <em>without</em> /arch:SSE2 (even with /arch:SSE) takes 5.3 seconds... <strong>with 1/100th of the iterations!!</strong> Its inner loop:</p>
<pre><code>fld  qword ptr [esp+eax*8+38h]
inc  eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
</code></pre>
<p>I knew the x87 FP stack was awful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case (loading onto the FP stack, doing one op, and storing back from it), but it's a good example of why x87 is not the way to go for anything perf-related.</p>
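<p>For reference, here is a minimal sketch of the same exponent-field trick written without the pointer cast. It copies the bits with <code>memcpy</code> and hands every awkward case (zero, subnormals, infinity, NaN, and results that would leave the normal range) to the standard <code>ldexp</code>. The name <code>scale_pow2</code> and the macros are made up for this illustration, and it assumes <code>double</code> is IEEE 754 binary64. It has not been timed against the variants above.</p>

```c
#include <math.h>    /* ldexp */
#include <stdint.h>  /* uint64_t, int64_t, UINT64_C */
#include <string.h>  /* memcpy */

/* Hypothetical names; assumes double is IEEE 754 binary64. */
#define DOUBLE_EXP_SHIFT 52
#define DOUBLE_EXP_MASK  (UINT64_C(0x7FF) << DOUBLE_EXP_SHIFT)

/* Return d * 2^shift. Normal inputs whose result is also normal are
   handled by rewriting the exponent field; everything else is passed
   to ldexp(), which handles the edge cases correctly. */
double scale_pow2(double d, int shift)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* copy the bits without an aliasing cast */

    int64_t exp = (int64_t)((bits & DOUBLE_EXP_MASK) >> DOUBLE_EXP_SHIFT);
    int64_t new_exp = exp + shift;

    /* exp == 0     : zero or subnormal input
       exp == 0x7FF : infinity or NaN
       new_exp outside 1..0x7FE : result would overflow or become subnormal */
    if (exp == 0 || exp == 0x7FF || new_exp < 1 || new_exp > 0x7FE)
        return ldexp(d, shift);

    bits = (bits & ~DOUBLE_EXP_MASK) | ((uint64_t)new_exp << DOUBLE_EXP_SHIFT);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

<p>Calling <code>scale_pow2(1.2, 4)</code> gives the same value as <code>1.2 * 16</code>, and a negative <code>shift</code> divides by a power of two instead. Given the timings above, a plain multiply that the compiler can vectorize may well still be the faster choice in a tight loop.</p>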
