Extremely slow program from using AVX instructions
<p>I'm trying to write a geometric mean sqrt(a * b) using AVX intrinsics, but it runs slower than molasses!</p>
<pre class="lang-c prettyprint-override"><code>#include &lt;immintrin.h&gt;

/* Not shown in the original snippet: minimal declaration assumed for the result sink. */
__m128i g_data[8];

int main() {
    int count = 0;
    for (int i = 0; i &lt; 100000000; ++i) {
        /* Eight 16-bit lanes per input, each holding a value in 0..15. */
        __m128i v8n_a = _mm_set1_epi16((++count) % 16),
                v8n_b = _mm_set1_epi16((++count) % 16);
        __m128i v8n_0 = _mm_set1_epi16(0);
        __m256i temp1, temp2;
        /* Zero-extend 16 -> 32 bits, join both halves into 256 bits, convert to float. */
        __m256 v8f_a = _mm256_cvtepi32_ps(temp1 = _mm256_insertf128_si256(
                           _mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_a, v8n_0)),
                           _mm_unpackhi_epi16(v8n_a, v8n_0), 1)),
               v8f_b = _mm256_cvtepi32_ps(temp2 = _mm256_insertf128_si256(
                           _mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_b, v8n_0)),
                           _mm_unpackhi_epi16(v8n_b, v8n_0), 1));
        /* sqrt(a * b), converted back to 32-bit integers. */
        __m256i v8n_meanInt32 = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_mul_ps(v8f_a, v8f_b)));
        __m128i v4n_meanLo = _mm256_castsi256_si128(v8n_meanInt32),
                v4n_meanHi = _mm256_extractf128_si256(v8n_meanInt32, 1);
        g_data[i % 8] = v4n_meanLo;
        g_data[(i + 1) % 8] = v4n_meanHi;
    }
    return 0;
}
</code></pre>
<p>The key to this mystery is that I'm using Intel ICC 11, and it's only slow when compiling with <code>icc -O3 sqrt.cpp</code>. If I compile with <code>icc -O3 -xavx sqrt.cpp</code>, then it runs 10x faster.</p>
<p>But it's not obvious that any emulation is happening: I checked with performance counters, and the number of instructions executed is roughly 4G for both versions:</p>
<pre><code> Performance counter stats for 'a.out':

       16867.119538 task-clock                #    0.999 CPUs utilized
                 37 context-switches          #    0.000 M/sec
                  8 CPU-migrations            #    0.000 M/sec
                281 page-faults               #    0.000 M/sec
     35,463,758,996 cycles                    #    2.103 GHz
     23,690,669,417 stalled-cycles-frontend   #   66.80% frontend cycles idle
     20,846,452,415 stalled-cycles-backend    #   58.78% backend cycles idle
      4,023,012,964 instructions              #    0.11  insns per cycle
                                              #    5.89  stalled cycles per insn
        304,385,109 branches                  #   18.046 M/sec
             42,636 branch-misses             #    0.01% of all branches

       16.891160582 seconds time elapsed
</code></pre>
<p>With <code>-xavx</code>:</p>
<pre><code> Performance counter stats for 'a.out':

        1288.423505 task-clock                #    0.996 CPUs utilized
                  3 context-switches          #    0.000 M/sec
                  2 CPU-migrations            #    0.000 M/sec
                279 page-faults               #    0.000 M/sec
      2,708,906,702 cycles                    #    2.102 GHz
      1,608,134,568 stalled-cycles-frontend   #   59.36% frontend cycles idle
        798,177,722 stalled-cycles-backend    #   29.46% backend cycles idle
      3,803,270,546 instructions              #    1.40  insns per cycle
                                              #    0.42  stalled cycles per insn
        300,601,809 branches                  #  233.310 M/sec
             15,167 branch-misses             #    0.01% of all branches

        1.293986790 seconds time elapsed
</code></pre>
<p>Is there some kind of processor-internal emulation going on? I know that for denormal numbers, adds end up being 64 times slower than normal.</p>
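<p>For readers following the intrinsics, here is a minimal sketch of the same per-vector computation split into small steps. The function names <code>widen_u16_to_ps</code> and <code>geomean8_u16</code> are hypothetical, the array-based interface is an assumption made so the result can be checked, and the <code>target("avx")</code> attribute is a GCC/Clang convenience (not ICC syntax) that lets the file build without a global AVX flag. It assumes the default rounding mode (round to nearest) for the float-to-integer conversion and a CPU that supports AVX.</p>

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: zero-extend eight unsigned 16-bit lanes to 32 bits
 * and convert them to eight single-precision floats, using AVX1 only. */
__attribute__((target("avx")))
static inline __m256 widen_u16_to_ps(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi16(v, zero);   /* lanes 0..3 as 32-bit ints */
    __m128i hi = _mm_unpackhi_epi16(v, zero);   /* lanes 4..7 as 32-bit ints */
    __m256i both = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    return _mm256_cvtepi32_ps(both);
}

/* Hypothetical wrapper: out[i] = round(sqrt(a[i] * b[i])) for i in 0..7. */
__attribute__((target("avx")))
void geomean8_u16(const uint16_t a[8], const uint16_t b[8], int32_t out[8])
{
    __m256 fa = widen_u16_to_ps(_mm_loadu_si128((const __m128i *)a));
    __m256 fb = widen_u16_to_ps(_mm_loadu_si128((const __m128i *)b));
    __m256 mean = _mm256_sqrt_ps(_mm256_mul_ps(fa, fb));
    /* Convert back to 32-bit integers using the current rounding mode. */
    _mm256_storeu_si256((__m256i *)out, _mm256_cvtps_epi32(mean));
}
```

<p>Calling <code>geomean8_u16(a, b, out)</code> with two arrays of eight 16-bit values fills <code>out</code> with the rounded geometric means. The unpack-with-zero step mirrors the question's code, where the inputs are small non-negative values.</p>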
 
