Reduction with OpenMP with SSE/AVX

I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:

```cpp
inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp for nowait
        for(int i = 0; i < N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}
```

I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause

However, `atomic` does not support complex operators either, so I replaced `atomic` with `critical` and implemented the reduction with OpenMP and SSE like this:

```cpp
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i = 0; i < ROUND_DOWN(N, 4); i += 4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}
```

However, this function does not perform as well as I hoped. I am using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE load/add a few times, but the result is still less than I expect.

I get much better performance by running over slices of the array, one slice per thread:

```cpp
inline float sum_vector4(const float a[], const size_t N);  // forward declaration

inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads)
    for(int i = 0; i < nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i = 0; i < nthreads; i++) {
        sum += suma[i];
    }
    for(int i = nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;
}

inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
```

Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?
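
For context, below is a minimal sketch of the plain `reduction` clause form that the manual parallel/private/atomic pattern in the first snippet mirrors (scalar float sum only; the function name `sum_scalar_reduction` is illustrative and not from the original post). Under OpenMP 2.0, which is what Visual Studio 2012 ships, the clause accepts only built-in operators on scalar types, so it cannot be applied directly to a `__m128` accumulator; that limitation is what motivates the manual patterns above.

```cpp
#include <cstddef>

// Sketch: the reduction-clause form of the scalar sum. OpenMP gives each
// thread a private copy of 'sum' initialized to 0 and combines the private
// copies with '+' when the loop finishes.
inline float sum_scalar_reduction(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < static_cast<int>(N); i++) {
        sum += a[i];
    }
    return sum;
}
```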