Reduction with OpenMP with SSE/AVX

I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:

```cpp
inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp for nowait
        for(int i = 0; i < N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}
```

I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause

However, `atomic` does not support complex operators either, so I replaced `atomic` with `critical` and implemented the reduction with OpenMP and SSE like this:

```cpp
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i = 0; i < ROUND_DOWN(N, 4); i += 4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}
```

However, this function does not perform as well as I hoped. I am using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE load/add a few times, but the result is still less than I expect.

I get much better performance by running over slices of the array, one slice per thread:

```cpp
inline float sum_vector4(const float a[], const size_t N);  // forward declaration

inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads)
    for(int i = 0; i < nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i = 0; i < nthreads; i++) {
        sum += suma[i];
    }
    for(int i = nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;
}

inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
```

Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?
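
For context, below is a minimal sketch of the plain `reduction` clause form that the manual parallel/private/atomic pattern in the first snippet mirrors (scalar float sum only; the function name `sum_scalar_reduction` is illustrative and not from the original post). Under OpenMP 2.0, which is what Visual Studio 2012 ships, the clause accepts only built-in operators on scalar types, so it cannot be applied directly to a `__m128` accumulator; that limitation is what motivates the manual patterns above.

```cpp
#include <cstddef>

// Sketch: the reduction-clause form of the scalar sum. OpenMP gives each
// thread a private copy of 'sum' initialized to 0 and combines the private
// copies with '+' when the loop finishes.
inline float sum_scalar_reduction(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < static_cast<int>(N); i++) {
        sum += a[i];
    }
    return sum;
}
```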