The first reduction code you gave should work as long as only one workgroup is working on the reduction (so `get_global_size(0) == get_local_size(0)`). In that case the `size` argument of the kernel would be the number of elements in `A` (which has no real correlation to either the global or the local worksize). While that is a workable solution, it seems inherently wasteful to let most of the GPU idle while doing the reduction, which is precisely why I proposed iteratively calling a reduction kernel. This only needs slight modifications to the code:

```
__kernel void sum(__global const short *A, __global unsigned long *C,
                  uint size, __local unsigned long *L)
{
    unsigned long sum = 0;
    // Each work-item accumulates a grid-strided slice of the input.
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    L[get_local_id(0)] = sum;

    // Tree reduction in local memory (assumes the work-group size
    // is a power of two).
    for (uint c = get_local_size(0) / 2; c > 0; c /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) < c)
            L[get_local_id(0)] += L[get_local_id(0) + c];
    }
    if (get_local_id(0) == 0)
        C[get_group_id(0)] = L[0];
}
```

Calling this with a `GlobalWorkSize` smaller than `size` (e.g. `size/4`) will reduce the input in `A` by a factor of `4*LocalWorkSize`, which can be iterated by using the output buffer as input for the next call to `sum` (with a different output buffer). Well, actually that isn't quite true, since the second (and all following) iterations need `A` to be of type `__global const unsigned long *`, so you will actually need two kernels, but you get the idea (a host-side sketch of the iteration follows below).

Concerning the CUDA reduction sample: why would you bother converting it? It works basically exactly like the OpenCL version I posted above, except that it reduces only by a hardcoded factor per iteration (`2*LocalWorkSize` instead of `size/GlobalWorkSize*LocalWorkSize`).

Personally I use practically the same approach for the reduction, although I have split the kernel into two parts and only use the path through local memory for the last iteration:

```
__kernel void reduction_step(__global const unsigned long *A,
                             __global unsigned long *C, uint size)
{
    unsigned long sum = 0;
    // Grid-strided accumulation; no local memory needed in this step.
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    C[get_global_id(0)] = sum;
}
```

For the final step, the full version which does the reduction inside the work group is used. Of course you would need a second version of `reduction_step` taking `__global const short *`, and this code is an untested adaptation of your code (I can't post my own version, regrettably). The advantage of this approach is the much lower complexity of the kernel doing most of the work, and less wasted work due to divergent branches, which made it a bit faster than the other variant. However, I have no results for either the newest compiler version or the newest hardware, so that point may or may not still hold (though I suspect it does, due to the reduced number of divergent branches).
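To make the iteration concrete, here is a minimal host-side sketch of the driver loop described above. It is an illustration under stated assumptions, not code from my actual implementation: it presumes two already-built kernels, `k_short` (the `sum` kernel above) and `k_ulong` (the same kernel with `__global const unsigned long *A`), picks arbitrary work sizes, and omits all error checking.

```
#include <CL/cl.h>

/* Iteratively reduce a buffer of shorts to a single unsigned long,
 * reusing each pass's output buffer as the next pass's input. */
cl_ulong reduce(cl_context ctx, cl_command_queue q,
                cl_kernel k_short, cl_kernel k_ulong,
                cl_mem input, cl_uint size)
{
    size_t local  = 256;          /* LocalWorkSize                     */
    size_t global = 64 * local;   /* GlobalWorkSize, well below size   */
    cl_mem in = input;
    cl_kernel k = k_short;        /* the first pass reads the shorts   */

    while (size > 1) {
        cl_uint groups = (cl_uint)(global / local); /* outputs per pass */
        cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    groups * sizeof(cl_ulong), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clSetKernelArg(k, 2, sizeof(cl_uint), &size);
        clSetKernelArg(k, 3, local * sizeof(cl_ulong), NULL); /* __local L */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

        if (in != input) clReleaseMemObject(in);
        in   = out;
        k    = k_ulong;           /* partial sums are unsigned long now */
        size = groups;
        if (size <= local)        /* final pass: a single work group    */
            global = local;
    }
    cl_ulong result;
    clEnqueueReadBuffer(q, in, CL_TRUE, 0, sizeof(cl_ulong),
                        &result, 0, NULL, NULL);
    if (in != input) clReleaseMemObject(in);
    return result;
}
```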
Now for the paper you linked to: it is certainly possible to use the optimizations suggested in that paper in OpenCL, except for the use of templates, which OpenCL does not support, so the block sizes would have to be hardcoded (one way to do that is sketched at the end of this answer). Of course the OpenCL version already does multiple adds per kernel and, if you follow the approach I mentioned above, would not really benefit from unrolling the reduction through local memory, since that is only done in the last step, which shouldn't take a significant part of the whole calculation time for a big enough input. Furthermore, I find the lack of synchronization in the unrolled implementation a bit troublesome: it only works because all threads entering that part belong to the same warp. This isn't necessarily true when executing on any hardware other than current NVIDIA cards (future NVIDIA cards, AMD cards and CPUs; although I think it should work on current AMD cards and current CPU implementations, I wouldn't necessarily count on it). So I would stay away from that unless I needed the absolute last bit of speed for the reduction, and even then I would still provide a generic version and fall back to it whenever I don't recognize the hardware.
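For illustration, here is one way the paper's template parameter could be emulated in OpenCL: hardcode the block size as a preprocessor define passed to `clBuildProgram`, rebuilding the program once per supported size. This is a sketch under my own assumptions: the define name `BLOCK_SIZE` is my choice, `#pragma unroll` is a common vendor extension rather than guaranteed by the standard, and the barriers are deliberately kept so the kernel stays correct on hardware where the last active work-items are not a single warp.

```
// reduce.cl -- build with e.g.:
//   clBuildProgram(prog, 1, &dev, "-DBLOCK_SIZE=256", NULL, NULL);
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256   /* must equal the LocalWorkSize used at launch */
#endif

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, 1, 1)))
void sum(__global const short *A, __global unsigned long *C,
         uint size, __local unsigned long *L)
{
    unsigned long sum = 0;
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    L[get_local_id(0)] = sum;

    // BLOCK_SIZE is a compile-time constant, so the compiler sees a fixed
    // trip count and can fully unroll this loop -- the same effect the
    // paper gets from its template parameter, without dropping the barriers.
    #pragma unroll
    for (uint c = BLOCK_SIZE / 2; c > 0; c /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) < c)
            L[get_local_id(0)] += L[get_local_id(0) + c];
    }
    if (get_local_id(0) == 0)
        C[get_group_id(0)] = L[0];
}
```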