The first reduction code you gave should work as long as only one workgroup is working on the reduction (so `get_global_size(0) == get_local_size(0)`). In that case the `size` argument of the kernel would be the number of elements in `A` (which has no real correlation to either the global or the local worksize). While that is a workable solution, it seems inherently wasteful to let most of the GPU idle while doing the reduction, which is precisely why I proposed iteratively calling a reduction kernel. This only needs slight modifications to the code:

```
__kernel void sum(__global const short *A, __global unsigned long *C,
                  uint size, __local unsigned long *L)
{
    unsigned long sum = 0;
    // Each work-item accumulates a grid-strided slice of the input.
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    L[get_local_id(0)] = sum;

    // Tree reduction in local memory (assumes the work-group size
    // is a power of two).
    for (uint c = get_local_size(0) / 2; c > 0; c /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) < c)
            L[get_local_id(0)] += L[get_local_id(0) + c];
    }
    if (get_local_id(0) == 0)
        C[get_group_id(0)] = L[0];
}
```

Calling this with a `GlobalWorkSize` smaller than `size` (e.g. `size/4`) will reduce the input in `A` by a factor of `4*LocalWorkSize`, which can be iterated by using the output buffer as input for the next call to `sum` (with a different output buffer). Well, actually that isn't quite true, since the second (and all following) iterations need `A` to be of type `__global const unsigned long *`, so you will actually need two kernels, but you get the idea (a host-side sketch of the iteration follows below).

Concerning the CUDA reduction sample: why would you bother converting it? It works basically exactly like the OpenCL version I posted above, except that it reduces only by a hardcoded factor per iteration (`2*LocalWorkSize` instead of `size/GlobalWorkSize*LocalWorkSize`).

Personally I use practically the same approach for the reduction, although I have split the kernel into two parts and only use the path through local memory for the last iteration:

```
__kernel void reduction_step(__global const unsigned long *A,
                             __global unsigned long *C, uint size)
{
    unsigned long sum = 0;
    // Grid-strided accumulation; no local memory needed in this step.
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    C[get_global_id(0)] = sum;
}
```

For the final step, the full version which does the reduction inside the work group is used. Of course you would need a second version of `reduction_step` taking `__global const short *`, and this code is an untested adaptation of your code (I can't post my own version, regrettably). The advantage of this approach is the much lower complexity of the kernel doing most of the work, and less wasted work due to divergent branches, which made it a bit faster than the other variant. However, I have no results for either the newest compiler version or the newest hardware, so that point may or may not still hold (though I suspect it does, due to the reduced number of divergent branches).
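To make the iteration concrete, here is a minimal host-side sketch of the driver loop described above. It is an illustration under stated assumptions, not code from my actual implementation: it presumes two already-built kernels, `k_short` (the `sum` kernel above) and `k_ulong` (the same kernel with `__global const unsigned long *A`), picks arbitrary work sizes, and omits all error checking.

```
#include <CL/cl.h>

/* Iteratively reduce a buffer of shorts to a single unsigned long,
 * reusing each pass's output buffer as the next pass's input. */
cl_ulong reduce(cl_context ctx, cl_command_queue q,
                cl_kernel k_short, cl_kernel k_ulong,
                cl_mem input, cl_uint size)
{
    size_t local  = 256;          /* LocalWorkSize                     */
    size_t global = 64 * local;   /* GlobalWorkSize, well below size   */
    cl_mem in = input;
    cl_kernel k = k_short;        /* the first pass reads the shorts   */

    while (size > 1) {
        cl_uint groups = (cl_uint)(global / local); /* outputs per pass */
        cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    groups * sizeof(cl_ulong), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clSetKernelArg(k, 2, sizeof(cl_uint), &size);
        clSetKernelArg(k, 3, local * sizeof(cl_ulong), NULL); /* __local L */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

        if (in != input) clReleaseMemObject(in);
        in   = out;
        k    = k_ulong;           /* partial sums are unsigned long now */
        size = groups;
        if (size <= local)        /* final pass: a single work group    */
            global = local;
    }
    cl_ulong result;
    clEnqueueReadBuffer(q, in, CL_TRUE, 0, sizeof(cl_ulong),
                        &result, 0, NULL, NULL);
    if (in != input) clReleaseMemObject(in);
    return result;
}
```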
Now for the paper you linked to: it is certainly possible to use the optimizations suggested in that paper in OpenCL, except for the use of templates, which OpenCL does not support, so the block sizes would have to be hardcoded (one way to do that is sketched at the end of this answer). Of course the OpenCL version already does multiple adds per kernel and, if you follow the approach I mentioned above, would not really benefit from unrolling the reduction through local memory, since that is only done in the last step, which shouldn't take a significant part of the whole calculation time for a big enough input. Furthermore, I find the lack of synchronization in the unrolled implementation a bit troublesome: it only works because all threads entering that part belong to the same warp. This isn't necessarily true when executing on any hardware other than current NVIDIA cards (future NVIDIA cards, AMD cards and CPUs; although I think it should work on current AMD cards and current CPU implementations, I wouldn't necessarily count on it). So I would stay away from that unless I needed the absolute last bit of speed for the reduction, and even then I would still provide a generic version and fall back to it whenever I don't recognize the hardware.
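For illustration, here is one way the paper's template parameter could be emulated in OpenCL: hardcode the block size as a preprocessor define passed to `clBuildProgram`, rebuilding the program once per supported size. This is a sketch under my own assumptions: the define name `BLOCK_SIZE` is my choice, `#pragma unroll` is a common vendor extension rather than guaranteed by the standard, and the barriers are deliberately kept so the kernel stays correct on hardware where the last active work-items are not a single warp.

```
// reduce.cl -- build with e.g.:
//   clBuildProgram(prog, 1, &dev, "-DBLOCK_SIZE=256", NULL, NULL);
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256   /* must equal the LocalWorkSize used at launch */
#endif

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, 1, 1)))
void sum(__global const short *A, __global unsigned long *C,
         uint size, __local unsigned long *L)
{
    unsigned long sum = 0;
    for (uint i = get_global_id(0); i < size; i += get_global_size(0))
        sum += A[i];
    L[get_local_id(0)] = sum;

    // BLOCK_SIZE is a compile-time constant, so the compiler sees a fixed
    // trip count and can fully unroll this loop -- the same effect the
    // paper gets from its template parameter, without dropping the barriers.
    #pragma unroll
    for (uint c = BLOCK_SIZE / 2; c > 0; c /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) < c)
            L[get_local_id(0)] += L[get_local_id(0) + c];
    }
    if (get_local_id(0) == 0)
        C[get_group_id(0)] = L[0];
}
```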