The operation you need to perform a global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. There is a reduction kernel and a very good paper discussing the technique in the CUDA SDK; it is worth reading.

If I were writing code to do what you want, it would probably look like this:

```cuda
template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size,
                          float *result, int *count, const float error)
{
    __shared__ volatile float buff[blocksize];

    int index  = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    // Each thread processes several entries and keeps a thread-local total
    // of the elements whose difference is below the error threshold.
    int localcount = 0;
    for (int i = index; i < size; i += stride) {
        float val = orig[i] - modified[i];
        localcount += (val < error);
        result[i] = val;
    }

    buff[threadIdx.x] = localcount;
    __syncthreads();

    // Parallel reduction in shared memory using one warp.
    if (threadIdx.x < warpSize) {
        // Fold the per-thread subtotals down to warpSize values.
        for (int i = threadIdx.x + warpSize; i < blocksize; i += warpSize)
            buff[threadIdx.x] += buff[i];

        // Tree-like reduction of the remaining 32 values within the warp.
        if (threadIdx.x < 16) buff[threadIdx.x] += buff[threadIdx.x + 16];
        if (threadIdx.x < 8)  buff[threadIdx.x] += buff[threadIdx.x + 8];
        if (threadIdx.x < 4)  buff[threadIdx.x] += buff[threadIdx.x + 4];
        if (threadIdx.x < 2)  buff[threadIdx.x] += buff[threadIdx.x + 2];
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}
```

The first stanza does what your serial code does: it computes a difference and a *thread-local* total of the elements which are less than the error. Note that I have written this version so that each thread processes more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows; the idea is that you would use fewer blocks and threads than there are entries in the input data set.

The second stanza is the reduction itself, done in shared memory. It is effectively a "tree-like" operation: the set of thread-local subtotals within a single block of threads is first summed down to 32 subtotals, then those subtotals are combined until there is one final subtotal, which is stored as the total for the *block*. You will wind up with a small list of subtotals in `count`, one for each block you launched, which can be copied back to the host, where the final result you need is calculated.

Please note I coded this in the browser and haven't compiled it, so there might be errors, but it should give an idea of how an "advanced" version of what you are trying to do would work.
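For completeness, below is a minimal host-side sketch of how such a kernel might be driven. It is not from the original answer: the array names, problem size, block and grid dimensions, and error threshold are placeholder assumptions chosen for illustration, and it assumes the kernel above is defined in the same `.cu` file. It launches fewer blocks than there are input elements, copies the per-block subtotals back, and performs the final summation on the host, as described above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int   size      = 1 << 20;  // number of input elements (placeholder)
    const int   blocksize = 256;      // threads per block; matches the template argument
    const int   nblocks   = 64;       // deliberately fewer blocks than elements
    const float error     = 1e-3f;    // placeholder threshold

    // Placeholder input data; in real code these would hold your two images/arrays.
    std::vector<float> h_orig(size, 1.0f), h_modified(size, 1.0f);

    float *d_orig, *d_modified, *d_result;
    int   *d_count;
    cudaMalloc(&d_orig,     size    * sizeof(float));
    cudaMalloc(&d_modified, size    * sizeof(float));
    cudaMalloc(&d_result,   size    * sizeof(float));
    cudaMalloc(&d_count,    nblocks * sizeof(int));
    cudaMemcpy(d_orig,     h_orig.data(),     size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_modified, h_modified.data(), size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel: each block writes one subtotal into d_count.
    calcRatio<blocksize><<<nblocks, blocksize>>>(d_orig, d_modified, size,
                                                 d_result, d_count, error);

    // Copy the per-block subtotals back and finish the reduction on the host.
    std::vector<int> h_count(nblocks);
    cudaMemcpy(h_count.data(), d_count, nblocks * sizeof(int), cudaMemcpyDeviceToHost);

    int total = 0;
    for (int i = 0; i < nblocks; ++i)
        total += h_count[i];

    printf("%d of %d differences are below the error threshold\n", total, size);

    cudaFree(d_orig); cudaFree(d_modified); cudaFree(d_result); cudaFree(d_count);
    return 0;
}
```

Error checking on the CUDA API calls is omitted for brevity; in real code you would check the return value of each `cudaMalloc`, `cudaMemcpy`, and the kernel launch.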