CUDA thread execution in moving data in shared memory
<p>I have the following function:</p>
<p><strong>code example 1:</strong></p>
<pre><code>__global__ void func(const int *input, int N){
    extern __shared__ int buffer[];
    int temp = 0;
    // Each thread starts at its global index and strides by the total thread count.
    for(int i = blockIdx.x*blockDim.x + threadIdx.x; i &lt; N; i += blockDim.x*gridDim.x){
        temp += input[i];
    }
    buffer[threadIdx.x] = temp;
    __syncthreads();
}
</code></pre>
<p>It is part of a parallel reduction function. As far as I understand, it copies data from global to shared memory.</p>
<p>I have tried to understand it with a simple example. Take a 1D array of 20 elements (N=20) and a launch of 5 blocks of 4 threads each. I imagine the execution as follows; correct me if I am wrong.</p>
<p><strong>Execution for all threads of the first block:</strong></p>
<pre><code>// I wrote the sums intuitively
blockIdx.x=0 threadIdx.x=0
for(i=0; i&lt;20; i+=4*5){ temp = input[0] }
buffer[threadIdx.x] = temp

blockIdx.x=0 threadIdx.x=1
for(i=1; i&lt;20; i+=4*5){ temp = input[1] }
buffer[threadIdx.x] = temp

blockIdx.x=0 threadIdx.x=2
for(i=2; i&lt;20; i+=4*5){ temp = input[2] }
buffer[threadIdx.x] = temp

blockIdx.x=0 threadIdx.x=3
for(i=3; i&lt;20; i+=4*5){ temp = input[3] }
buffer[threadIdx.x] = temp
</code></pre>
<p><strong>Execution for all threads of the second block:</strong></p>
<pre><code>// I wrote the sums intuitively
blockIdx.x=1 threadIdx.x=0
for(i=1*4; i&lt;20; i+=4*5){ temp = input[4] }
buffer[threadIdx.x] = temp

blockIdx.x=1 threadIdx.x=1
for(i=1*4+1; i&lt;20; i+=4*5){ temp = input[5] }
buffer[threadIdx.x] = temp

blockIdx.x=1 threadIdx.x=2
for(i=1*4+2; i&lt;20; i+=4*5){ temp = input[6] }
buffer[threadIdx.x] = temp

blockIdx.x=1 threadIdx.x=3
for(i=1*4+3; i&lt;20; i+=4*5){ temp = input[7] }
buffer[threadIdx.x] = temp
</code></pre>
<p><strong>etc.</strong></p>
<p>Why do we have a for loop instead of just writing:</p>
<p><strong>code example 2:</strong></p>
<pre><code>unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
buffer[threadIdx.x] = input[i];
</code></pre>
<p>Can someone give an intuitive example or explanation?</p>
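<p>One way to check the hand trace is to emulate the loop's index pattern on the host. The sketch below is plain C++, not device code, and the helper names <code>visited_indices</code> and <code>thread_partial_sum</code> are hypothetical. It assumes the same start index and stride as code example 1. With 5 blocks of 4 threads and N=20, every thread visits exactly one element, so the loop body runs once. With a smaller launch, such as 2 blocks of 4 threads, each thread visits several elements and the loop accumulates them before the single write to <code>buffer</code>.</p>

```cpp
#include <cassert>
#include <vector>

// Host-side emulation (plain C++, not device code) of the index pattern in
// the kernel's loop. All names here are hypothetical illustrations.
std::vector<int> visited_indices(int block_idx, int thread_idx,
                                 int block_dim, int grid_dim, int n) {
    assert(block_dim > 0 && grid_dim > 0);
    std::vector<int> visited;
    // Same start index and stride as the for loop in code example 1.
    for (int i = block_idx * block_dim + thread_idx; i < n;
         i += block_dim * grid_dim) {
        visited.push_back(i);
    }
    return visited;
}

// Value that one emulated thread would store in buffer[threadIdx.x].
int thread_partial_sum(const std::vector<int>& input, int block_idx,
                       int thread_idx, int block_dim, int grid_dim) {
    int temp = 0;
    const int n = static_cast<int>(input.size());
    for (int i : visited_indices(block_idx, thread_idx, block_dim, grid_dim, n)) {
        temp += input[i];
    }
    return temp;
}
```

<p>For instance, <code>visited_indices(1, 2, 4, 5, 20)</code> yields only index 6, matching the trace for the second block, while <code>visited_indices(0, 0, 4, 2, 20)</code> yields 0, 8 and 16. Under these assumptions, the loop form covers the whole array even when the launch has fewer threads than elements, whereas code example 2 would read only the first <code>blockDim.x*gridDim.x</code> elements.</p>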
 
