<p>The problem here seems to be in setting up threads and blocks and in using <em>threadIdx</em>, <em>blockDim</em> and <em>gridDim</em>.</p>
<p><strong>NOTE: for the practical solution to this particular problem, skip to the label Practical solution.</strong></p>
<p><em>threadIdx</em> is, as the name says, the ID of the thread within its block. Its <em>threadIdx.x</em> and <em>threadIdx.y</em> components go from 0 to one less than the <strong>threads per block</strong> value, which is stored in <em>blockDim.x</em> and <em>blockDim.y</em>. For example, a call</p>
<pre><code>someKernel&lt;&lt;&lt;1,32&gt;&gt;&gt;( .... );
</code></pre>
<p>would result in <em>threadIdx.x</em> going from 0 to 31, while <em>threadIdx.y</em> would not be iterated at all (it would always be 0).</p>
<p>If, however, you define a CUDA-specific structure <em>dim3</em>, call it <em>threadsPerBlock</em>, and use it as the second launch argument like this:</p>
<pre><code>dim3 threadsPerBlock( 32, 32 );
someKernel&lt;&lt;&lt;1,threadsPerBlock&gt;&gt;&gt;( .... );
</code></pre>
<p>then both <em>threadIdx.x</em> and <em>threadIdx.y</em> would go from 0 to 31, and the kernel would be executed for every combination of the two.</p>
<p>Note that you are restricted to a certain maximum number of threads per block. This number differs between graphics cards, or more precisely, between the compute capabilities they support. Look for these numbers in the table at the end of <a href="http://en.wikipedia.org/wiki/CUDA" rel="nofollow">this link</a>. Compute capability 2.x and up supports a maximum of 1024 threads per block, while earlier versions support 512. Note also that this limit applies to the product of the block dimensions, so a square 2-dimensional block can be at most 32x32 threads.</p>
<p>But what if you need more than that? Well son, then you launch more blocks! You can also launch blocks in 1 or 2 dimensions.
For example:</p>
<pre><code>dim3 threadsPerBlock( 32, 32 );
dim3 blocksPerGrid ( 256, 256 );
someKernel &lt;&lt;&lt;blocksPerGrid,threadsPerBlock&gt;&gt;&gt;( ... );
</code></pre>
<p>The size of the grid is stored in the <em>gridDim</em> structure. In this case both <em>gridDim.x</em> and <em>gridDim.y</em> would be 256, making the <em>blockIdx.x</em> and <em>blockIdx.y</em> variables go from 0 to 255.</p>
<p><strong>Practical solution:</strong></p>
<p>Now that we know this, let's take a look at your code. If you, for example, set <strong>T</strong> to 32 and <strong>B</strong> to 256, you would effectively get this:</p>
<pre><code>threadIdx.x would go from 0 to 31
threadIdx.y would go from 0 to 0
blockIdx.x would go from 0 to 255
blockIdx.y would go from 0 to 0
blockDim.x would be 32
blockDim.y would be 1
gridDim.x would be 256
gridDim.y would be 1
</code></pre>
<p>Now let's see how your variables react to this...</p>
<pre><code>row would go from 0 to 0
col would go from 0 to 1023
</code></pre>
<p>This is presumably not what you want. You want both your row and col to go from 0 to <strong>N-1</strong>, right? Well, this is how you do it:</p>
<pre><code>int row = threadIdx.x + blockIdx.x * blockDim.x;
int col = threadIdx.y + blockIdx.y * blockDim.y;
</code></pre>
<p>Also make sure that you launch enough threads to cover the dimensions of the matrix. That is, make sure that <em>threadsPerBlock * blocksPerGrid</em> is at least <strong>N</strong> in each dimension. This is usually best done this way:</p>
<pre><code>int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;
dim3 blocksPerGrid ( blocks, blocks );
</code></pre>
<p>"But if I make it greater than N, then I might have some threads that I don't need" - say you - "I don't want them to do work!" And wise you are, sir, to say that.
You solve this with a simple if clause that encloses your calculations, like so:</p>
<pre><code>if ( row &lt; N &amp;&amp; col &lt; N )
{
    // your add... err... code here
}
</code></pre>
<p>Hope that helps. Enjoy CUDA ;)</p>
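<p>To tie the pieces together, here is a minimal sketch of how the launch configuration, the index computation and the bounds check might look in one program. The original kernel is not shown in this answer, so the kernel name <em>addKernel</em>, the host helper <em>addMatrices</em>, the <em>float</em> element type and the row-major layout are all assumptions made for illustration. The block count uses a rounding-up division, which is a common variation on the <em>(N / threads) + 1</em> form and avoids one extra row and column of blocks when <strong>N</strong> is a multiple of the block size.</p>

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (not the asker's original): element-wise sum of two
// N x N matrices stored row-major in flat arrays.
__global__ void addKernel(const float* a, const float* b, float* c, int N)
{
    // Global 2D coordinates of this thread across the whole grid.
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;

    // Surplus threads in the last blocks fall outside the matrix and do nothing.
    if (row < N && col < N) {
        int idx = row * N + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host helper: returns a + b, or an empty vector on any failure.
std::vector<float> addMatrices(const std::vector<float>& a,
                               const std::vector<float>& b, int N)
{
    if (N <= 0) return {};
    const size_t count = static_cast<size_t>(N) * N;
    if (a.size() != count || b.size() != count) return {};

    const size_t bytes = count * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    std::vector<float> c(count);

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // 32 x 32 = 1024 threads per block: needs compute capability 2.x or later.
        const int threads = 32;
        // Round up so that threads * blocks >= N in each dimension.
        const int blocks = (N + threads - 1) / threads;
        dim3 threadsPerBlock(threads, threads);
        dim3 blocksPerGrid(blocks, blocks);

        addKernel<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);

        // Check the launch, then copy back (the copy waits for the kernel).
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer performs no operation.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    if (!ok) c.clear();
    return c;
}
```

<p>Calling <em>addMatrices(a, b, 100)</em> with two 100x100 inputs would launch a 4x4 grid of 32x32 blocks, that is 128 threads per dimension, and the bounds check keeps the 28 surplus threads in each dimension from touching memory outside the matrices.</p>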