<p>In CUDA, threads are scheduled on individual SMs as warps. Each warp contains at most 32 threads. The scheduler tries to execute warps in parallel inside each SM; if the data for a particular warp is not ready, the scheduler holds it until the data is available. As far as your question is concerned, I believe it is possible to do what you are trying to achieve using <code>cudaEvent_t</code> (for measuring the execution time of a kernel).</p>

<p>The launch configuration, that is <code>kernel&lt;&lt;&lt;B,Tnum&gt;&gt;&gt;(arg1...argn);</code>, depends entirely on how much parallelism you can exploit in your algorithm. The number of threads is also something you should decide based on the best execution time you get when launching the kernel.</p>

<p>In many cases, launching multiple blocks of <code>128/256</code> threads suffices to achieve optimal speed-up. For example, say we want to add the individual elements of two arrays of size <code>1024</code> into a third array. The kernel function with one block would look something like</p>

<pre><code>__global__ void kadd(int *c, int *a, int *b)
{
    unsigned int tid = threadIdx.x; // one block of 1024 threads suffices
    if (tid &lt; MAXNUM)               // MAXNUM = 1024
        c[tid] = a[tid] + b[tid];
}
</code></pre>

<p>And the launch configuration would be</p>

<pre><code>kadd&lt;&lt;&lt;1,1024&gt;&gt;&gt;(c,a,b);
</code></pre>

<p>This, however, will only execute the block on one of the SMs of your GPU, which means you are not utilizing the GPU resources fully.
To get more from your GPU, you can use multiple blocks and threads. The kernel would look something like</p>

<pre><code>__global__ void kadd(int *c, int *a, int *b)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x; // multiple blocks are used
    if (tid &lt; MAXNUM) // MAXNUM = 1024
        c[tid] = a[tid] + b[tid];
}
</code></pre>

<p>and the corresponding launch configuration would be</p>

<pre><code>kadd&lt;&lt;&lt;8,128&gt;&gt;&gt;(c,a,b);
</code></pre>

<p>This will launch <code>8</code> blocks of <code>128</code> threads each. You can play around with this launch configuration based on your algorithm's requirements, and you can explore it further by launching <code>2D</code> or <code>3D</code> grids to get the most out of your GPU.</p>

<p>Timing the kernel will tell you which configuration suits your requirements best. This will also change with the use of shared memory, coalesced access of global memory, and other factors. Finally, I would like to mention that NVIDIA provides an occupancy calculator that you can use to find the best combination of blocks and threads for achieving higher occupancy.</p>
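The answer mentions <code>cudaEvent_t</code> for measuring kernel execution time; here is a minimal self-contained sketch of how that might look for the <code>kadd</code> kernel above. The use of <code>cudaMallocManaged</code> (unified memory) is my own choice for brevity and assumes a reasonably recent GPU; error checking is omitted.

```cuda
// Sketch: timing the kadd kernel with CUDA events.
// MAXNUM and the kernel come from the answer above; buffer setup and
// unified memory are illustrative assumptions.
#include <cstdio>

#define MAXNUM 1024

__global__ void kadd(int *c, int *a, int *b)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < MAXNUM)
        c[tid] = a[tid] + b[tid];
}

int main()
{
    int *a, *b, *c; // unified-memory buffers, visible to host and device
    cudaMallocManaged(&a, MAXNUM * sizeof(int));
    cudaMallocManaged(&b, MAXNUM * sizeof(int));
    cudaMallocManaged(&c, MAXNUM * sizeof(int));
    for (int i = 0; i < MAXNUM; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);     // mark start on the default stream
    kadd<<<8, 128>>>(c, a, b);  // the launch being timed
    cudaEventRecord(stop);      // mark stop after the launch
    cudaEventSynchronize(stop); // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    printf("kadd took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Timing different <code>&lt;&lt;&lt;blocks, threads&gt;&gt;&gt;</code> combinations this way is the simplest route to the "best execution time" tuning the answer describes.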
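Besides NVIDIA's occupancy calculator spreadsheet mentioned above, the CUDA runtime (since 6.5) exposes the same idea programmatically via <code>cudaOccupancyMaxPotentialBlockSize</code>. A small sketch, reusing the <code>kadd</code> kernel from the answer:

```cuda
// Sketch: asking the runtime for an occupancy-maximizing block size
// instead of hand-tuning it. The kernel is the one from the answer.
#include <cstdio>

#define MAXNUM 1024

__global__ void kadd(int *c, int *a, int *b)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < MAXNUM)
        c[tid] = a[tid] + b[tid];
}

int main()
{
    int minGridSize = 0; // smallest grid size needed to reach full occupancy
    int blockSize   = 0; // suggested number of threads per block
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kadd, 0, 0);

    // Round up so gridSize * blockSize covers all MAXNUM elements.
    int gridSize = (MAXNUM + blockSize - 1) / blockSize;
    printf("suggested launch: %d blocks of %d threads\n", gridSize, blockSize);
    return 0;
}
```

The suggested block size maximizes theoretical occupancy, which is a good starting point but not always the fastest configuration, so it is still worth timing a few alternatives as described above.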