
Some basic CUDA enquiries
<p>I am new to CUDA development and decided to start writing small examples in order to understand how it works. Below is a kernel I wrote that computes the squared Euclidean distance between the corresponding rows of two equally sized matrices.</p>

<pre><code>__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i;
    float squareEuclDist = 0; // float, not int: A and B hold floats
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols

    if( r &lt; rows ){ // each thread handles one row
        for ( i = 0; i &lt; cols; i++ ) // squared Euclidean distance of row r
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) *
                              ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
        squareEuclDist = 0;
    }
}
</code></pre>

<p>The launch configuration is computed as</p>

<pre><code>int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock );
// numElements = 1500 x 200 (matrix size) ==&gt; 1172 blocks/grid
</code></pre>

<p>and the kernel is called as</p>

<pre><code>cudaEuclid&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;( d_A, d_B, d_C, rows, cols );
</code></pre>

<p>d_A and d_B are the input matrices, in this example of size 1500 <strong>x</strong> 200.</p>

<p><strong>Question 1</strong>: I have read the basic theory of choosing the number of threads per block and the number of blocks per grid, but something is still missing. I am trying to understand what the optimal launch configuration is for this simple kernel, and I would appreciate a little help in starting to think the CUDA way.</p>

<p><strong>Question 2</strong>: Are there any suggestions on how the code's efficiency could be improved? Can we use <code>int c = blockDim.y * blockIdx.y + threadIdx.y</code> to make things more parallel? Is shared memory applicable here?
</p>

<p>Below, my GPU info is attached.</p>

<pre><code>Device 0: "GeForce 9600 GT"
  CUDA Driver Version / Runtime Version          5.5 / 5.0
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 512 MBytes (536870912 bytes)
  ( 8) Multiprocessors x ( 8) CUDA Cores/MP:     64 CUDA Cores
  GPU Clock rate:                                1680 MHz (1.68 GHz)
  Memory Clock rate:                             700 Mhz
  Memory Bus Width:                              256-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  768
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Concurrent kernel execution:                   No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           1 / 0
</code></pre>

<p><strong>Question 3</strong>: Can the amount of global memory be related to the shared memory and the other memory types the GPU has? Does the number of threads have anything to do with that?</p>

<p><strong>Question 4</strong>: If the maximum number of threads per block is 512, how can the maximum sizes of each dimension of a block be 512 x 512 x 64 (= 16777216 threads)? And what is the correlation with the maximum sizes of each dimension of a grid?</p>

<p><strong>Question 5</strong>: Using the memory clock rate, can we say how many threads are processed per second?</p>

<p><strong>UPDATE</strong>:</p>

<p>The for loop was replaced with column threads:</p>

<pre><code>__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    float x = 0;

    if( c &lt; cols &amp;&amp; r &lt; rows ){
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
    }
    C[r] = x; // note: every thread of row r writes C[r], so the per-column
              // terms are never summed; an accumulation step is still needed
}
</code></pre>

<p>Called with:</p>

<pre><code>int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock );

cudaEuclid&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;( d_A, d_B, d_C, rows, cols );
</code></pre>
 
