Understanding memory usage in CUDA
<p>I have an <strong><em>NVIDIA GTX 570</em></strong> graphics card running on an <strong><em>Ubuntu 10.10 system</em></strong> with <strong><em>CUDA 4.0</em></strong>.</p>
<p>I know that for performance we need to access memory efficiently and make careful use of <strong>register</strong> and <strong>shared</strong> memory on the device.</p>
<p>However, I don't understand how to calculate the number of registers available per thread, how much shared memory a single block can use, and other such simple but important quantities for a particular kernel configuration.</p>
<p>I want to understand this through an <strong><em>explicit</em></strong> example. I am currently trying to write a particle code in which one of the kernels should look like this:</p>
<p>Each block is a <strong><em>1-D</em></strong> collection of threads, and the grid is a <strong><em>1-D</em></strong> collection of blocks.</p>
<ul>
<li>Number of blocks: <strong><em>16384</em></strong></li>
<li>Number of threads per block: <strong><em>32</em></strong> (=> total threads 32*16384 = <strong><em>524288</em></strong>)</li>
<li><strong>Each</strong> thread block is given a <strong><em>32 x 32</em></strong> two-dimensional integer array in shared memory to work with.</li>
</ul>
<p>Within a thread I would like to store some numbers of type <code>double</code>, but I am not sure how many such <code>double</code> values I can store without registers spilling into local memory (which resides in off-chip device memory). Can someone tell me how many doubles can be stored per thread for this kernel configuration?</p>
<p>Also, is the shared-memory configuration described above valid for each of my blocks?</p>
<p>A sample computation showing how one would go about deducing these things would be very illustrative and helpful.</p>
<p>Here is the information about my GTX 570 (from deviceQuery in the CUDA SDK):</p>
<pre><code>[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 570"
  CUDA Driver Version / Runtime Version          4.0 / 4.0
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 1279 MBytes (1341325312 bytes)
  (15) Multiprocessors x (32) CUDA Cores/MP:     480 CUDA Cores
  GPU Clock Speed:                               1.46 GHz
  Memory Clock rate:                             1900.00 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 655360 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0
  Compute Mode:
     &lt; Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) &gt;

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED

Press ENTER to exit...
</code></pre>
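<p>As a rough sketch of the kind of sample computation being asked for, the host-side C++ below plugs the limits from the deviceQuery output into the kernel configuration described above. All constant and function names are hypothetical. The cap of 63 registers per thread does not appear in the deviceQuery output: it is an assumption taken from the documented limit for compute capability 2.x. The result is only an upper bound, since the compiler also needs registers for indices, addresses, and temporaries.</p>

```cpp
#include <cassert>
#include <cstddef>

// Limits copied from the deviceQuery output (GTX 570, compute capability 2.0).
constexpr std::size_t kSharedMemPerBlockBytes = 49152;
constexpr int kRegistersPerBlock = 32768;   // 32-bit registers
constexpr int kMaxThreadsPerBlock = 1024;

// ASSUMPTION (not in the deviceQuery output): compute capability 2.x
// allows at most 63 32-bit registers per thread.
constexpr int kMaxRegistersPerThread = 63;

// Kernel configuration from the question (hypothetical names).
constexpr int kThreadsPerBlock = 32;
constexpr int kTileDim = 32;                // 32 x 32 shared array
constexpr std::size_t kIntBytes = 4;        // size of int on the device

// Shared memory one block needs for its 32 x 32 int array.
constexpr std::size_t sharedBytesPerBlock() {
    return static_cast<std::size_t>(kTileDim) * kTileDim * kIntBytes;
}

// Registers one thread could get at most: its share of the per-block
// register file, capped by the per-thread hardware limit.
constexpr int registerBudgetPerThread() {
    return (kRegistersPerBlock / kThreadsPerBlock < kMaxRegistersPerThread)
               ? kRegistersPerBlock / kThreadsPerBlock
               : kMaxRegistersPerThread;
}

// A double occupies two 32-bit registers, so this is a loose ceiling on
// how many doubles could live in registers, not a guarantee.
constexpr int maxDoublesPerThread() {
    return registerBudgetPerThread() / 2;
}

// Compile-time checks that the configuration respects the block limits.
static_assert(sharedBytesPerBlock() <= kSharedMemPerBlockBytes,
              "shared array does not fit in one block's shared memory");
static_assert(kThreadsPerBlock <= kMaxThreadsPerBlock,
              "too many threads per block");
```

<p>Under these assumptions the shared array takes 32 x 32 x 4 = 4096 bytes, well under the 49152-byte limit, so the shared-memory configuration is valid. The per-block share of registers (32768 / 32 = 1024) exceeds the assumed per-thread cap, so the cap of 63 applies and at most 31 doubles could stay in registers. The real figure will be lower; compiling with <code>nvcc --ptxas-options=-v</code> reports the actual register and local-memory usage of each kernel.</p>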