StackOverflow2013

plurals

POUnderstanding memory usage in CUDA
primarykey
Id
12380138
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2012-09-12T01:36:19.847
FavoriteCount
2
LastActivityDate
2014-11-21T05:11:31.133
LastEditDate
2014-11-21T05:11:31.133
LastEditorUserId
2770850
OwnerUserId
1037059
ParentId
0
PostTypeId
1
Score
3
ViewCount
3006
LastEditorDisplayName
text
Body
I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0. I know that for performance we need to access memory efficiently and make good use of registers and shared memory on the device. However, I don't understand how to calculate the number of registers available per thread, how much shared memory a single block can use, or other such simple but important quantities for a particular kernel configuration.
I want to understand this through an explicit example. I am currently trying to write a particle code in which one of the kernels should look like this. Each block is a 1-D collection of threads, and the grid is a 1-D collection of blocks. <ul> <li>Number of blocks: 16384</li> <li>Number of threads per block: 32 (=> total threads 32*16384 = 524288)</li> <li>Each thread block is given a 32 x 32 two-dimensional integer array in shared memory to work with.</li> </ul>
Within a thread I would like to store some numbers of type <code>double</code>, but I am not sure how many such <code>double</code> values I can store without registers spilling into local memory (which is on the device). Can someone tell me how many doubles can be stored per thread for this kernel configuration? Also, is the shared-memory configuration described above valid for each of my blocks? A sample computation showing how one would go about deducing these things would be very illustrative and helpful.
Here is the information about my GTX 570 (from deviceQuery in the CUDA SDK): <pre><code>[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 570"
  CUDA Driver Version / Runtime Version 4.0 / 4.0
  CUDA Capability Major/Minor version number: 2.0
  Total amount of global memory: 1279 MBytes (1341325312 bytes)
  (15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
  GPU Clock Speed: 1.46 GHz
  Memory Clock rate: 1900.00 Mhz
  Memory Bus Width: 320-bit
  L2 Cache Size: 655360 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and execution: Yes with 1 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Concurrent kernel execution: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support enabled: No
  Device is using TCC driver mode: No
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED
Press ENTER to exit...
</code></pre>
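The kind of sample computation asked for can be started from the deviceQuery figures alone. The following is a minimal host-side sketch in plain C++, not a definitive answer. It assumes a 4-byte `int`, and it assumes the compute capability 2.x limit of 63 32-bit registers per thread, which does not appear in the deviceQuery output. The function and constant names are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Figures copied from the deviceQuery output in the question.
constexpr std::size_t kSharedBytesPerBlock = 49152;  // shared memory per block
constexpr std::size_t kRegistersPerBlock   = 32768;  // registers available per block

// Assumption (not in the deviceQuery output): compute capability 2.x
// caps each thread at 63 32-bit registers.
constexpr std::size_t kMaxRegistersPerThread = 63;

// Bytes of shared memory needed by a rows x cols array of elem_size-byte elements.
constexpr std::size_t shared_bytes(std::size_t rows, std::size_t cols,
                                   std::size_t elem_size) {
    return rows * cols * elem_size;
}

// Registers each thread could get if the block's registers were split evenly,
// clamped to the assumed per-thread cap.
constexpr std::size_t registers_per_thread(std::size_t threads_per_block) {
    return (kRegistersPerBlock / threads_per_block < kMaxRegistersPerThread)
               ? kRegistersPerBlock / threads_per_block
               : kMaxRegistersPerThread;
}

// Upper bound on doubles held in registers: one double occupies two
// 32-bit registers.
constexpr std::size_t max_doubles_in_registers(std::size_t threads_per_block) {
    return registers_per_thread(threads_per_block) / 2;
}

// Checks for the configuration in the question: 32 threads per block,
// one 32 x 32 int array of shared memory per block.
void check_question_configuration() {
    // 32 * 32 * 4 = 4096 bytes, which fits within 49152 bytes.
    assert(shared_bytes(32, 32, 4) == 4096);
    assert(shared_bytes(32, 32, 4) <= kSharedBytesPerBlock);
    // 32768 / 32 = 1024, clamped to the assumed cap of 63 registers.
    assert(registers_per_thread(32) == 63);
    // 63 / 2 = 31 doubles at the very most.
    assert(max_doubles_in_registers(32) == 31);
}
```

Under these assumptions the shared-memory request fits, and 31 doubles per thread is only an upper bound: the compiler also needs registers for indices, addresses, and temporaries, so the real figure is lower. Compiling the actual kernel with `nvcc --ptxas-options=-v` reports its register count and any spills to local memory.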
Tags
<memory><memory-management><cuda>
Title
Understanding memory usage in CUDA
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USNabin
UserOwnerUserId
1. UScuriousexplorer
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POUnderstanding memory usage in CUDA
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POUnderstanding memory usage in CUDA
 UserUserId
 USFacundoGFlores
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
