StackOverflow2013


plurals

PO
primarykey
Id
16159039
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-04-23T00:38:55.637
FavoriteCount
0
LastActivityDate
2013-04-25T20:10:49.320
LastEditDate
2013-04-25T20:10:49.320
LastEditorUserId
1563889
OwnerUserId
1563889
ParentId
16157754
PostTypeId
2
Score
2
ViewCount
0
LastEditorDisplayName
text
Body
So, the problem here seems to be in setting up threads and blocks and in using threadIdx, blockDim and gridDim.

NOTE: the practical solution to this particular problem is under the label "Practical solution" below.

threadIdx is, as the name says, the ID of the thread. More precisely, its threadIdx.x and threadIdx.y components go from 0 to one less than the threads-per-block value, which is stored in blockDim.x and blockDim.y. For example, a call

<pre><code>someKernel<<<1,32>>>( .... );
</code></pre>

would result in threadIdx.x going from 0 to 31, while threadIdx.y would not be iterated at all (it is always 0). If, however, you define a variable of the CUDA-specific type dim3, call it threadsPerBlock, and then use it as the second argument like this:

<pre><code>dim3 threadsPerBlock( 32, 32 );
someKernel<<<1,threadsPerBlock>>>( .... );
</code></pre>

then both threadIdx.x and threadIdx.y go from 0 to 31, and the kernel is executed for every combination of the two.

Note that you are restricted to a certain maximum number of threads per block. This number differs between graphics cards, or more precisely, between the compute capabilities they support. Look for these numbers in the table at the end of <a href="http://en.wikipedia.org/wiki/CUDA" rel="nofollow">this link</a>. Compute capability 2.x and up supports a maximum of 1024 threads per block, while earlier versions support 512. This also means a maximum of 32x32 threads per block when launching a square block in 2 dimensions.

But what if you need more than that? Well son, then you launch more blocks! You can also launch blocks in 1 or 2 dimensions. For example

<pre><code>dim3 threadsPerBlock( 32, 32 );
dim3 blocksPerGrid ( 256, 256 );
someKernel <<<blocksPerGrid,threadsPerBlock>>>( ... );
</code></pre>

The size of the grid is stored in the gridDim structure, and in this case both gridDim.x and gridDim.y would be 256, making the blockIdx.x and blockIdx.y variables go from 0 to 255.

Practical solution:

Now that we know this, let's take a look at your code. If you, for example, set T to be 32 and B to be 256, you would effectively get this:

<pre><code>threadIdx.x would go from 0 to 31
threadIdx.y would go from 0 to 0
blockIdx.x would go from 0 to 255
blockIdx.y would go from 0 to 0
blockDim.x would be 32
blockDim.y would be 1
gridDim.x would be 256
gridDim.y would be 1
</code></pre>

Now let's see how your variables react to this...

<pre><code>row would go from 0 to 0
col would go from 0 to 1023
</code></pre>

So, this is presumably not what you want. You want both your row and col to go from 0 to N-1, right? Well, this is how you do it:

<pre><code>int row = threadIdx.x + blockIdx.x * blockDim.x;
int col = threadIdx.y + blockIdx.y * blockDim.y;
</code></pre>

Also make sure that you have enough threads to cover the dimensions of the matrix. That is, make sure that threadsPerBlock * blocksPerGrid is at least N in each dimension. This is usually best done this way:

<pre><code>int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;
dim3 blocksPerGrid ( blocks, blocks );
</code></pre>

"But if I make it greater than N, then I might have some threads that I don't need" - say you - "I don't want them to do work!" And wise you are, sir, to say that. You solve this with a simple if clause that encloses your calculations, like so:

<pre><code>if ( row < N && col < N ) {
    // your add... err... code here
}
</code></pre>

Hope that helps. Enjoy CUDA ;)
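To tie the pieces above together, here is a minimal, self-contained sketch of a 2D launch with the index computation and the bounds check. It is only an illustration under stated assumptions: the kernel name addMatrices, the matrix size of 1000 and the element-wise addition are hypothetical and are not taken from the question's code. It also uses ceiling division for the block count, a small variation on `(N / threads) + 1` that avoids launching an extra row and column of blocks when N is a multiple of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum of two n x n row-major matrices.
__global__ void addMatrices(const float* a, const float* b, float* c, int n)
{
    // Global 2D coordinates, computed as in the answer above.
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;

    // Threads that fall outside the matrix do no work.
    if (row < n && col < n) {
        int i = row * n + col;
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int n = 1000;  // assumed size, deliberately not a multiple of 32
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);

    std::vector<float> hostA(count, 1.0f), hostB(count, 2.0f), hostC(count, 0.0f);

    float *devA = nullptr, *devB = nullptr, *devC = nullptr;
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMalloc(&devC, bytes);
    cudaMemcpy(devA, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB.data(), bytes, cudaMemcpyHostToDevice);

    // 32 x 32 = 1024 threads per block, which needs compute capability 2.x or newer.
    const int threads = 32;
    const int blocks = (n + threads - 1) / threads;  // ceiling division
    dim3 threadsPerBlock(threads, threads);
    dim3 blocksPerGrid(blocks, blocks);

    addMatrices<<<blocksPerGrid, threadsPerBlock>>>(devA, devB, devC, n);

    // Check both the launch itself and the kernel's execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hostC.data(), devC, bytes, cudaMemcpyDeviceToHost);

    // Every element should be 1.0f + 2.0f.
    size_t mismatches = 0;
    for (float v : hostC) {
        if (v != 3.0f) ++mismatches;
    }
    std::printf("mismatches: %zu\n", mismatches);

    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return mismatches == 0 ? 0 : 1;
}
```

With n = 1000 the grid is 32 x 32 blocks, covering 1024 x 1024 threads, so the if clause is what keeps the surplus 24 rows and columns of threads from writing out of bounds.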
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO Incorrect results for CUDA Matrix Multiplication
 singulars
 PostTypePostTypeId
 PT Question
PostTypePostTypeId
1. PT Answer
UserLastEditorUserId
1. US Dan S.
UserOwnerUserId
1. US Dan S.
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
CommentsPostId
1. This table or related slice is empty.
