StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
12948364
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2012-10-18T06:25:21.683
FavoriteCount
0
LastActivityDate
2014-09-11T05:25:51.423
LastEditDate
2014-09-11T05:25:51.423
LastEditorUserId
1642811
OwnerUserId
1642811
ParentId
12172279
PostTypeId
2
Score
8
ViewCount
0
LastEditorDisplayName
text
Body
Lets take an example of addition of 4*4 matrices.. you have two matrices A and B, having dimension 4*4.. <pre><code>int main() { int *a, *b, *c; //To store your matrix A & B in RAM. Result will be stored in matrix C int *ad, *bd, *cd; // To store matrices into GPU's RAM. int N =4; //No of rows and columns. size_t size=sizeof(float)* N * N; a=(float*)malloc(size); //Allocate space of RAM for matrix A b=(float*)malloc(size); //Allocate space of RAM for matrix B //allocate memory on device cudaMalloc(&ad,size); cudaMalloc(&bd,size); cudaMalloc(&cd,size); //initialize host memory with its own indices for(i=0;i<N;i++) { for(j=0;j<N;j++) { a[i * N + j]=(float)(i * N + j); b[i * N + j]= -(float)(i * N + j); } } //copy data from host memory to device memory cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice); cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice); //calculate execution configuration dim3 grid (1, 1, 1); dim3 block (16, 1, 1); //each block contains N * N threads, each thread calculates 1 data element add_matrices<<<grid, block>>>(ad, bd, cd, N); cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost); printf("Matrix A was---\n"); for(i=0;i<N;i++) { for(j=0;j<N;j++) printf("%f ",a[i*N+j]); printf("\n"); } printf("\nMatrix B was---\n"); for(i=0;i<N;i++) { for(j=0;j<N;j++) printf("%f ",b[i*N+j]); printf("\n"); } printf("\nAddition of A and B gives C----\n"); for(i=0;i<N;i++) { for(j=0;j<N;j++) printf("%f ",c[i*N+j]); //if correctly evaluated, all values will be 0 printf("\n"); } //deallocate host and device memories cudaFree(ad); cudaFree(bd); cudaFree (cd); free(a); free(b); free(c); getch(); return 1; } /////Kernel Part __global__ void add_matrices(float *ad,float *bd,float *cd,int N) { int index; index = blockIDx.x * blockDim.x + threadIDx.x cd[index] = ad[index] + bd[index]; } </code></pre> Lets take an example of addition of 16*16 matrices.. you have two matrices A and B, having dimension 16*16.. First of all you have to decide your thread configuration. You are suppose to launch a kernel function, which will perform the parallel computation of you matrix addition, which will get executed on your GPU device. Now,, one grid is launched with one kernel function.. A grid can have max 65,535 no of blocks which can be arranged in 3 dimensional ways. (65535 * 65535 * 65535). Every block in grid can have max 1024 no of threads.Those threads can also be arranged in 3 dimensional ways (1024 * 1024 * 64) Now our problem is addition of 16 * 16 matrices.. <pre><code>A | 1 2 3 4 | B | 1 2 3 4 | C| 1 2 3 4 | | 5 6 7 8 | + | 5 6 7 8 | = | 5 6 7 8 | | 9 10 11 12 | | 9 10 11 12 | | 9 10 11 12 | | 13 14 15 16| | 13 14 15 16| | 13 14 15 16| </code></pre> We need 16 threads to perform the computation. <pre><code>i.e. A(1,1) + B (1,1) = C(1,1) A(1,2) + B (1,2) = C(1,2) . . . . . . A(4,4) + B (4,4) = C(4,4) </code></pre> All these threads will get executed simultaneously. So we need a block with 16 threads. For our convenience we will arrange threads in (16 * 1 * 1) way in a block As no of threads are 16 so we need one block only to store those 16 threads. so, grid configuration will be <code>dim3 Grid(1,1,1)</code> i.e. grid will have only one block and block configuration will be <code>dim3 block(16,1,1)</code> i.e. block will have 16 threads arranged column wise. Following program will give you the clear idea about its execution.. Understanding the indexing part(i.e. threadIDs, blockDim, blockID) is the important part. You need to go through the CUDA literature. Once you have clear idea about indexing, you will win the half battle! So spend some time with cuda books, different algorithms and paper-pencil of course!
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POHow is a CUDA kernel launched?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USsandeep.ganage
UserOwnerUserId
1. USsandeep.ganage
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.