<p>Let's take the addition of 4*4 matrices as an example. You have two matrices A and B, each of dimension 4*4.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

/////Kernel Part
// Each thread adds one element of A and B and stores it in C.
__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index &lt; N * N)   // ignore any thread beyond the last element
        cd[index] = ad[index] + bd[index];
}

int main()
{
    float *a, *b, *c;     // To store your matrices A &amp; B in RAM. Result will be stored in matrix C
    float *ad, *bd, *cd;  // To store matrices in the GPU's RAM.
    int N = 4;            // No of rows and columns.
    int i, j;

    size_t size = sizeof(float) * N * N;

    a = (float*)malloc(size);  // Allocate space in RAM for matrix A
    b = (float*)malloc(size);  // Allocate space in RAM for matrix B
    c = (float*)malloc(size);  // Allocate space in RAM for the result C

    // allocate memory on device
    cudaMalloc(&amp;ad, size);
    cudaMalloc(&amp;bd, size);
    cudaMalloc(&amp;cd, size);

    // initialize host memory with its own indices
    for (i = 0; i &lt; N; i++)
    {
        for (j = 0; j &lt; N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

    // copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // calculate execution configuration
    dim3 grid(1, 1, 1);
    dim3 block(16, 1, 1);  // each block contains N * N threads, each thread calculates 1 data element

    add_matrices&lt;&lt;&lt;grid, block&gt;&gt;&gt;(ad, bd, cd, N);

    // copy the result back; this call waits for the kernel to finish
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    printf("Matrix A was---\n");
    for (i = 0; i &lt; N; i++)
    {
        for (j = 0; j &lt; N; j++)
            printf("%f ", a[i * N + j]);
        printf("\n");
    }

    printf("\nMatrix B was---\n");
    for (i = 0; i &lt; N; i++)
    {
        for (j = 0; j &lt; N; j++)
            printf("%f ", b[i * N + j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for (i = 0; i &lt; N; i++)
    {
        for (j = 0; j &lt; N; j++)
            printf("%f ", c[i * N + j]);  // if correctly evaluated, all values will be 0
        printf("\n");
    }

    // deallocate host and device memories
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    free(a); free(b); free(c);

    return 0;
}
</code></pre>
<p>Now let's walk through that example. You have two matrices A and B of dimension 4*4, so each holds 16 elements.</p>
<p>First of all you have to decide your thread configuration. You are supposed to launch a kernel function, which performs the parallel computation of your matrix addition and gets executed on your GPU device.</p>
<p>One grid is launched per kernel call. The blocks of a grid can be arranged in up to 3 dimensions, and the limit applies per dimension (65535 * 65535 * 65535), not to the total number of blocks. On devices of compute capability 3.0 and later the x dimension can go up to 2^31 - 1.</p>
<p>Every block in the grid can have at most 1024 threads in total. Those threads can also be arranged in 3 dimensions, with per-dimension limits of (1024 * 1024 * 64), as long as their product does not exceed 1024.</p>
<p>Now our problem is the addition of 4 * 4 matrices. The numbers below label the 16 element positions, not the values.</p>
<pre><code>A | 1  2  3  4  |   B | 1  2  3  4  |   C | 1  2  3  4  |
  | 5  6  7  8  | +   | 5  6  7  8  | =   | 5  6  7  8  |
  | 9  10 11 12 |     | 9  10 11 12 |     | 9  10 11 12 |
  | 13 14 15 16 |     | 13 14 15 16 |     | 13 14 15 16 |
</code></pre>
<p>We need 16 threads to perform the computation, one per element.</p>
<pre><code>i.e. A(1,1) + B(1,1) = C(1,1)
     A(1,2) + B(1,2) = C(1,2)
        .        .        .
        .        .        .
     A(4,4) + B(4,4) = C(4,4)
</code></pre>
<p>All these threads can execute simultaneously, so we need a block with 16 threads. For our convenience we will arrange the threads in a (16 * 1 * 1) way in the block. As there are only 16 threads, one block is enough to hold them.</p>
<p>So the grid configuration will be <code>dim3 grid(1,1,1)</code>, i.e. the grid has only one block, and the block configuration will be <code>dim3 block(16,1,1)</code>, i.e. the block has 16 threads arranged along the x dimension.</p>
<p>The program above should give you a clear idea of how this executes. Understanding the indexing part (i.e. <code>threadIdx</code>, <code>blockDim</code>, <code>blockIdx</code>) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you have won half the battle! So spend some time with CUDA books, different algorithms and paper and pencil, of course!</p>