
Low performance CUDA code on GT540M
<p>Executing the following code sample takes ~750 ms on a GeForce GT540M, whereas the same code executes in ~250 ms on a GT330M.</p> <p>Copying dev_a and dev_b to device memory takes ~350 ms on the GT540M and ~250 ms on the GT330M. Executing "addCuda" and copying the result back to the host takes another ~400 ms on the GT540M and ~0 ms on the GT330M.</p> <p>This is not what I expected, so I checked the devices' properties and discovered that the GT540M surpasses or equals the GT330M in every way except the number of multiprocessors - the GT540M has 2 and the GT330M has 6. Can this really be true? And if so, can it really have such a great impact on the execution time?</p>
<pre><code>#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;math.h&gt;

#define T 512
#define N 60000*T

__global__ void addCuda(double *a, double *b, double *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid &lt; N)
    {
        c[tid] = sqrt(fabs(a[tid] * b[tid] / 12.34567)) * cos(a[tid]);
    }
}

int main()
{
    double *dev_a, *dev_b, *dev_c;
    double* a = (double*)malloc(N*sizeof(double));
    double* b = (double*)malloc(N*sizeof(double));
    double* c = (double*)malloc(N*sizeof(double));

    printf("Filling arrays (CPU)...\n\n");
    int i;
    for(i = 0; i &lt; N; i++)
    {
        a[i] = (double)-i;
        b[i] = (double)i;
    }

    int timer = clock();

    cudaMalloc((void**) &amp;dev_a, N*sizeof(double));
    cudaMalloc((void**) &amp;dev_b, N*sizeof(double));
    cudaMalloc((void**) &amp;dev_c, N*sizeof(double));
    cudaMemcpy(dev_a, a, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(double), cudaMemcpyHostToDevice);

    printf("Memcpy time: %d\n", clock() - timer);

    addCuda&lt;&lt;&lt;(N+T-1)/T,T&gt;&gt;&gt;(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(double), cudaMemcpyDeviceToHost);

    printf("Time elapsed: %d\n", clock() - timer);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);
    return 0;
}
</code></pre>
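As an aside on the measurements: host-side <code>clock()</code> has platform-dependent semantics (CPU time on POSIX systems, wall-clock time on Windows) and coarse resolution, so device work is usually timed with CUDA events instead. A minimal, untested sketch of timing just the kernel from the sample above with events (same variable names as the question's code; this is one common approach, not necessarily what the timings above used):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // enqueue 'start' on the default stream
addCuda<<<(N+T-1)/T, T>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);                  // enqueue 'stop' after the kernel
cudaEventSynchronize(stop);                // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time between events, in ms
printf("Kernel time: %.1f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because kernel launches are asynchronous, this isolates the kernel's own runtime from the memcpy costs, which the combined "Time elapsed" measurement in the sample cannot do.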
<p>The device properties for the devices:</p> <p><strong>GT540M:</strong></p>
<pre><code>Major revision number: 2
Minor revision number: 1
Name: GeForce GT 540M
Total global memory: 1073741824
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1344000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: Yes
</code></pre>
<p><strong>GT330M:</strong></p>
<pre><code>Major revision number: 1
Minor revision number: 2
Name: GeForce GT 330M
Total global memory: 268435456
Total shared memory per block: 16384
Total registers per block: 16384
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 0 of block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 1
Clock rate: 1100000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: Yes
Number of multiprocessors: 6
Kernel execution timeout: Yes
</code></pre>