
Low performance CUDA code on GT540M
<p>Executing the following code sample takes ~750 ms on a GeForce GT540M, whereas the same code executes in ~250 ms on a GT330M.</p> <p>Copying dev_a and dev_b to device memory takes ~350 ms on the GT540M and ~250 ms on the GT330M. Executing "addCuda" and copying the result back to the host takes another ~400 ms on the GT540M and ~0 ms on the GT330M.</p> <p>This is not what I expected, so I checked the devices' properties and discovered that the GT540M surpasses or equals the GT330M in every way except the number of multiprocessors - the GT540M has 2 and the GT330M has 6. Can this really be true? And if so, can it really have such a great impact on the execution time?</p>
<pre><code>#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;math.h&gt;

#define T 512
#define N 60000*T

__global__ void addCuda(double *a, double *b, double *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid &lt; N)
    {
        c[tid] = sqrt(fabs(a[tid] * b[tid] / 12.34567)) * cos(a[tid]);
    }
}

int main()
{
    double *dev_a, *dev_b, *dev_c;
    double* a = (double*)malloc(N*sizeof(double));
    double* b = (double*)malloc(N*sizeof(double));
    double* c = (double*)malloc(N*sizeof(double));

    printf("Filling arrays (CPU)...\n\n");
    int i;
    for(i = 0; i &lt; N; i++)
    {
        a[i] = (double)-i;
        b[i] = (double)i;
    }

    int timer = clock();

    cudaMalloc((void**) &amp;dev_a, N*sizeof(double));
    cudaMalloc((void**) &amp;dev_b, N*sizeof(double));
    cudaMalloc((void**) &amp;dev_c, N*sizeof(double));
    cudaMemcpy(dev_a, a, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(double), cudaMemcpyHostToDevice);

    printf("Memcpy time: %d\n", clock() - timer);

    addCuda&lt;&lt;&lt;(N+T-1)/T,T&gt;&gt;&gt;(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(double), cudaMemcpyDeviceToHost);

    printf("Time elapsed: %d\n", clock() - timer);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);
    return 0;
}
</code></pre>
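As an aside on the measurements: host-side <code>clock()</code> has platform-dependent semantics (CPU time on POSIX systems, wall-clock time on Windows) and coarse resolution, so device work is usually timed with CUDA events instead. A minimal, untested sketch of timing just the kernel from the sample above with events (same variable names as the question's code; this is one common approach, not necessarily what the timings above used):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // enqueue 'start' on the default stream
addCuda<<<(N+T-1)/T, T>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);                  // enqueue 'stop' after the kernel
cudaEventSynchronize(stop);                // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time between events, in ms
printf("Kernel time: %.1f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because kernel launches are asynchronous, this isolates the kernel's own runtime from the memcpy costs, which the combined "Time elapsed" measurement in the sample cannot do.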
<p>The device properties for the devices:</p> <p><strong>GT540M:</strong></p>
<pre><code>Major revision number: 2
Minor revision number: 1
Name: GeForce GT 540M
Total global memory: 1073741824
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1344000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: Yes
</code></pre>
<p><strong>GT330M:</strong></p>
<pre><code>Major revision number: 1
Minor revision number: 2
Name: GeForce GT 330M
Total global memory: 268435456
Total shared memory per block: 16384
Total registers per block: 16384
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 0 of block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 1
Clock rate: 1100000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: Yes
Number of multiprocessors: 6
Kernel execution timeout: Yes
</code></pre>