
CUDA GPU slower than CPU
I am having trouble figuring out why my CUDA code runs slower than my CPU code.

My desktop configuration is an i7 2600S with a GeForce 560 Ti, and my code is as follows:

```cuda
int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
    // time measurement
    float elapsed_time_ms = 0;
    cudaEvent_t start, stop;

    // threads per block
    dim3 threads(16, 16);
    // blocks
    dim3 blocks((width + threads.x - 1) / threads.x, (height + threads.y - 1) / threads.y);

    int *device_Seam;
    int *host_Seam;
    int seamSize;
    if (direction == 1)
    {
        seamSize = height * sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for (int i = 0; i < height; i++)
            host_Seam[i] = seam[i];
    }
    else
    {
        seamSize = width * sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for (int i = 0; i < width; i++)
            host_Seam[i] = seam[i];
    }

    cudaMalloc((void**)&device_Seam, seamSize);
    cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);

    global_host_MC = MCEnergyMat;
    new_host_MC = newE;

    // copy host arrays to device, one row at a time
    cudaMemcpy(global_MC, global_MC2, sizeof(int*) * width, cudaMemcpyHostToDevice);
    for (int i = 0; i < width; i++)
        cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int) * height, cudaMemcpyHostToDevice);

    cudaMemcpy(new_MC, new_MC2, sizeof(int*) * width, cudaMemcpyHostToDevice);
    for (int i = 0; i < width; i++)
        cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int) * height, cudaMemcpyHostToDevice);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // do some operations on the 2D matrix
    gpu_shiftSeam<<< blocks, threads >>>(global_MC, new_MC, device_Seam, width, height);

    // measure end time for the GPU calculation
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    execTime += elapsed_time_ms;

    // copy the result back to the host
    for (int i = 0; i < width; i++)
    {
        cudaMemcpy(newE[i], new_MC2[i], sizeof(int) * height, cudaMemcpyDeviceToHost);
    }

    return newE;
}
```

I looped it 800 times and got the following results:

GPU computation time (the gpu_shiftSeam part): 1176 ms. Total program run time: 22 s.
CPU computation time (the same operation as gpu_shiftSeam, but on the host): 12522 ms. Total program run time: 12 s.

Apparently the GPU computation time is far shorter than the CPU one, yet for some reason the total program run time of the GPU version is much longer. Does anyone know why? Is it because the number of threads/blocks I am assigning is incorrect, or is the "slowness" coming from allocating memory on the device?

Thanks a lot!
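One way to narrow this down is to time the host-to-device transfers with the same cudaEvent machinery already used around the kernel, and to compare a single bulk copy of a flattened matrix against a per-row copy loop. The sketch below is only illustrative: the buffer names (h_flat, d_flat), the matrix size, and dummyKernel are hypothetical stand-ins, not the original gpu_shiftSeam code.

```cuda
// Minimal sketch (hypothetical names/sizes): time the host-to-device copy
// separately from the kernel to see where the wall-clock time goes, and copy
// the matrix as one contiguous block instead of one cudaMemcpy per row.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;   // stand-in for the real per-element work
}

int main()
{
    const int width = 1024, height = 1024;          // hypothetical matrix size
    const size_t bytes = size_t(width) * height * sizeof(int);

    // One contiguous host buffer instead of an array of row pointers.
    int *h_flat = (int*)malloc(bytes);
    for (size_t i = 0; i < size_t(width) * height; ++i) h_flat[i] = (int)i;

    int *d_flat;
    cudaMalloc((void**)&d_flat, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float copy_ms = 0.0f, kernel_ms = 0.0f;

    // Time the single bulk copy (the code in the question issues 'width' small copies here).
    cudaEventRecord(start, 0);
    cudaMemcpy(d_flat, h_flat, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&copy_ms, start, stop);

    // Time the kernel on its own, as in the question.
    dim3 threads(256);
    dim3 blocks((width * height + threads.x - 1) / threads.x);
    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_flat, width * height);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&kernel_ms, start, stop);

    printf("H2D copy: %.3f ms, kernel: %.3f ms\n", copy_ms, kernel_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_flat);
    free(h_flat);
    return 0;
}
```

If the measured copy time, multiplied by the 800 iterations, accounts for most of the gap between the 1176 ms kernel total and the 22 s program total, the overhead is in the transfers and allocations rather than in the block/thread configuration.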
 
