
# CUDA kernel is slower than CPU
I'm new to CUDA, so I'm probably doing something wrong. All I need is a logical operation on two binary vectors, each 2,048,000 elements long. I compared the speed of a logical `and` in a Matlab C MEX file against a CUDA kernel; the C code on the CPU is ~5% faster than CUDA. Please note that I measured only the kernel execution (without the memory transfers). My hardware: i7 930 and 9800GT.

MEX file `testCPU.c`:

```c
#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[] )
{
    int i;
    unsigned char *vars, *output;

    /* input is a logical matrix, stored one byte per element */
    vars = (unsigned char *) mxGetData(prhs[0]);

    plhs[0] = mxCreateLogicalMatrix(2048000, 1);
    output = (unsigned char *) mxGetData(plhs[0]);

    /* AND the first column with the second column */
    for (i = 0; i < 2048000; i++) {
        output[i] = vars[i] & vars[2048000 + i];
    }
}
```

Compile:

```
mex testCPU.c
```

Create the vectors:

```matlab
vars = ~~(randi(2,2048000,2)-1);
```

Measure the speed:

```matlab
tic; testCPU(vars); toc;
```

**CUDA** file `testGPU.cu`:

```cuda
#include "mex.h"
#include "cuda.h"

__global__ void logical_and(unsigned char* in, unsigned char* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    out[idx] = in[idx] && in[idx+N];
}

void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[] )
{
    unsigned char *vars, *output, *gpu, *gpures;

    vars = (unsigned char *) mxGetData(prhs[0]);

    plhs[0] = mxCreateLogicalMatrix(2048000, 1);
    output = (unsigned char *) mxGetData(plhs[0]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float dt_ms;

    // input GPU malloc
    cudaEventRecord(start, 0);
    cudaMalloc((void **) &gpu, sizeof(unsigned char)*4096000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU input malloc: %f ms, %i\n", dt_ms, cudaGetLastError());

    // output GPU malloc
    cudaEventRecord(start, 0);
    cudaMalloc((void **) &gpures, sizeof(unsigned char)*2048000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU output malloc: %f ms, %i\n", dt_ms, cudaGetLastError());

    // copy input from CPU to GPU
    cudaEventRecord(start, 0);
    cudaMemcpy(gpu, vars, sizeof(unsigned char)*4096000, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("copy input from CPU to GPU: %f ms, %i\n", dt_ms, cudaGetLastError());

    dim3 dimBlock(32);
    printf("thread count: %i\n", dimBlock.x);
    dim3 dimGrid(2048000/dimBlock.x);
    printf("block count: %i\n", dimGrid.x);

    // --- KERNEL ---
    cudaEventRecord(start, 0);
    logical_and<<<dimGrid, dimBlock>>>(gpu, gpures, 2048000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU kernel: %f ms, %i\n", dt_ms, cudaGetLastError());

    // copy result from GPU to CPU
    cudaEventRecord(start, 0);
    cudaMemcpy(output, gpures, sizeof(unsigned char)*2048000, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("copy output from GPU to CPU: %f ms, %i\n", dt_ms, cudaGetLastError());

    cudaFree(gpu);
    cudaFree(gpures);
}
```

Compile:

```
nvmex -f nvmexopts_9.bat testGPU.cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\lib\x64" -lcudart -lcufft
```

Output:

```
GPU input malloc: 0.772160 ms, 0
GPU output malloc: 0.041728 ms, 0
copy input from CPU to GPU: 1.494784 ms, 0
thread count: 32
block count: 64000
*** GPU kernel: 3.761216 ms, 0 ***
copy output from GPU to CPU: 1.203488 ms, 0
```

Is that code OK? The CPU was ~0.1 ms faster than the CUDA kernel. I tried different thread counts (multiples of 32) up to 512; 32 was the fastest. Using the operator `&` instead of `&&` was almost 1 ms slower.

Is the 9800GT really so weak? What speed-up can I expect with today's mainstream cards (e.g. GTX 460 or 560)?

Thank you
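(Aside: the kernel above relies on 2,048,000 being an exact multiple of the block size, so every launched thread maps to a valid element. A bounds-checked variant, just as a sketch and not what I timed above; the name `logical_and_guarded` is only an illustrative choice:)

```cuda
// Sketch: same kernel with an explicit guard, so the grid size
// no longer has to divide the element count exactly.
__global__ void logical_and_guarded(unsigned char* in, unsigned char* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)                        // skip out-of-range threads
        out[idx] = in[idx] && in[idx+N];
}
```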
### EDIT: based on talonmies' comment, I made these modifications:

Kernel function:

```cuda
__global__ void logical_and(uchar4* in, uchar4* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    out[idx].x = in[idx].x & in[idx+N].x;
    out[idx].y = in[idx].y & in[idx+N].y;
    out[idx].z = in[idx].z & in[idx+N].z;
    out[idx].w = in[idx].w & in[idx+N].w;
}
```

Main function:

```cuda
uchar4 *gpu, *gpures;

// 32 was worst; 64, 128, 256 and 512 were similar
dim3 dimBlock(128);
// the block count is now 4x smaller
dim3 dimGrid(512000/dimBlock.x);
```

Output:

```
GPU input malloc: 0.043360 ms, 0
GPU output malloc: 0.038592 ms, 0
copy input from CPU to GPU: 1.499584 ms, 0
thread count: 128
block count: 4000
*** GPU kernel: 0.131296 ms, 0 ***
copy output from GPU to CPU: 1.281120 ms, 0
```

Is that correct? Almost a 30x speed-up! It seems too good to be true, but the result is correct :) How much faster would a GTX 560 be on this particular task? Thx

### Edit 2:

Is this code

```cuda
__global__ void logical_and(uchar4* in, uchar4* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    out[idx].x = in[idx].x & in[idx+N].x;
    out[idx].y = in[idx].y & in[idx+N].y;
    out[idx].z = in[idx].z & in[idx+N].z;
    out[idx].w = in[idx].w & in[idx+N].w;
}
```

automatically transformed by the compiler into this?

```cuda
__global__ void logical_and(uchar4* in, uchar4* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    uchar4 buff;

    buff.x = in[idx].x;
    buff.y = in[idx].y;
    buff.z = in[idx].z;
    buff.w = in[idx].w;

    buff.x &= in[idx+N].x;
    buff.y &= in[idx+N].y;
    buff.z &= in[idx+N].z;
    buff.w &= in[idx+N].w;

    out[idx].x = buff.x;
    out[idx].y = buff.y;
    out[idx].z = buff.z;
    out[idx].w = buff.w;
}
```

If so, that explains my confusion about coalesced access. I thought that `in[idx] & in[idx+N]` would lead to non-coalesced access because it touches non-contiguous memory, but in fact `in[idx]` and `in[idx+N]` are loaded in two coalesced steps. `N` can be any multiple of 16, because a `uchar4` is 4 bytes long and, for coalesced access, the address must be aligned to 64 bytes (on a compute capability 1.1 device). Am I right?
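(For what it's worth, the same bytewise AND can also be written with plain 32-bit words instead of `uchar4`, which makes the one-word-per-thread load explicit: bitwise AND produces no carries between bytes, so ANDing four packed bytes in one operation is safe. A minimal sketch; the name `logical_and_u32` and the `if (idx < N)` guard are my additions, not something I timed:)

```cuda
// Sketch: each thread loads one 32-bit word (= 4 packed logical bytes)
// from each input, ANDs them, and stores one word. Each half-warp then
// reads 16 x 4 = 64 contiguous bytes, the coalesced segment size on a
// compute capability 1.1 device.
__global__ void logical_and_u32(const unsigned int* in, unsigned int* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)                          // N = number of 32-bit words (512000 here)
        out[idx] = in[idx] & in[idx+N];   // bytewise AND of four bytes at once
}
```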