
CUDA: understanding concurrent kernel execution
I'm trying to understand how concurrent kernel execution works, so I have written a simple program to explore it. The kernel populates a 2D array using 2 streams. With 1 stream (no concurrency) I get the correct results; when I try it with 2 streams (attempting concurrency), I get the wrong results. I believe the problem is either in the memory transfers, which I'm not quite sure I have set up correctly, or in the way I have set up the kernel. The programming guide does not explain it well enough for me. For my purposes, I need Matlab to be calling the kernel.

As I understand it, the main program will (a sketch of this pattern is shown right after the list):

- allocate the pinned memory on the host
- allocate the memory on the GPU required for a single stream (2 streams = half the total memory of the host)
- create the streams
- loop through the streams:
  - copy the memory for a single stream from host to device using cudaMemcpyAsync()
  - execute the kernel for the stream
  - copy the memory for the stream back to the host using cudaMemcpyAsync()
    - I believe I'm doing the right thing by referencing the memory at the location I need for each stream, using an offset based on the size of the data for each stream and the stream number.
- destroy the streams
- free the memory
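For reference, here is a minimal standalone sketch of the pattern I'm describing, assuming one device chunk per stream and per-stream element offsets into a single pinned host array. It uses a plain `main()` rather than a mex entry point, and the names `fillChunk`, `deviceChunk`, and `rowsPerStream` are placeholder names, not from my actual code:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel: each thread fills one row of its stream's chunk
// with a simple pattern (not the same formula as the real kernel).
__global__ void fillChunk(int width, int rowsPerStream, double *chunk)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rowsPerStream)
        for (int col = 0; col < width; ++col)
            chunk[row * width + col] = row * width + col + 1.0;
}

int main()
{
    const int width = 512, height = 512, numberOfStreams = 2;
    const int fullSize      = width * height;
    const int streamSize    = fullSize / numberOfStreams;   // elements per stream
    const int rowsPerStream = height / numberOfStreams;

    double *hostArray = NULL;                                // pinned, full size
    cudaMallocHost(&hostArray, sizeof(double) * fullSize);

    // One device chunk per stream, so concurrent copies/kernels never share a buffer.
    double *deviceChunk[numberOfStreams];
    cudaStream_t stream[numberOfStreams];
    for (int i = 0; i < numberOfStreams; ++i)
    {
        cudaMalloc(&deviceChunk[i], sizeof(double) * streamSize);
        cudaStreamCreate(&stream[i]);
    }

    for (int i = 0; i < numberOfStreams; ++i)
    {
        int offset = i * streamSize;                         // element offset, not bytes

        // Copy in, run the kernel, copy out -- mirroring the list above,
        // even though this toy kernel overwrites the whole chunk anyway.
        cudaMemcpyAsync(deviceChunk[i], hostArray + offset,
                        sizeof(double) * streamSize,
                        cudaMemcpyHostToDevice, stream[i]);

        fillChunk<<<(rowsPerStream + 255) / 256, 256, 0, stream[i]>>>(
            width, rowsPerStream, deviceChunk[i]);

        cudaMemcpyAsync(hostArray + offset, deviceChunk[i],
                        sizeof(double) * streamSize,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    cudaDeviceSynchronize();                                 // wait before reading results

    printf("hostArray[0] = %g, hostArray[last] = %g\n",
           hostArray[0], hostArray[fullSize - 1]);

    for (int i = 0; i < numberOfStreams; ++i)
    {
        cudaFree(deviceChunk[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(hostArray);
    return 0;
}
```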
Here is the code I am attempting to use.

concurrentKernel.cpp:

```cuda
__global__ void concurrentKernel(int const width,
                                 int const streamIdx,
                                 double *array)
{
    int thread = (blockIdx.x * blockDim.x) + threadIdx.x;

    for (int i = 0; i < width; i++)
    {
        array[thread*width+i] = thread+i*width+1;
//      array[thread*width+i+streamIdx] = thread+i*width+streamIdx*width/2;
    }
}
```

concurrentMexFunction.cu:

```cuda
#include <stdio.h>
#include <math.h>
#include "mex.h"

/* Kernel function */
#include "concurrentKernel.cpp"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray *prhs[])
{
    int const numberOfStreams = 2; // set number of streams to use here.
    cudaError_t cudaError;
    int offset;

    int width, height, fullSize, streamSize;
    width = 512;
    height = 512;
    fullSize = height*width;
    streamSize = (int)(fullSize/numberOfStreams);
    mexPrintf("fullSize: %d, streamSize: %d\n", fullSize, streamSize);

    /* Return the populated array */
    double *returnedArray;
    plhs[0] = mxCreateDoubleMatrix(height, width, mxREAL);
    returnedArray = mxGetPr(plhs[0]);

    cudaStream_t stream[numberOfStreams];
    for (int i = 0; i < numberOfStreams; i++)
    {
        cudaStreamCreate(&stream[i]);
    }

    /* host memory */
    double *hostArray;
    cudaError = cudaMallocHost(&hostArray, sizeof(double)*fullSize);   // full size of array.
    if (cudaError != cudaSuccess) {mexPrintf("hostArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            hostArray[i*width+j] = -1.0;
        }
    }

    /* device memory */
    double *deviceArray;
    cudaError = cudaMalloc((void **)&deviceArray, sizeof(double)*streamSize);   // size of array for each stream.
    if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    for (int i = 0; i < numberOfStreams; i++)
    {
        offset = i;//*streamSize;
        mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

        cudaMemcpyAsync(deviceArray, hostArray+offset, sizeof(double)*streamSize, cudaMemcpyHostToDevice, stream[i]);
        if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

        concurrentKernel<<<1, 512, 0, stream[i]>>>(width, i, deviceArray);

        cudaMemcpyAsync(returnedArray+offset, deviceArray, sizeof(double)*streamSize, cudaMemcpyDeviceToHost, stream[i]);
        if (cudaError != cudaSuccess) {mexPrintf("returnedArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

        mexPrintf("returnedArray[offset]: %g, [end]: %g\n", returnedArray[offset/sizeof(double)], returnedArray[(i+1)*streamSize-1]);
    }

    for (int i = 0; i < numberOfStreams; i++)
    {
        cudaStreamDestroy(stream[i]);
    }

    cudaFree(hostArray);
    cudaFree(deviceArray);
}
```

When there are 2 streams, the result is an array of zeros, which makes me think I'm doing something wrong with the memory. Can anyone explain what I am doing wrong? If anyone needs help compiling and running these from Matlab, I can provide the commands to do so.

Update:

```cuda
for (int i = 0; i < numberOfStreams; i++)
{
    offset = i*streamSize;
    mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

    cudaMemcpyAsync(deviceArray, hostArray+offset, sizeof(double)*streamSize, cudaMemcpyHostToDevice, stream[i]);
    if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    concurrentKernel<<<1, 512, 0, stream[i]>>>(width, i, deviceArray);
}

cudaDeviceSynchronize();

for (int i = 0; i < numberOfStreams; i++)
{
    offset = i*streamSize;
    mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

    cudaMemcpyAsync(returnedArray+offset, deviceArray, sizeof(double)*streamSize, cudaMemcpyDeviceToHost, stream[i]);
    if (cudaError != cudaSuccess) {mexPrintf("returnedArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    mexPrintf("returnedArray[offset]: %g, [end]: %g\n", returnedArray[offset/sizeof(double)], returnedArray[(i+1)*streamSize-1]);

    cudaStreamDestroy(stream[i]);
}
```
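A side note on the error checks in both versions of the loop: `cudaError` is only assigned by the two allocation calls, so the `if (cudaError != cudaSuccess)` tests after the asynchronous copies and the kernel launch never see the status of those operations. Below is a minimal sketch of how each return value could be captured instead; the helper names `failed` and `issueStream` are illustrative and not part of the code above, and the sketch assumes the same `concurrentKernel`, buffers, and stream array:

```cuda
#include <cuda_runtime.h>
#include "mex.h"

#include "concurrentKernel.cpp"   // same kernel as above

/* Illustrative helper: print the CUDA error string via mexPrintf and
   report whether the call failed. */
static bool failed(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
    {
        mexPrintf("%s failed\n********** Error: %s **********\n",
                  what, cudaGetErrorString(err));
        return true;
    }
    return false;
}

/* Sketch of one iteration of the stream loop with every return value
   captured. issueStream() and its parameter list are illustrative. */
static bool issueStream(double *deviceArray, double *hostArray, double *returnedArray,
                        int offset, int streamSize, int width, int i, cudaStream_t s)
{
    if (failed(cudaMemcpyAsync(deviceArray, hostArray + offset,
                               sizeof(double) * streamSize,
                               cudaMemcpyHostToDevice, s),
               "host-to-device copy")) return false;

    concurrentKernel<<<1, 512, 0, s>>>(width, i, deviceArray);
    if (failed(cudaGetLastError(), "kernel launch")) return false;

    if (failed(cudaMemcpyAsync(returnedArray + offset, deviceArray,
                               sizeof(double) * streamSize,
                               cudaMemcpyDeviceToHost, s),
               "device-to-host copy")) return false;

    return true;
}
```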