
CUDA: understanding concurrent kernel execution
I'm trying to understand how concurrent kernel execution works, so I have written a simple program to explore it. The kernel populates a 2D array using 2 streams. With 1 stream (no concurrency) I get the correct results; when I try it with 2 streams (attempting concurrency), I get the wrong results. I believe the problem is either in the memory transfers, which I'm not quite sure I have set up correctly, or in the way I have set up the kernel. The programming guide does not explain it well enough for me. For my purposes, I need Matlab to be calling the kernel.

As I understand it, the main program will (a sketch of this pattern is shown right after the list):

- allocate the pinned memory on the host
- allocate the memory on the GPU required for a single stream (2 streams = half the total memory of the host)
- create the streams
- loop through the streams:
  - copy the memory for a single stream from host to device using cudaMemcpyAsync()
  - execute the kernel for the stream
  - copy the memory for the stream back to the host using cudaMemcpyAsync()
    - I believe I'm doing the right thing by referencing the memory at the location I need for each stream, using an offset based on the size of the data for each stream and the stream number.
- destroy the streams
- free the memory
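For reference, here is a minimal standalone sketch of the pattern I'm describing, assuming one device chunk per stream and per-stream element offsets into a single pinned host array. It uses a plain `main()` rather than a mex entry point, and the names `fillChunk`, `deviceChunk`, and `rowsPerStream` are placeholder names, not from my actual code:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel: each thread fills one row of its stream's chunk
// with a simple pattern (not the same formula as the real kernel).
__global__ void fillChunk(int width, int rowsPerStream, double *chunk)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rowsPerStream)
        for (int col = 0; col < width; ++col)
            chunk[row * width + col] = row * width + col + 1.0;
}

int main()
{
    const int width = 512, height = 512, numberOfStreams = 2;
    const int fullSize      = width * height;
    const int streamSize    = fullSize / numberOfStreams;   // elements per stream
    const int rowsPerStream = height / numberOfStreams;

    double *hostArray = NULL;                                // pinned, full size
    cudaMallocHost(&hostArray, sizeof(double) * fullSize);

    // One device chunk per stream, so concurrent copies/kernels never share a buffer.
    double *deviceChunk[numberOfStreams];
    cudaStream_t stream[numberOfStreams];
    for (int i = 0; i < numberOfStreams; ++i)
    {
        cudaMalloc(&deviceChunk[i], sizeof(double) * streamSize);
        cudaStreamCreate(&stream[i]);
    }

    for (int i = 0; i < numberOfStreams; ++i)
    {
        int offset = i * streamSize;                         // element offset, not bytes

        // Copy in, run the kernel, copy out -- mirroring the list above,
        // even though this toy kernel overwrites the whole chunk anyway.
        cudaMemcpyAsync(deviceChunk[i], hostArray + offset,
                        sizeof(double) * streamSize,
                        cudaMemcpyHostToDevice, stream[i]);

        fillChunk<<<(rowsPerStream + 255) / 256, 256, 0, stream[i]>>>(
            width, rowsPerStream, deviceChunk[i]);

        cudaMemcpyAsync(hostArray + offset, deviceChunk[i],
                        sizeof(double) * streamSize,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    cudaDeviceSynchronize();                                 // wait before reading results

    printf("hostArray[0] = %g, hostArray[last] = %g\n",
           hostArray[0], hostArray[fullSize - 1]);

    for (int i = 0; i < numberOfStreams; ++i)
    {
        cudaFree(deviceChunk[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(hostArray);
    return 0;
}
```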
Here is the code I am attempting to use.

concurrentKernel.cpp:

```cuda
__global__ void concurrentKernel(int const width,
                                 int const streamIdx,
                                 double *array)
{
    int thread = (blockIdx.x * blockDim.x) + threadIdx.x;

    for (int i = 0; i < width; i++)
    {
        array[thread*width+i] = thread+i*width+1;
//      array[thread*width+i+streamIdx] = thread+i*width+streamIdx*width/2;
    }
}
```

concurrentMexFunction.cu:

```cuda
#include <stdio.h>
#include <math.h>
#include "mex.h"

/* Kernel function */
#include "concurrentKernel.cpp"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray *prhs[])
{
    int const numberOfStreams = 2; // set number of streams to use here.
    cudaError_t cudaError;
    int offset;

    int width, height, fullSize, streamSize;
    width = 512;
    height = 512;
    fullSize = height*width;
    streamSize = (int)(fullSize/numberOfStreams);
    mexPrintf("fullSize: %d, streamSize: %d\n", fullSize, streamSize);

    /* Return the populated array */
    double *returnedArray;
    plhs[0] = mxCreateDoubleMatrix(height, width, mxREAL);
    returnedArray = mxGetPr(plhs[0]);

    cudaStream_t stream[numberOfStreams];
    for (int i = 0; i < numberOfStreams; i++)
    {
        cudaStreamCreate(&stream[i]);
    }

    /* host memory */
    double *hostArray;
    cudaError = cudaMallocHost(&hostArray, sizeof(double)*fullSize);   // full size of array.
    if (cudaError != cudaSuccess) {mexPrintf("hostArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            hostArray[i*width+j] = -1.0;
        }
    }

    /* device memory */
    double *deviceArray;
    cudaError = cudaMalloc((void **)&deviceArray, sizeof(double)*streamSize);   // size of array for each stream.
    if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    for (int i = 0; i < numberOfStreams; i++)
    {
        offset = i;//*streamSize;
        mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

        cudaMemcpyAsync(deviceArray, hostArray+offset, sizeof(double)*streamSize, cudaMemcpyHostToDevice, stream[i]);
        if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

        concurrentKernel<<<1, 512, 0, stream[i]>>>(width, i, deviceArray);

        cudaMemcpyAsync(returnedArray+offset, deviceArray, sizeof(double)*streamSize, cudaMemcpyDeviceToHost, stream[i]);
        if (cudaError != cudaSuccess) {mexPrintf("returnedArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

        mexPrintf("returnedArray[offset]: %g, [end]: %g\n", returnedArray[offset/sizeof(double)], returnedArray[(i+1)*streamSize-1]);
    }

    for (int i = 0; i < numberOfStreams; i++)
    {
        cudaStreamDestroy(stream[i]);
    }

    cudaFree(hostArray);
    cudaFree(deviceArray);
}
```

When there are 2 streams, the result is an array of zeros, which makes me think I'm doing something wrong with the memory. Can anyone explain what I am doing wrong? If anyone needs help compiling and running these from Matlab, I can provide the commands to do so.

Update:

```cuda
for (int i = 0; i < numberOfStreams; i++)
{
    offset = i*streamSize;
    mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

    cudaMemcpyAsync(deviceArray, hostArray+offset, sizeof(double)*streamSize, cudaMemcpyHostToDevice, stream[i]);
    if (cudaError != cudaSuccess) {mexPrintf("deviceArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    concurrentKernel<<<1, 512, 0, stream[i]>>>(width, i, deviceArray);
}

cudaDeviceSynchronize();

for (int i = 0; i < numberOfStreams; i++)
{
    offset = i*streamSize;
    mexPrintf("offset: %d, element: %d\n", offset*sizeof(double), offset);

    cudaMemcpyAsync(returnedArray+offset, deviceArray, sizeof(double)*streamSize, cudaMemcpyDeviceToHost, stream[i]);
    if (cudaError != cudaSuccess) {mexPrintf("returnedArray memory allocation failed\n********** Error: %s **********\n", cudaGetErrorString(cudaError)); return; }

    mexPrintf("returnedArray[offset]: %g, [end]: %g\n", returnedArray[offset/sizeof(double)], returnedArray[(i+1)*streamSize-1]);

    cudaStreamDestroy(stream[i]);
}
```
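A side note on the error checks in both versions of the loop: `cudaError` is only assigned by the two allocation calls, so the `if (cudaError != cudaSuccess)` tests after the asynchronous copies and the kernel launch never see the status of those operations. Below is a minimal sketch of how each return value could be captured instead; the helper names `failed` and `issueStream` are illustrative and not part of the code above, and the sketch assumes the same `concurrentKernel`, buffers, and stream array:

```cuda
#include <cuda_runtime.h>
#include "mex.h"

#include "concurrentKernel.cpp"   // same kernel as above

/* Illustrative helper: print the CUDA error string via mexPrintf and
   report whether the call failed. */
static bool failed(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
    {
        mexPrintf("%s failed\n********** Error: %s **********\n",
                  what, cudaGetErrorString(err));
        return true;
    }
    return false;
}

/* Sketch of one iteration of the stream loop with every return value
   captured. issueStream() and its parameter list are illustrative. */
static bool issueStream(double *deviceArray, double *hostArray, double *returnedArray,
                        int offset, int streamSize, int width, int i, cudaStream_t s)
{
    if (failed(cudaMemcpyAsync(deviceArray, hostArray + offset,
                               sizeof(double) * streamSize,
                               cudaMemcpyHostToDevice, s),
               "host-to-device copy")) return false;

    concurrentKernel<<<1, 512, 0, s>>>(width, i, deviceArray);
    if (failed(cudaGetLastError(), "kernel launch")) return false;

    if (failed(cudaMemcpyAsync(returnedArray + offset, deviceArray,
                               sizeof(double) * streamSize,
                               cudaMemcpyDeviceToHost, s),
               "device-to-host copy")) return false;

    return true;
}
```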