StackOverflow2013

plurals

PODividing up CUDA cudaMemcpy into chunks
primarykey
Id
6820468
data
AcceptedAnswerId
0
AnswerCount
3
ClosedDate
CommentCount
5
CommunityOwnedDate
CreationDate
2011-07-25T18:16:08
FavoriteCount
1
LastActivityDate
2012-11-09T10:00:41.507
LastEditDate
2011-07-28T21:26:09.510
LastEditorUserId
222488
OwnerUserId
222488
ParentId
0
PostTypeId
1
Score
1
ViewCount
2127
LastEditorDisplayName
text
Body
A co-worker and I were brainstorming on how to mitigate the memory transfer time between host and device, and it came up that perhaps arranging things into one mega-transfer (i.e. one single call) might help. This led me to create a test case where I timed transferring a few large data chunks vs. many small data chunks. I got some very interesting/strange results, and was wondering if anyone here has an explanation.

I won't put my whole code up here since it's quite long, but I tested the chunking in two different ways:
<ol>
<li>Explicitly writing out all cudaMemcpy's, e.g.:
<pre><code>cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 1*nBytes/10, a + 1*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 2*nBytes/10, a + 2*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 3*nBytes/10, a + 3*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 4*nBytes/10, a + 4*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 5*nBytes/10, a + 5*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 6*nBytes/10, a + 6*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 7*nBytes/10, a + 7*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 8*nBytes/10, a + 8*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 9*nBytes/10, a + 9*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
</code></pre>
</li>
<li>Putting the cudaMemcpy's into a for loop:
<pre><code>cudaEventRecord(start, 0);
for(int i = 0; i < nChunks; i++)
{
    cudaMemcpy(aD + i*nBytes/nChunks, a + i*nBytes/nChunks, nBytes/nChunks, cudaMemcpyHostToDevice);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
</code></pre>
</li>
</ol>
Note that I also did a "warm-up" transfer at the start of each test just in case, though I don't think it was needed (the context was created by a cudaMalloc call).

I tested this on total transfer sizes ranging from 1 MB to 1 GB, where each test case transferred the same amount of information regardless of how it was chunked up. A sample of my output is this:
<blockquote>
single large transfer = 0.451616 ms
10 explicit transfers = 0.198016 ms
100 explicit transfers = 0.691712 ms
10 looped transfers = 0.174848 ms
100 looped transfers = 0.683744 ms
1000 looped transfers = 6.145792 ms
10000 looped transfers = 104.981247 ms
100000 looped transfers = 13097.441406 ms
</blockquote>
What's interesting here, and what I don't get, is that across the board the 10 transfers were ALWAYS faster by a significant amount than any of the others, even the single large transfer! That result stayed consistent no matter how large or small the data set was (i.e. 10x100MB vs 1x1GB or 10x1MB vs 1x10MB still results in the 10x being faster).

If anyone has any insight on why this is, or what I may be doing wrong to get these weird numbers, I would be very interested to hear what you have to say. Thanks!

P.S. I know that cudaMemcpy carries with it an implicit synchronization, so I could have used a CPU timer and the cudaEventSynchronize is redundant, but I figured it was better to be on the safe side.

UPDATE: I wrote a function to try to take advantage of this apparent rip in the performance space-time continuum. When I use that function, though, which is written EXACTLY as in my test cases, the effect goes away and I see what I expect (a single cudaMemcpy is fastest). Perhaps this is all more akin to quantum physics than relativity, wherein the act of observing changes the behavior...
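The chunk arithmetic used in the copies above can be sketched on the host side as follows. This is a minimal sketch under stated assumptions, not the original test code: `copyInChunks` is a hypothetical helper, and `std::memcpy` stands in for `cudaMemcpy` so that the offsets can be checked without a GPU. It works on byte pointers because `aD + i*nBytes/nChunks` is a byte offset only when `aD` is a `char*`. If `aD` and `a` were declared as, say, `float*`, the same expression would advance by that many elements instead. The post does not show the declarations, so whether that applies here is an open question. Likewise, if `i` and `nBytes` are 32-bit ints, `i*nBytes` can overflow for the largest chunk counts, which the running offset below avoids.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies nBytes from src to dst in nChunks pieces and
// returns the number of bytes copied. std::memcpy is a stand-in for
// cudaMemcpy(dst, src, len, cudaMemcpyHostToDevice).
std::size_t copyInChunks(void* dst, const void* src,
                         std::size_t nBytes, std::size_t nChunks) {
    assert(dst != nullptr && src != nullptr);
    if (nChunks == 0) return 0;

    // Byte pointers: adding k to a char* moves k bytes, whereas adding k to
    // a float* would move k * sizeof(float) bytes.
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);

    const std::size_t base = nBytes / nChunks;  // size of every chunk but the last
    std::size_t copied = 0;                     // running byte offset, no i*nBytes product
    for (std::size_t i = 0; i < nChunks; ++i) {
        // The last chunk absorbs the remainder when nBytes % nChunks != 0.
        const std::size_t len = (i + 1 == nChunks) ? nBytes - copied : base;
        std::memcpy(d + copied, s + copied, len);
        copied += len;
    }
    return copied;
}
```

In the real test, each `cudaMemcpy` call returns a `cudaError_t`; comparing it against `cudaSuccess` inside the timed region's setup (or in a separate untimed run) would show whether every chunk was actually transferred before the timings are compared.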
Tags
<time><cuda><transfer><memcpy>
Title
Dividing up CUDA cudaMemcpy into chunks
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USamkessel
UserOwnerUserId
1. USamkessel
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PODividing up CUDA cudaMemcpy into chunks
 UserUserId
 USJun
 VoteTypeVoteTypeId
 VTFavorite
2. VO
 singulars
 PostPostId
 PODividing up CUDA cudaMemcpy into chunks
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
