<p>There are a few ways to address CPU-GPU communication overhead - I assume that is what you mean by latency, rather than the latency of the transfer itself. Note that I deliberately used the term <em>address</em> instead of <em>reduce</em>, as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I refer only to CUDA, but some features are also available in OpenCL.</p> <p>As you mentioned, <strong>page-locked memory</strong> has the very purpose of increasing transfer bandwidth. Additionally, one can map page-locked host memory to the GPU, a mechanism which enables direct access of the data from the GPU kernel without the need for an additional data transfer. This mechanism is called <strong>zero-copy</strong> transfer, and it is useful when data is read/written only once, accompanied by a substantial amount of computation, and for GPUs with no separate memory (mobile). However, if the kernel accessing the zero-copied data is not strongly compute-bound and therefore the latency of data access cannot be hidden, page-locked but not mapped memory will be more efficient. Additionally, if the data does not fit into the GPU memory, zero-copy will still work.<br> Note that an excessive amount of page-locked memory can cause serious slowdown on the CPU side. </p> <p>Approaching the problem from a different angle, as tkerwin mentioned, <strong>asynchronous transfer</strong> (with respect to the CPU thread talking to the GPU) is the key to hiding CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with <code>cudaMemcpyAsync()</code>, as well as by using zero-copy with asynchronous kernel execution.<br> One can take this even further by using multiple streams to overlap transfer with kernel execution. Note that stream scheduling might need special attention for good overlapping; Tesla and Quadro cards have a dual-DMA engine which enables simultaneous data transfer to and from the GPU. Additionally, with CUDA 4.0 it became easier to use a GPU from multiple CPU threads, so in multi-threaded CPU code each thread can send its own data to the GPU and launch kernels more easily.</p> <p>Finally, <strong><a href="http://code.google.com/p/adsm">GMAC</a></strong> implements an asymmetric shared memory model for CUDA. One of its very interesting features is the coherency models it provides, in particular lazy and rolling update, which enable the transfer of only the data modified on the CPU, in a blocked fashion.<br> For more details see the following paper: <a href="http://code.google.com/p/adsm/downloads/detail?name=adsm-asplos10.pdf">Gelado et al. - An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems</a>.</p>
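The zero-copy mechanism described above can be sketched roughly as follows (the <code>scale</code> kernel and sizes are illustrative placeholders; this is a minimal sketch, not a tuned implementation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel that touches each element exactly once -
// the access pattern where zero-copy tends to pay off.
__global__ void scale(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

int main(void) {
    const int n = 1 << 20;
    float *h_in, *h_out;   // host pointers to page-locked, mapped memory
    float *d_in, *d_out;   // device aliases of the same allocations

    // Enable mapping of pinned host memory into the device address space;
    // this must precede other CUDA runtime calls on older devices.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Obtain device pointers aliasing the pinned host buffers: the kernel
    // then reads/writes host memory directly, with no explicit cudaMemcpy.
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();   // results land directly in h_out

    printf("h_out[1] = %f\n", h_out[1]);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

Note that the kernel's loads go over the PCIe bus, which is why a compute-heavy kernel is needed to hide that access latency.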
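The stream-based overlap of transfer and kernel execution might look like the following sketch (chunk count and the <code>process</code> kernel are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // stand-in for real work
}

int main(void) {
    const int n = 1 << 22, nStreams = 4, chunk = n / nStreams;
    float *h_data, *d_data;

    // Asynchronous copies require page-locked host memory; with pageable
    // memory, cudaMemcpyAsync degrades to a synchronous copy.
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Pipeline the chunks: while one stream computes on its chunk, another
    // can be transferring its input, and on dual-DMA cards a third can
    // simultaneously copy results back to the host.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

All calls issued into the same stream remain ordered, so each chunk's copy-in, kernel, and copy-out still execute in sequence; only work in different streams overlaps.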