<p>There are a few ways to address CPU-GPU communication overhead - I assume that is what you mean by latency, rather than the latency of the transfer itself. Note that I deliberately used the term <em>address</em> instead of <em>reduce</em>, as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I refer only to CUDA, but some features are also available in OpenCL.</p> <p>As you mentioned, <strong>page-locked memory</strong> has the very purpose of increasing transfer bandwidth. Additionally, one can map page-locked host memory to the GPU, a mechanism which enables direct access of the data from the GPU kernel without the need for an additional data transfer. This mechanism is called <strong>zero-copy</strong> transfer, and it is useful when data is read/written only once, accompanied by a substantial amount of computation, and for GPUs with no separate memory (mobile). However, if the kernel accessing the zero-copied data is not strongly compute-bound and therefore the latency of data access cannot be hidden, page-locked but not mapped memory will be more efficient. Additionally, if the data does not fit into the GPU memory, zero-copy will still work.<br> Note that an excessive amount of page-locked memory can cause serious slowdown on the CPU side. </p> <p>Approaching the problem from a different angle, as tkerwin mentioned, <strong>asynchronous transfer</strong> (with respect to the CPU thread talking to the GPU) is the key to hiding CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with <code>cudaMemcpyAsync()</code>, as well as by using zero-copy with asynchronous kernel execution.<br> One can take this even further by using multiple streams to overlap transfer with kernel execution. Note that stream scheduling might need special attention for good overlapping; Tesla and Quadro cards have a dual-DMA engine which enables simultaneous data transfer to and from the GPU. Additionally, with CUDA 4.0 it became easier to use a GPU from multiple CPU threads, so in multi-threaded CPU code each thread can send its own data to the GPU and launch kernels more easily.</p> <p>Finally, <strong><a href="http://code.google.com/p/adsm">GMAC</a></strong> implements an asymmetric shared memory model for CUDA. One of its very interesting features is the coherency models it provides, in particular lazy and rolling update, which enable the transfer of only the data modified on the CPU, in a blocked fashion.<br> For more details see the following paper: <a href="http://code.google.com/p/adsm/downloads/detail?name=adsm-asplos10.pdf">Gelado et al. - An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems</a>.</p>
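The zero-copy mechanism described above can be sketched roughly as follows (the <code>scale</code> kernel and sizes are illustrative placeholders; this is a minimal sketch, not a tuned implementation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel that touches each element exactly once -
// the access pattern where zero-copy tends to pay off.
__global__ void scale(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

int main(void) {
    const int n = 1 << 20;
    float *h_in, *h_out;   // host pointers to page-locked, mapped memory
    float *d_in, *d_out;   // device aliases of the same allocations

    // Enable mapping of pinned host memory into the device address space;
    // this must precede other CUDA runtime calls on older devices.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Obtain device pointers aliasing the pinned host buffers: the kernel
    // then reads/writes host memory directly, with no explicit cudaMemcpy.
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();   // results land directly in h_out

    printf("h_out[1] = %f\n", h_out[1]);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

Note that the kernel's loads go over the PCIe bus, which is why a compute-heavy kernel is needed to hide that access latency.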
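The stream-based overlap of transfer and kernel execution might look like the following sketch (chunk count and the <code>process</code> kernel are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // stand-in for real work
}

int main(void) {
    const int n = 1 << 22, nStreams = 4, chunk = n / nStreams;
    float *h_data, *d_data;

    // Asynchronous copies require page-locked host memory; with pageable
    // memory, cudaMemcpyAsync degrades to a synchronous copy.
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Pipeline the chunks: while one stream computes on its chunk, another
    // can be transferring its input, and on dual-DMA cards a third can
    // simultaneously copy results back to the host.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

All calls issued into the same stream remain ordered, so each chunk's copy-in, kernel, and copy-out still execute in sequence; only work in different streams overlaps.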