<p>Memory transfer is an important factor for the performance of CUDA applications. <a href="http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__MEMORY_g9f93d9600f4504e0d637ceb43c91ebad.html#g9f93d9600f4504e0d637ceb43c91ebad" rel="nofollow"><code>cudaMallocHost</code></a> can do two things:</p>
<ul>
<li>Allocate pinned memory: page-locked host memory that the CUDA runtime can track. If host memory allocated this way is the source or destination of a <code>cudaMemcpy</code>, the runtime can perform an optimized memory transfer.</li>
<li>Allocate mapped memory: also page-locked, but additionally mapped into the CUDA address space so it can be used directly in kernel code. To do this you have to set the <code>cudaDeviceMapHost</code> flag with <a href="http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__DEVICE_gd986d35e3525da7f0fba505f2eb0fab6.html" rel="nofollow"><code>cudaSetDeviceFlags</code></a> before calling any other CUDA function. The size of mapped host memory is not limited by the GPU memory size.</li>
</ul>
<p>I'm not sure about the performance of the latter technique; it could let you overlap computation and communication very nicely.</p>
<p>If your kernel accesses the memory in blocks (i.e. it only needs a section of the data at a time rather than all of it), you can use a multi-buffering scheme with asynchronous transfers via <a href="http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__MEMORY_g732efed5ab5cb184c920a21eb36e8ce4.html#g732efed5ab5cb184c920a21eb36e8ce4" rel="nofollow"><code>cudaMemcpyAsync</code></a>: keep multiple buffers on the GPU and, at the same time, compute on one buffer, transfer another to the host, and transfer a third to the device.</p>
<p>With a <code>cudaDeviceMapHost</code>-style allocation, I believe your assertions about the usage scenario are correct. You do not have to issue an explicit copy, but there will certainly be an implicit copy that you don't see; there's a chance it overlaps nicely with your computation. Note that you may need to synchronize after the kernel call to make sure the kernel has finished and the modified content is in <code>h_p</code>.</p>
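<p>The multi-buffering idea above could be sketched roughly as follows. This is an illustrative assumption, not code from the question: the kernel <code>scale_kernel</code>, the chunk size, and the buffer count are all made up for the example. It combines pinned host memory from <code>cudaMallocHost</code> with <code>cudaMemcpyAsync</code> in per-buffer streams, so the host-to-device copy of one chunk can overlap with compute on another and the device-to-host copy of a third (requires nvcc and a CUDA-capable GPU to run).</p>
<pre><code>#include &lt;cuda_runtime.h&gt;
#include &lt;cstdio&gt;

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(e)); return 1; } } while (0)

// Hypothetical kernel for the sketch: doubles each element of its chunk.
__global__ void scale_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 &lt;&lt; 22;      // total elements (illustrative)
    const int CHUNK = 1 &lt;&lt; 20;  // elements per buffer (illustrative)
    const int NBUF = 3;         // buffers in flight (illustrative)

    // Pinned host memory: this is what makes cudaMemcpyAsync truly asynchronous.
    float *h_p;
    CHECK(cudaMallocHost(&amp;h_p, N * sizeof(float)));
    for (int i = 0; i &lt; N; ++i) h_p[i] = (float)i;

    float *d_buf[NBUF];
    cudaStream_t stream[NBUF];
    for (int b = 0; b &lt; NBUF; ++b) {
        CHECK(cudaMalloc(&amp;d_buf[b], CHUNK * sizeof(float)));
        CHECK(cudaStreamCreate(&amp;stream[b]));
    }

    // Each chunk cycles through the buffers. Operations within one stream are
    // ordered (so reusing a buffer is safe), while copies and kernels issued
    // to different streams can overlap.
    for (int off = 0, c = 0; off &lt; N; off += CHUNK, ++c) {
        int b = c % NBUF;
        int n = (N - off &lt; CHUNK) ? N - off : CHUNK;
        CHECK(cudaMemcpyAsync(d_buf[b], h_p + off, n * sizeof(float),
                              cudaMemcpyHostToDevice, stream[b]));
        scale_kernel&lt;&lt;&lt;(n + 255) / 256, 256, 0, stream[b]&gt;&gt;&gt;(d_buf[b], n);
        CHECK(cudaMemcpyAsync(h_p + off, d_buf[b], n * sizeof(float),
                              cudaMemcpyDeviceToHost, stream[b]));
    }
    // Synchronize so all results are actually back in h_p before we use them.
    CHECK(cudaDeviceSynchronize());

    for (int b = 0; b &lt; NBUF; ++b) {
        CHECK(cudaFree(d_buf[b]));
        CHECK(cudaStreamDestroy(stream[b]));
    }
    CHECK(cudaFreeHost(h_p));
    return 0;
}
</code></pre>
<p>The final <code>cudaDeviceSynchronize</code> is the explicit synchronization mentioned above: without it, the host might read <code>h_p</code> before the asynchronous device-to-host copies have completed.</p>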