1. In the CUDA C Programming Guide, 3.2.4.3: "Such a block has therefore in general two addresses: ...". But soon after that: "The only exception is for pointers ... Unified Virtual Address Space". In 3.2.7, Unified Virtual Address Space, the requirements of a 64-bit process and compute capability 2.0 are easily met nowadays. When using a device-side pointer there is no way for us to tell whether the real implementation is the "good" or the "bad" one you describe. In Nvidia's profiling tool only the kernel execution period is shown; no transfer period is visible, because we never write a buffer explicitly. (The first sketch after these comments shows this setup.)
2. Hi, thanks for your reply. But I don't think OpenCL 2.0 works for my original purpose, and neither does CL_MEM_USE_HOST_PTR nor CL_MEM_ALLOC_HOST_PTR in OpenCL 1.2. I inspected Nvidia's SDK examples and best-practices guide, and realized that all the OpenCL tricks play out before kernel launch, so they don't help with overlapping transfer and computation (CUDA's device-side pointer to host memory does, because the transfer begins after kernel launch, on the kernel's demand).
3. It is a good way to overlap transfer and computation. The explicit copy operation issued from the host side, which can only move data from host memory to device global memory, is avoided. With a device-side pointer, data is transferred between host memory and the multiprocessors directly, so the device can schedule computation and data transfer as it needs, which means the transfer may be hidden. The traditional way is multiple streams (CUDA) or multiple command queues (OpenCL), but that needs explicit scheduling on the host side, which makes the overall code a little ugly/hairy. (Both approaches are sketched below.)
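
To make the device-side-pointer (zero-copy) idea in comments 1 and 3 concrete, here is a minimal CUDA sketch. It assumes a 64-bit process and a device of compute capability 2.0 or higher so that the Unified Virtual Address Space applies; the array size and the scale kernel are placeholders, not anything from the original discussion. The kernel dereferences mapped pinned host memory directly, so reads travel over PCIe on the kernel's demand and the profiler shows only kernel time, with no separate memcpy.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The kernel reads and writes through pointers that alias pinned,
    // mapped host memory; each access crosses PCIe on demand while the
    // kernel runs, so no separate copy appears in the profiler timeline.
    __global__ void scale(const float *in, float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * factor;
    }

    int main()
    {
        const int n = 1 << 20;   // placeholder problem size

        // Allow pinned host allocations to be mapped into the device address space.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        float *h_in, *h_out;
        cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        // Device-side aliases of the host allocations (with UVA the host
        // pointers themselves could be passed to the kernel instead).
        float *d_in, *d_out;
        cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
        cudaDeviceSynchronize();   // h_out is only safe to read after this

        printf("out[42] = %f\n", h_out[42]);
        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

With UVA the pointers returned by cudaHostAlloc could be handed to the kernel directly; cudaHostGetDevicePointer is kept here so the sketch also covers the pre-UVA case.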
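For contrast, below is a sketch of the traditional approach mentioned in comment 3: multiple CUDA streams with explicit asynchronous copies scheduled from the host. The chunking scheme and the number of streams are illustrative assumptions only. Which approach is faster depends on the access pattern: zero-copy tends to win when each element is touched roughly once, while explicit copies into global memory pay off when the data is reused on the device.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(const float *in, float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * factor;
    }

    int main()
    {
        const int n = 1 << 22;          // placeholder problem size
        const int nStreams = 4;         // illustrative stream count
        const int chunk = n / nStreams;

        // Pinned host buffers are required for truly asynchronous copies.
        float *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        // The host slices the work explicitly so that the copy of one chunk
        // can overlap with the kernel working on another chunk.
        for (int s = 0; s < nStreams; ++s) {
            int off = s * chunk;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, 2.0f, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();

        printf("out[42] = %f\n", h_out[42]);
        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_in);
        cudaFree(d_out);
        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }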