
  1. CudaMemcpy of a single bool takes way too long
    <p>I am writing K-means clustering in CUDA and I ran into a very weird problem. I need to set a bool to false at the beginning of each clustering iteration and then read it back. However, the memcpy of the bool takes far too long compared to everything else; see the graph.</p> <p>Yellow (top-most) <strong>~3 ms</strong>: initial cudaMalloc and cudaMemcpy of all data (about 7 MB of floats), 6 arrays in total.</p> <p>Red <strong>~2 ms</strong>: copy of a single boolean</p> <p>Blue <strong>~10 ms</strong>: the clustering itself</p> <p>Green <strong>~4 ms</strong>: parallel reduction, i.e. summation of the per-thread results into one result (about 20 kernel invocations on average)</p> <p>Purple/brown <strong>&lt; 1 ms</strong>: cudaFree</p> <p><img src="https://i.stack.imgur.com/UMavK.png" alt="Graph of times"></p> <p>The code itself looks like this:</p> <pre><code>bool changed = false;
...
float start = sdkGetTimerValue(&amp;stopWatch);
cudaMemcpy(dev_changed, &amp;changed, sizeof(bool), cudaMemcpyHostToDevice);
float end = sdkGetTimerValue(&amp;stopWatch);
...
</code></pre> <p>For time measurement I am using TimerHelper.h from the CUDA samples, which internally uses <code>QueryPerformanceCounter</code>. I checked the time measurement about 10 times because I could not believe it. If I take the cudaMemcpy out, the time drops to nearly 0 (instead of 2 ms).</p> <p>The bizarre part is the comparison of the boolean copy times (red) with the parallel reduction (green): that is 1 memcpy vs. 20 kernel invocations.</p> <p>So I tried to write a kernel that just sets the bool to false:</p> <pre><code>__global__ void SetToFalse(bool* boolean)
{
    boolean[0] = false;
}

void LaunchSetToFalse(bool* boolean)
{
    SetToFalse&lt;&lt;&lt;1, 1&gt;&gt;&gt;(boolean);
}
</code></pre> <p>And then changed the code to:</p> <pre><code>...
float start = sdkGetTimerValue(&amp;stopWatch);
LaunchSetToFalse(dev_changed);
cudaDeviceSynchronize();
float end = sdkGetTimerValue(&amp;stopWatch);
...
</code></pre> <p>But it still takes about ~2 ms (no change).</p> <p>Am I missing something obvious? What makes the bool copy so slow? Also, the green block includes copying the boolean from GPU to CPU, and that accounts for about half of it. I checked the time measurement very carefully and there should not be any error; still, the results are weird.</p> <p>Is there a better way to report a bool from threads? Thanks in advance for any suggestions.</p> <p>EDIT: Here is a bigger chunk of the code:</p> <pre><code>cudaMalloc((void**) &amp;dev_data, data.getRawSize() * sizeof(float));
cudaMalloc((void**) &amp;dev_linearizedClusterCenters, clustersCount * size_dim * sizeof(float));
cudaMalloc((void**) &amp;dev_outClusters, rowsCount * sizeof(byte));
cudaMalloc((void**) &amp;dev_changed, sizeof(bool));
cudaMalloc((void**) &amp;dev_count, numOfThreadBlocks * clustersCount * sizeof(int));
cudaMalloc((void**) &amp;dev_newData, numOfThreadBlocks * clustersCount * size_dim * sizeof(float));

cudaMemcpy(dev_data, data.getDataPointer(), data.getRawSize() * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_outClusters, outClusters, rowsCount * sizeof(byte), cudaMemcpyHostToDevice);
cudaMemcpy(dev_linearizedClusterCenters, inOutLinearizedClusterCenters, clustersCount * size_dim * sizeof(float), cudaMemcpyHostToDevice);
perfRecorder.endInit();

for (int global_count = 0; (global_count &lt; maxIterations) &amp;&amp; changed[0]; global_count++)
{
    perfRecorder.startIteration();

    /* &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; THIS GUY TAKES WAY TOO LONG &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; */
    changed[0] = false;
    cudaMemcpy(dev_changed, changed, sizeof(bool), cudaMemcpyHostToDevice);
    //LaunchSetToFalse(dev_changed);
    //cudaDeviceSynchronize();

    perfRecorder.startIterationCompute();
    LaunchKernel(dev_data, dev_linearizedClusterCenters, dev_outClusters, dev_changed, dev_count,
                 dev_newData, clustersCount, size_dim, rowsCount, numOfThreads, numOfBlocks);
    cudaDeviceSynchronize();
    perfRecorder.endIterationCompute();

    cudaMemcpy(changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);
    LaunchParallelSummationInt(dev_count, numOfThreadBlocks * clustersCount, numOfThreads, parallelReductionIterations);
    LaunchParallelSummationFloat(dev_newData, numOfThreadBlocks * clustersCount * size_dim, numOfThreads, parallelReductionIterations);
    LaunchCountNewClusterCenters(dev_newData, dev_count, clustersCount, size_dim, dev_linearizedClusterCenters, numOfThreads);
    cudaDeviceSynchronize(); // &lt;&lt;&lt;&lt;&lt;&lt;&lt; EDIT 2

    perfRecorder.endIteration();
} // End of for

perfRecorder.startCleanup();
cudaMemcpy(outClusters, dev_outClusters, rowsCount * sizeof(byte), cudaMemcpyDeviceToHost);
cudaFree(dev_data);
cudaFree(dev_linearizedClusterCenters);
cudaFree(dev_outClusters);
cudaFree(dev_changed);
cudaFree(dev_count);
cudaFree(dev_newData);
perfRecorder.endCleanup();
</code></pre> <p>EDIT 2: As @Robert Crovella correctly suggested, I put a sync just before <code>perfRecorder.endIteration();</code>. The graph got better, but the transfer of 1 byte still takes quite a bit of time (~1 ms):</p> <p><img src="https://i.stack.imgur.com/jpmiG.png" alt="Bench after sync"></p> <p>EDIT 3: I am using Windows, and the timing code just saves the number of milliseconds as a float and then subtracts them. I am ignoring the overhead of the stopwatch itself.</p> <pre><code>/****** Code from NVIDIA sample code in TimeHelper.h ******/
inline float StopWatchWin::getTime()
{
    // Return the TOTAL time to date
    float retval = total_time;

    if (running)
    {
        LARGE_INTEGER temp;
        QueryPerformanceCounter((LARGE_INTEGER *) &amp;temp);
        retval += (float)(((double)(temp.QuadPart - start_time.QuadPart)) / freq);
    }

    return retval;
}
</code></pre>
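<p>For what it's worth, the remaining 1-byte traffic can likely be trimmed further. A sketch of two alternatives (not the original code; <code>h_changed</code> is a hypothetical variable name): <code>cudaMemset</code> clears the device-side flag without any host-to-device copy at all, and reading the flag back into page-locked (pinned) host memory from <code>cudaHostAlloc</code> is usually cheaper than a transfer into ordinary pageable memory.</p> <pre><code>// Sketch only; assumes dev_changed was allocated with cudaMalloc(&amp;dev_changed, sizeof(bool)).

// 1) Reset the device-side flag without a host-to-device memcpy (false == 0):
cudaMemset(dev_changed, 0, sizeof(bool));

// 2) Read the flag back into pinned host memory (hypothetical h_changed):
bool* h_changed;
cudaHostAlloc((void**) &amp;h_changed, sizeof(bool), cudaHostAllocDefault);
...
cudaMemcpy(h_changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);
...
cudaFreeHost(h_changed);
</code></pre>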
    1. Linux or Windows? What code (exactly) precedes the `cudaMemcpy`? The `cudaMemcpy` (or equivalently, a kernel) has to wait until all previous CUDA operations in the stream are complete. Are you issuing any async or kernel calls before the `cudaMemcpy`? Unfortunately, the code you've omitted from your description probably matters.
    2. @Robert Crovella - Thanks for the comment, I posted a larger chunk of the code. What you say does make sense, though - could it be that the memcpy actually waits for the `LaunchCountNewClusterCenters` kernel, so that is what I am really measuring? It is still weird, because that operation should be very fast as well (~100 threads doing two divisions).
    3. Yes, that is what is happening. To prove it, place a `cudaDeviceSynchronize();` right before the `perfRecorder.endIteration();` line. If you don't have any synchronization along the way, you may even be waiting for kernels prior to `LaunchCountNewClusterCenters`. Kernel calls are asynchronous. Any timing code that comes immediately after a kernel call, with no intervening synchronization, is likely to produce confusing results. This is especially true when using a host-based timing function like `QueryPerformanceCounter`, instead of `cudaEvent`-based timing.
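The `cudaEvent`-based timing mentioned in the comment could look roughly like this. This is a sketch, not the poster's code; `LaunchKernel` and its argument list are taken from the question, and everything is assumed to run in the default stream:

<pre><code>// Sketch: timing GPU work with CUDA events instead of a host clock.
cudaEvent_t start, stop;
cudaEventCreate(&amp;start);
cudaEventCreate(&amp;stop);

cudaEventRecord(start, 0);   // enqueued in the same stream as the kernel
LaunchKernel(dev_data, dev_linearizedClusterCenters, dev_outClusters, dev_changed, dev_count,
             dev_newData, clustersCount, size_dim, rowsCount, numOfThreads, numOfBlocks);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);  // wait until the work between the events has finished
float ms = 0.0f;
cudaEventElapsedTime(&amp;ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
</code></pre>

Because both events are recorded in the stream, the measured interval covers only the enqueued GPU work, regardless of when the host thread happens to reach the timing calls.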