Why is the call to array_view::synchronize() so slow?
<p>I've started experimenting with C++ AMP. I've created a simple test app just to see what it can do, but the results are quite surprising to me. Consider the following code:</p>
<pre><code>#include &lt;amp.h&gt;
#include &lt;cstdint&gt;
#include &lt;cstdio&gt;
#include &lt;new&gt;
#include "Timer.h"

using namespace concurrency;

int main( int argc, char* argv[] )
{
    uint32_t u32Threads = 16;
    uint32_t u32DataRank = u32Threads * 256;
    uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;

    // Host input: u32DataRank * u32DataRank elements, all set to 1.
    uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];
    for ( uint32_t i = 0; i &lt; u32DataRank * u32DataRank; i++ )
    {
        pu32Data[i] = 1;
    }

    // Host output: one partial sum per thread.
    uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];

    Timer tmr;

    tmr.Start();
    array&lt; uint32_t, 1 &gt; source( u32DataRank * u32DataRank, pu32Data );
    array_view&lt; uint32_t, 1 &gt; sum( u32Threads, pu32Sum );
    printf( "Array&lt;&gt; deep copy time: %.6f\n", tmr.Stop() );

    // Each of the u32Threads threads sums its own contiguous chunk of u32DataSize elements.
    tmr.Start();
    parallel_for_each( sum.extent, [=, &amp;source](index&lt;1&gt; idx) restrict(amp)
    {
        uint32_t u32Sum = 0;
        uint32_t u32Start = idx[0] * u32DataSize;
        uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
        for ( uint32_t i = u32Start; i &lt; u32End; i++ )
        {
            u32Sum += source[i];
        }
        sum[idx] = u32Sum;
    } );
    double dDuration = tmr.Stop();
    printf( "gpu computation time: %.6f\n", dDuration );

    // Copy the results back into pu32Sum.
    tmr.Start();
    sum.synchronize();
    dDuration = tmr.Stop();
    printf( "synchronize time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    // The same computation on the CPU, for comparison.
    tmr.Start();
    for ( uint32_t idx = 0; idx &lt; u32Threads; idx++ )
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = 0; i &lt; u32DataSize; i++ )
        {
            u32Sum += pu32Data[(idx * u32DataSize) + i];
        }
        pu32Sum[idx] = u32Sum;
    }
    dDuration = tmr.Stop();
    printf( "cpu computation time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    delete [] pu32Sum;
    delete [] pu32Data;
    return 0;
}
</code></pre>
<p>Note that <code>Timer</code> is a simple timing class using QueryPerformanceCounter. The output of the code is the following:</p>
<pre><code>Array&lt;&gt; deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
</code></pre>
<p>Why is the call to synchronize() taking so long? Is there a way to get around this? Other than that, the computation performance is amazing, but the synchronize() overhead makes it unusable for me.</p>
<p>It is also possible that I am doing something terribly wrong; if so, please tell me. Thanks in advance.</p>
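<p>The <code>Timer</code> class from "Timer.h" is not shown in the question. For anyone who wants to reproduce the measurements, here is a minimal sketch of what such a class might look like. It assumes only the <code>Start()</code>/<code>Stop()</code> interface used in the code, and it substitutes <code>std::chrono::steady_clock</code> for QueryPerformanceCounter to stay portable, so it is a hypothetical stand-in rather than the original implementation.</p>

```cpp
// Timer.h (hypothetical stand-in; the original class is not shown in the question)
#ifndef TIMER_H
#define TIMER_H

#include <chrono>

// Minimal stopwatch exposing the two calls the test app relies on.
// Assumption: Stop() returns the elapsed time in seconds as a double.
class Timer
{
public:
    // Record the current time as the start of the measured interval.
    void Start()
    {
        m_start = std::chrono::steady_clock::now();
    }

    // Return the seconds elapsed since the most recent Start() call.
    double Stop() const
    {
        const auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double>( end - m_start ).count();
    }

private:
    // Initialised at construction so Stop() is well defined even without Start().
    std::chrono::steady_clock::time_point m_start = std::chrono::steady_clock::now();
};

#endif // TIMER_H
```

<p>With this sketch, <code>tmr.Stop()</code> yields seconds as a <code>double</code>, which matches the <code>%.6f</code> format used in the <code>printf</code> calls.</p>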
 
