  1. Fast(est) way to write a sequence of integers to global memory?
    <p>The task is very simple: writing a sequence of integers to memory.</p> <p>Original code:</p> <pre><code>for (size_t i = 0; i &lt; 1000*1000*1000; ++i) {
    data[i] = i;
}
</code></pre> <p>Parallelized code:</p> <pre><code>size_t stepsize = len / N;
#pragma omp parallel num_threads(N)
{
    int threadIdx = omp_get_thread_num();
    size_t istart = stepsize * threadIdx;
    // The last thread also takes the remainder when len is not a multiple of N.
    size_t iend = (threadIdx == N - 1) ? len : istart + stepsize;
    #pragma simd
    for (size_t i = istart; i &lt; iend; ++i)
        data[i] = i;
}
</code></pre> <p>The performance sucks: it takes <strong>1.6 sec</strong> to write 1G <code>uint64</code> values (8 GB, which works out to 5 GB/sec). With simple parallelization (<code>omp parallel</code>) of the above code the speed increases a bit, but it is still poor: <strong>1.4 sec</strong> with 4 threads and 1.35 sec with 6 threads on an i7 3970.</p> <p>The theoretical memory bandwidth of my rig (<strong>i7 3970/64G DDR3-1600</strong>) is <strong>51.2 GB/sec</strong>. For the above example, the achieved memory bandwidth is only about <strong>1/10</strong> of the theoretical bandwidth, even though the application is pretty much memory-bandwidth-bound.</p> <p>Does anyone know how to improve the code?</p> <p>I have written a lot of memory-bound code on the GPU, and it is pretty easy there to take full advantage of the device memory bandwidth (e.g. 85%+ of the theoretical bandwidth).</p> <p><strong>EDIT:</strong></p> <p>The code is compiled with Intel ICC 13.1 to a 64-bit binary, with maximum optimization (O3), the AVX code path, and auto-vectorization enabled.</p> <p><strong>UPDATE:</strong></p> <p>I tried all the code below (thanks to Paul R) and nothing special happened. I believe the compiler is fully capable of doing this kind of SIMD/vectorization optimization on its own.</p> <p>As for why I want to fill in the numbers, long story short:</p> <p>It is part of a high-performance heterogeneous computing algorithm. On the device side the algorithm is so efficient, and the multi-GPU setup so fast, that the performance bottleneck turns out to be the CPU writing several sequences of numbers to memory.</p> <p>Of course, knowing that the CPU sucks at filling in numbers, I could change my algorithm a bit. For comparison, the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory (<strong>238GB/sec</strong> out of <strong>288GB/sec</strong> on GK110, versus a pathetic <strong>5GB/sec</strong> out of <strong>51.2GB/sec</strong> on the CPU). What makes me wonder is why the CPU is so bad at filling a sequence of numbers here.</p> <p>As for the memory bandwidth of my rig, I believe the 51.2 GB/sec figure is about right: in my <code>memcpy()</code> test, the achieved bandwidth is <strong>80%+</strong> of the theoretical bandwidth (<strong>&gt;40GB/sec</strong>).</p>
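The bandwidth figures quoted in the question can be reproduced with a small sketch like the one below. It is only an illustration, not the asker's benchmark. The helper names (`peak_gbps`, `achieved_gbps`, `timed_fill_seconds`, `time_two_passes`) are hypothetical. The 51.2 GB/sec peak is assumed to come from four DDR3-1600 channels, each moving 8 bytes per transfer. The two-pass timing rests on a further assumption: many operating systems map freshly allocated memory lazily, so the first pass over a new buffer may include page-fault cost that a second pass does not.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>

// Theoretical peak in GB/s: channels * (MT/s) * 8 bytes per 64-bit transfer.
// Assumption: quad-channel DDR3-1600, i.e. peak_gbps(4, 1600.0).
double peak_gbps(int channels, double mega_transfers_per_sec) {
    return channels * mega_transfers_per_sec * 1e6 * 8.0 / 1e9;
}

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for `count` elements of
// `elem_bytes` bytes each, written in `seconds`.
double achieved_gbps(std::size_t count, std::size_t elem_bytes, double seconds) {
    return static_cast<double>(count) * static_cast<double>(elem_bytes) / seconds / 1e9;
}

// The question's loop: write 0, 1, 2, ... into data[0..len).
void fill_sequence(std::uint64_t* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) data[i] = i;
}

// Wall-clock time of one fill pass, in seconds.
double timed_fill_seconds(std::uint64_t* data, std::size_t len) {
    auto t0 = std::chrono::steady_clock::now();
    fill_sequence(data, len);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

struct PassTimes {
    double first_s;   // first pass over a freshly allocated buffer
    double second_s;  // second pass over the same, already touched buffer
};

// Time two consecutive fills of the same buffer (len must be > 0).
PassTimes time_two_passes(std::size_t len) {
    // new[] without an initializer leaves the elements uninitialized,
    // so the first fill is the first touch of these pages.
    std::unique_ptr<std::uint64_t[]> buf(new std::uint64_t[len]);
    PassTimes t;
    t.first_s = timed_fill_seconds(buf.get(), len);
    t.second_s = timed_fill_seconds(buf.get(), len);
    assert(buf[len - 1] == len - 1);  // sanity check on the written data
    return t;
}
```

With these helpers, `peak_gbps(4, 1600.0)` gives the 51.2 GB/sec quoted above, and `achieved_gbps(1000000000, 8, 1.6)` gives the 5 GB/sec measured for the serial loop. Calling `time_two_passes` with a large `len` and comparing `first_s` to `second_s` is one way to check whether first-touch cost accounts for part of the gap. Whether it does on a given system is something to measure, not assume.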