StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>This is a cute problem: with the code as above (gcc4.4, Intel i7) with 4 threads I get</p> <pre><code>OpenMP threads: 4 Time with stack array: 1.696 sec, checksum=1073741824 (must be 1073741824). Time with heap array: 5.413 sec, checksum=1073741824 (must be 1073741824). </code></pre> <p>but if I change the malloc line to</p> <pre><code> int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024); </code></pre> <p>(<strong>Update</strong>: or even</p> <pre><code> int* bins_thread_heap=(int*)malloc(sizeof(int)*16); </code></pre> <p>)</p> <p>then I get </p> <pre><code>OpenMP threads: 4 Time with stack array: 1.578 sec, checksum=1073741824 (must be 1073741824). Time with heap array: 1.574 sec, checksum=1073741824 (must be 1073741824). </code></pre> <p>The problem here is <a href="http://en.wikipedia.org/wiki/False_sharing" rel="noreferrer">false sharing</a>. The default malloc is being very (space-) efficient, and putting the requested small allocations all in one block of memory, next to each other; but since the allocations are so small that multiple fit in the same cache line, that means every time one thread updates its values, it dirties the cache line of the values in neighbouring threads. By making the requested memory large enough, this is no longer an issue.</p> <p>Incidentally, it should be clear why the stack-allocated case does not see this problem; different threads - different stacks - memory far enough appart that false sharing isn't an issue.</p> <p>As a side point -- it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would be a big serial bottleneck; you can use <a href="https://computing.llnl.gov/tutorials/openMP/#REDUCTION" rel="noreferrer">OpenMP reductions</a> to sum the checksum more efficiently</p> <pre><code>#pragma omp parallel reduction(+:checksum) { // Each openmp thread should have a private copy of // bins_thread_heap on the heap: int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024); for (int j=0; j<M; j++) bins_thread_heap[j]=0; #pragma omp for for (int i=0; i<N; i++) { // Accumulating every M-th number in respective array element const int j=i%M; bins_thread_heap[j]++; } for (int j=0; j<M; j++) checksum+=bins_thread_heap[j]; free(bins_thread_heap); } </code></pre>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload