<p>Unfortunately, this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that, besides <code>i</code>, <code>temp</code> is also a shared <em>automatic</em> variable, I will assume that for the rest of this text. It also tells me that you have a pre-Nehalem CPU...</p>

<p>There are two sources of slowdown here: code transformation and cache coherency.</p>

<p>The way parallel regions are implemented is that their code is extracted into separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similar to this:</p>

<pre><code>typedef struct
{
   int i;
   int temp;
} main_omp_fn_0_shared_vars;

void main_omp_fn_0 (void *data)
{
   main_omp_fn_0_shared_vars *vars = data;
   int j, k;

   // compute values of j_min and j_max for this thread

   for (j = j_min; j &lt; j_max; j++)
   {
      for (k = 0; k &lt; 2; k++)
      {
         vars-&gt;temp = vars-&gt;i * j * k;
         if (vars-&gt;i % 2 == 0)
            vars-&gt;temp = 0;
      }
   }
}

int main (void)
{
   int i, temp;
   main_omp_fn_0_shared_vars vars;

   for (i = 0; i &lt; 100; i++)
   {
      vars.i = i;
      vars.temp = temp;

      // This is how GCC implements parallel regions with libgomp

      // Start main_omp_fn_0 in the other threads
      GOMP_parallel_start(main_omp_fn_0, &amp;vars, 0);

      // Start main_omp_fn_0 in the main thread
      main_omp_fn_0(&amp;vars);

      // Wait for the other threads to finish (implicit barrier)
      GOMP_parallel_end();

      i = vars.i;
      temp = vars.temp;
   }
}
</code></pre>

<p>You pay a small penalty for accessing <code>temp</code> and <code>i</code> this way, as their intermediate values cannot be kept in registers but are loaded and stored each time.</p>

<p>The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache-invalidation events. Worse, <code>vars.i</code> and <code>vars.temp</code> are likely to end up in the same cache line, and although <code>vars.i</code> is only read from and <code>vars.temp</code> is only written to, full cache invalidation is likely to occur on each iteration of the inner loop.</p>

<p>Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections, and performance degradation is well expected in that case.</p>
 
