GPGPU for loops: avoid warp divergence & implicit syncthreads
My situation: each thread in a warp operates on its own completely independent and distinct data array. All threads loop over their data array, and the number of loop iterations is different for each thread. (This incurs a cost, I know.)

Within the for loop, each thread needs to save the maximum of three floats it calculates. After the for loop, the threads in the warp will "communicate" by checking the maximum value calculated only by their "neighboring thread" in the *warp* (determined by parity).

**Questions:**

1. If I avoid the conditionals in a "max" operation by doing multiplication, this will avoid warp divergence, right? (See the example code below, and the standalone sketch after it.)
2. The extra multiplication operations mentioned in (1) are worth it, right? I.e. far faster than any sort of warp divergence.
3. The same mechanism that causes warp divergence (one set of instructions for all threads) can be exploited as an implicit "thread barrier" for the *warp* at the end of the for loop (much the same way as an `#pragma omp for` statement in non-GPU computing). Thus I don't need to make a `__syncthreads()` call for a *warp* after the for loop before one thread checks the value saved by another thread, right? (This would be because `__syncthreads()` is only for the "entire GPU", i.e. inter-warp and inter-MP, right?) See also the second sketch after the example code.

Example code:

```
__shared__ int    N_per_data[num_threads_in_warp]; // per-thread lengths, loaded from host
__shared__ float **data;                           // loaded from host

data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    data[j] = new float[N_per_data[j]];
// the values of the jagged matrix "data" are loaded from host

__shared__ float **max_data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    max_data[j] = new float[N_per_data[j]];

for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);

    const int cond_a = (a > b) && (a > c);
    const int cond_b = (b > a) && (b > c);
    const int cond_c = (c > a) && (c > b);

    // avoid if-statements -- questions (1) and (2)
    max_data[threadIdx.x][j] = cond_a * a + cond_b * b + cond_c * c;
}

// Question (3):
// no __syncthreads() necessary in the next line?
// access the data of your mate at some magic position (assume it exists):
float my_neighbors_max_at_7 =
    max_data[threadIdx.x + (int) pow(-1.0, (threadIdx.x % 2) == 1)][7];
```

Before implementing my algorithm on a GPU, I am investigating every aspect of the algorithm to ensure that it will be worth the implementation effort, so please bear with me.
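To make questions (1) and (2) concrete in isolation, here is a minimal, self-contained sketch of just the selection step. `f`, `g` and `h` are hypothetical stand-ins for my real per-element functions, and the `fmaxf` variant is an alternative I am weighing against the multiplication trick, since as far as I know `fmaxf` compiles to max instructions rather than branches:

```
#include <cstdio>
#include <cuda_runtime.h>

// hypothetical stand-ins for the real per-element functions
__device__ float f(float x) { return 0.5f * x; }
__device__ float g(float x) { return x + 1.0f; }
__device__ float h(float x) { return x * x; }

// branchless selection of the largest of three values via 0/1 predicates,
// as in the question -- but with ">=" so that a tie cannot zero the result
__device__ float max3_predicated(float a, float b, float c)
{
    const int cond_a = (a >= b) && (a >= c);
    const int cond_b = !cond_a && (b >= c);
    const int cond_c = !cond_a && !cond_b;
    return cond_a * a + cond_b * b + cond_c * c;
}

// alternative: fmaxf maps to max instructions rather than branches,
// so it does not diverge either
__device__ float max3_fmaxf(float a, float b, float c)
{
    return fmaxf(fmaxf(a, b), c);
}

__global__ void per_element_max(const float *in, float *out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        const float a = f(in[i]);
        const float b = g(in[i]);
        const float c = h(in[i]);
        out[i] = max3_predicated(a, b, c); // or max3_fmaxf(a, b, c)
    }
}

int main()
{
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i) - 32.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    per_element_max<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %f, out[%d] = %f\n", h_out[0], n - 1, h_out[n - 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

(I switched the comparisons to `>=` in the predicated version because, with strict `>`, a tie between the two largest values makes every predicate false and the stored "maximum" becomes 0.)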
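To make question (3) concrete as well: if an explicit barrier does turn out to be necessary, my understanding is that the warp-scope `__syncwarp()` (available since CUDA 9) would be the right granularity rather than `__syncthreads()`. Below is a self-contained sketch of the parity-neighbor read I have in mind, written with a lane-index XOR instead of the `pow(-1, ...)` expression (it gives the same (0,1), (2,3), ... pairing); whether the `__syncwarp()` is strictly required here is exactly what I am asking:

```
#include <cstdio>
#include <cuda_runtime.h>

// each lane computes a value, publishes it to shared memory, and then reads
// the value of its parity neighbor (lane 0 <-> 1, 2 <-> 3, ...)
__global__ void neighbor_exchange(float *out)
{
    __shared__ float max_val[32];        // one slot per lane of the warp

    const unsigned lane = threadIdx.x;   // launched with a single 32-thread warp
    max_val[lane] = 10.0f * lane;        // stand-in for the per-thread loop result

    // explicit warp-level barrier (CUDA 9+); whether this is strictly required
    // is what question (3) asks -- it is the safe default on Volta and newer,
    // where the lanes of a warp are not guaranteed to run in lockstep
    __syncwarp();

    const unsigned neighbor = lane ^ 1u; // same pairing as the pow(-1, ...) trick
    out[lane] = max_val[neighbor];
}

int main()
{
    float *d_out;
    float h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(float));

    neighbor_exchange<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i)
        printf("lane %d sees %f\n", i, h_out[i]);

    cudaFree(d_out);
    return 0;
}
```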
 
