Note that there are some explanatory texts on larger screens.

plurals
  1. POParallelizing nested loops with dependencies
    primarykey
    data
    text
    <p>My question is about how best to structure my (C++) code to support parallelizing a time consuming computation. The pseudocode in question has the following structure:</p> <pre><code>for a_1(y_1) in A_1 for a_2(y_2) in A_2(a_1) ... for a_n(y_n) in A_n(a_1, ..., a_{n-1}) y_n = f(a_1, ..., a_n) y_{n-1} = g_n(Y_n) ... y_1 = g_2(Y_2) . </code></pre> <p>Roughly speaking, each loop iterates over elements in a set <code>A_i</code>, the successive elements of which are dependent on feedback <code>y_i</code> from previous iterations. In other words, to determine the next <code>a_i</code>, we must have finished all computations on the current <code>a_i</code>. Furthermore, interior sets depend on the outer iterations. Written in a recursive form:</p> <pre><code>Iterate(A_i, a_1, ..., a_{i-1}): for a_i(h_i) in A_i Y_i += Iterate(A_{i+1}, a_1, ..., a_i) return g(Y_i) Iterate(any, a_1, ..., a_n): return f(a_1, ..., a_n) Iterate(A_1) </code></pre> <p>Assume that f(...) is a time-consuming computation, and that the feedback functions g(...) are simple (fast). Now, if all the sets <code>A_i</code> are "large", then the problem is embarrassingly parallelizable. Currently, I have a thread pool and just toss the computations of the inner-most loop into the pool. The problem is, very often the inner-most loop is an iteration over a singleton, so the thread pool only ever has one running thread in it. I have thought about using futures to return the values to outer loops, but that would require futures-of-futures, etc. and it gets pretty messy (I think).</p> <p>I realize that the structure I have listed above is pretty complicated, so there are a number of simplifying cases I am also interested in:</p> <ol> <li><code>a_i(h_i) = a_i</code> ; independent of h_i</li> <li><code>A_i(a_1, ..., a_{i-1}) = A_i</code> ; independent of a_1, ... a_{i-1}</li> <li><code>g_i = 0</code> ; independent of H_{i+1}</li> <li>All outer loops are "large"; the number of elements in those sets is much greater than the number of cores.</li> </ol> <p>Now, in practice, n &lt;= 3, and item 1 holds for all outer loops, and items 2-4 all hold, so particular solutions for that case are sufficient. But since I am bothering to ask the question here, I am interested in getting ideas for how to deal with the additional complexity for more general problems if possible.</p> <hr> <p><strong>Edit:</strong></p> <p>Cleaned up the first pseudocode block to make it consistent with the other. Since people cannot understand my mathematical notation, here is a more concrete simple example:</p> <pre><code>#include &lt;cmath&gt; #include &lt;iostream&gt; #include &lt;vector&gt; using namespace std; double f(double a1, int a2, double a3){ // Very slow function cout &lt;&lt; a1 &lt;&lt; ", " &lt;&lt; a2 &lt;&lt; ", " &lt;&lt; a3 &lt;&lt; endl; return pow(a1*a3, a2) + a1 + a2 + a3; // just some contrived example } int g2(const vector&lt;double&gt; &amp;Y3){ // average-ish double sum = 0; for(int i = 0; i &lt; Y3.size(); ++i){ sum += Y3[i]; } return int(sum / (Y3.size() + 1)); } double g1(const vector&lt;int&gt; &amp;Y2){ // return 1/(min(Y2)+1.0) int imin = 0; int minval = 0; for(int i = 1; i &lt; Y2.size(); ++i){ if(Y2[i] &lt; minval){ imin = i; minval = Y2[imin]; } } return 1.0/double(minval+1.0); } int main(){ for(double a1 = 0.0, h1 = 10.0; a1 &lt; 1.0; a1 += h1){ // for a1 in A1 vector&lt;int&gt; Y2; for(int a2 = 2, h2 = 1; a2 &lt;= (int)(5*a1+10); a2 += h2){ // for a2 in A2(a1) vector&lt;double&gt; Y3; for(double a3 = 6.0, h3 = 1.0; a3 &gt;= (a1+a2); a3 -= 0.5*h3){ // for a3 in A2(a1, a2) h3 = f(a1, a2, a3); Y3.push_back(h3); } h2 = g2(Y3); Y2.push_back(h2); } h1 = g1(Y2); } return 0; } </code></pre> <p>I picked the values randomly, and it turns out <code>f</code> is only evaluated 3 times. Note that the above code is NOT parallelizable. <strong>Assume that it is possible to query if a loop's incrementation depends on higher up loops.</strong></p> <p>I should clarify what I'm after as well. When I originally said structure, I perhaps should have said parallelization methodology or something like that. For example, my first attempt to parallelize was to throw the inner-most calls to <code>f</code> into a thread pool and join at the end of the inner-most loop. As mentioned above, this doesn't work when the inner-most loop iterates only over one element. This didn't really require restructuring the existing code significantly, and I would like to avoid it if possible.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload