    <p>To elaborate on my comment, this is the code that is most probably the root of your problem:</p>
    <pre><code>#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
{
    ...
    last = GetSADFloatRel(adapted, &amp;fSamples.at(offset), 4*w+1, 4*w+1, 0);
    if(bestHit &gt; last)
    {
</code></pre>
    <p><code>last</code> is shared here, so another thread can overwrite it between the assignment and the comparison. Since it is always assigned before it is read, it is a good candidate for a <code>lastprivate</code> variable, if you really need the value from the last iteration outside the parallel region. Otherwise just make it <code>private</code>.</p>
    <p>Access to <code>bestHit</code>, <code>cVal</code>, <code>rad</code>, and <code>veneOffset</code> should be synchronised by a critical region:</p>
    <pre><code>#pragma omp critical
if (bestHit &gt; last)
{
    bestHit = last;
    rad = (r+8)*0.25f;
    cVal = c * 2;
    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
    if(fabs(veneOffset) &lt; 0.001)
        veneOffset = 0.0f;
}
</code></pre>
    <p>Note that by default all variables, except the counters of <code>parallel for</code> loops and those defined inside the parallel region, are shared, i.e. the <code>shared</code> clause in your case does nothing unless you also apply the <code>default(none)</code> clause.</p>
    <p>Another thing that you should be aware of is that in 32-bit mode Visual Studio uses x87 FPU math while in 64-bit mode it uses SSE math by default. The x87 FPU does intermediate calculations using 80-bit floating point precision (even for calculations involving <code>float</code> only) while the SSE unit supports only the standard IEEE single and double precisions. Introducing OpenMP or any other parallelisation technique to 32-bit x87 FPU code means that at certain points intermediate values have to be converted back to the single precision of <code>float</code>. If this happens sufficiently many times, a slight or significant difference (depending on the numerical stability of the algorithm) could be observed between the results from the serial code and the parallel one.</p>
    <p>Based on your code, I would suggest that the following modified code would give you good parallel performance because there is no synchronisation at each iteration:</p>
    <pre><code>#pragma omp parallel private(last)
{
    int rBest = 0, kBest = 0, cBest = 0;
    float myBestHit = bestHit;

    #pragma omp for
    for(int r = 0; r &lt; 53; ++r)
    {
        for(int k = 0; k &lt; 3; ++k)
        {
            for(int c = 0; c &lt; 30; ++c)
            {
                for(int o = -1; o &lt;= 1; ++o)
                {
                    /*
                    r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
                    c: 0-29, in steps of 1, representing the absorption value (collagen)
                    k: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
                    o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
                    */

                    int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;

                    if(offset &lt; 0 || offset &gt;= static_cast&lt;int&gt;(fSamples.size()))
                    {
                        continue;
                    }
                    last = GetSADFloatRel(adapted, &amp;fSamples.at(offset), 4*w+1, 4*w+1, 0);
                    if(myBestHit &gt; last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                    last = GetSADFloatRel(input, &amp;fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
                    if(myBestHit &gt; last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                }
            }
        }
    }

    #pragma omp critical
    if (bestHit &gt; myBestHit)
    {
        bestHit = myBestHit;
        rad = (rBest+8)*0.25f;
        cVal = cBest * 2;
        veneOffset = (-0.5f + (1.0f / 3.0f) * kBest + (1.0f / 3.0f) / 2.0f);
        if(fabs(veneOffset) &lt; 0.001)
            veneOffset = 0.0f;
    }
}
</code></pre>
    <p>Each thread only stores the parameter values that give its own best hit, and at the end of the parallel region <code>rad</code>, <code>cVal</code> and <code>veneOffset</code> are computed from the best values. Now there is only one critical region, and each thread enters it just once, at the end of the code. You can get around that one as well, but you would have to introduce additional arrays.</p>