C++ OpenMP slower than serial with default thread count
I am trying to use OpenMP to parallelize some for-loops in my program, but instead of a significant speed-up I observe an actual degradation. My target machine will have 4-6 cores, and I currently rely on the OpenMP runtime to pick the thread count, so I haven't tried any thread-count combinations yet.

- Target/development platform: Windows 64-bit
- Compiler: MinGW64 4.7.2 (rubenvb build)

**Sample output with OpenMP**

```
Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612        // <- returned by omp_get_wtime()
5.627 (sec)  // <- returned by clock()
Wall time elapsed: 5.62703
```

**Sample output without OpenMP**

```
2.415 (sec)  // <- returned by clock()
Wall time elapsed: 2.415
```

**How I measure the time**

```cpp
struct timeval start, end;
gettimeofday(&start, NULL);
#ifdef _OPENMP
double t1 = (double) clock();
double wt = omp_get_wtime();
sim->resetEnvironment(run);
tout << omp_get_wtime() - wt << std::endl;
timeEnd(tout, t1);
#else
double t1 = (double) clock();
sim->resetEnvironment(run);
timeEnd(tout, t1);
#endif
gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
     << ((end.tv_sec - start.tv_sec) * 1000000u
         + (end.tv_usec - start.tv_usec)) / 1.e6
     << std::endl;
```

**The code**

```cpp
void Simulator::resetEnvironment(int run)
{
    #pragma omp parallel
    {
        // (a)
        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
            reset(vector_1[i]);

        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
            reset(vector_2[i]);

        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
            reset(vector_3[i]);

        for (int level = 0; level < level_count; level++) // (b) level = 3
        {
            #pragma omp for schedule(dynamic)
            for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
                reset(vector_4[level][i]);
        }

        #pragma omp for schedule(dynamic)
        for (long i = 0; i < populationSize; i++) // size ~7M
            resetAgent(agents[i]);
    } // end #parallel
} // end: Simulator::resetEnvironment()
```

**Randomness**

Inside the reset() function calls I use an RNG for seeding some agents for subsequent tasks. Below is my RNG implementation; I saw a suggestion to use one RNG per thread for thread safety.

```cpp
class RNG {
public:
    typedef std::mt19937 Engine;

    RNG()
        : real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
        , engines()
#endif
    {
#ifdef _OPENMP
        int threads = std::max(1, omp_get_max_threads());
        for (int seed = 0; seed < threads; ++seed)
            engines.push_back(Engine(seed));
#else
        engine_.seed(time(NULL));
#endif
    } // end_ctor(RNG)

    /** @return next value of the uniform distribution */
    double operator()()
    {
#ifdef _OPENMP
        return real_uni_dist_(engines[omp_get_thread_num()]);
#else
        return real_uni_dist_(engine_);
#endif
    }

private:
    std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
    std::vector<Engine> engines;
#else
    std::mt19937 engine_;
#endif
}; // end_class(RNG)
```

Questions:

- At (a), is it good to not use the shortcut `parallel for`, to avoid the overhead of creating a team for each loop?
- Which part of my implementation can be the cause of the performance degradation?
- Why are the times reported by clock() and omp_get_wtime() so similar? I expected clock() to be somewhat longer than omp_get_wtime().

[Edit]

- At (b), my intention in placing the OpenMP directive on the inner loop is that the outer loop's iteration count is so small (only 3) that I think I can skip it and go directly to the inner loop over vector_4[level]. Is this thought inappropriate, or will this instruct OpenMP to repeat the outer loop 4 times and hence actually loop the inner loop 12 times instead of 3 (say the current thread count is 4)?

Thanks