<p>I didn't actually run your code, but I see an immediate mistake on <code>p</code>, which should be <code>private</code>, not <code>shared</code>. The parallel invocations of <code>qs</code>, <code>qs(v, p.first, p.second);</code>, will race on <code>p</code>, resulting in unpredictable behavior. The local variables in <code>qs</code> should be okay because every thread has its own stack. However, the overall approach is good. You're on the right track.</p>

<hr>

<p>Here are my general comments on implementing parallel quicksort. Quicksort itself is <em>embarrassingly parallel</em>, meaning no synchronization is needed: the recursive calls of <code>qs</code> on a partitioned array are independent of each other.</p>

<p>However, this parallelism is exposed in a <em>recursive</em> form. If you simply use <em>nested</em> parallelism in OpenMP, you will end up with thousands of threads within a second, and no speedup will be gained. So, you mostly need to turn the recursive algorithm into an iterative one and then implement a sort of work queue. This is your approach, and it's not easy.</p>

<p>For your approach, there is a good benchmark: OmpSCR. You can download it at <a href="http://sourceforge.net/projects/ompscr/" rel="noreferrer">http://sourceforge.net/projects/ompscr/</a></p>

<p>The benchmark includes several versions of OpenMP-based quicksort, most of them similar to yours. However, to increase parallelism, one must minimize contention on the global queue (in your code, it's <code>s</code>), so there are possible optimizations such as having per-thread local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.</p>

<hr>

<p>However, you can still directly use recursive parallelism in OpenMP in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's <code>task</code>.</p>

<p>Here is pseudocode for the first approach (based on OmpSCR's benchmark):</p>

<pre><code>void qsort_omp_recursive(int* begin, int* end)
{
  if (begin != end) {
    // Partition
    ...

    // Throttling
    if (...) {
      qsort_omp_recursive(begin, middle);
      qsort_omp_recursive(++middle, ++end);
    } else {
      #pragma omp parallel sections
      {
        #pragma omp section
        qsort_omp_recursive(begin, middle);
        #pragma omp section
        qsort_omp_recursive(++middle, ++end);
      }
    }
  }
}
</code></pre>

<p>In order to run this code, you need to call <code>omp_set_nested(1)</code> and <code>omp_set_num_threads(2)</code>. The code is really simple: we spawn two threads on each division of the work, but insert a simple throttling logic to prevent an excessive number of threads. Note that my experimentation showed decent speedups for this approach.</p>

<hr>

<p>Finally, you may use OpenMP 3.0's <code>task</code>, where a task is a logically concurrent unit of work. In all the OpenMP approaches above, each parallel construct spawns two <em>physical</em> threads; you may say there is a hard 1-to-1 mapping between a task and a worker thread. <code>task</code>, however, separates logical tasks from workers.</p>

<p>Because OpenMP 3.0 is not yet popular, I will use <em>Cilk Plus</em> instead, which is great for expressing this kind of nested and recursive parallelism. In Cilk Plus, the parallelization is extremely easy:</p>

<pre><code>void qsort(int* begin, int* end)
{
  if (begin != end) {
    --end;
    int* middle = std::partition(begin, end,
                      std::bind2nd(std::less&lt;int&gt;(), *end));
    std::swap(*end, *middle);
    cilk_spawn qsort(begin, middle);
    qsort(++middle, ++end);
    // cilk_sync; Only necessary at the final stage.
  }
}
</code></pre>

<p>I copied this code from Cilk Plus' example code. You will see that a single keyword, <code>cilk_spawn</code>, is everything needed to parallelize quicksort. I'm skipping the explanation of Cilk Plus and the spawn keyword, but it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, logical tasks are created, and the Cilk Plus runtime (which implements an efficient work-stealing scheduler) handles all the dirty work: it queues the parallel tasks and maps them to the worker threads.</p>

<p>Note that OpenMP 3.0's <code>task</code> is essentially similar to Cilk Plus's approach. My experimentation shows that pretty nice speedups are feasible: I got a 3–4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus' absolute speedups were greater than those of OpenMP 3.0.</p>

<p>The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel tasks from workload assignment. However, it's very difficult to implement efficiently; for example, you must reduce contention and may need lock-free data structures.</p>
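<p>Since OpenMP 3.0's <code>task</code> is mentioned above but no code is shown for it, here is a minimal sketch of what a task-based quicksort could look like. The function names, the cutoff value, and the use of <code>std::sort</code> below the cutoff are my own assumptions, not part of the original answer or of OmpSCR:</p>

```cpp
#include <algorithm>
#include <utility>

// Assumed cutoff (not from the answer): below it, sorting serially
// avoids the overhead of creating tiny tasks.
static const long CUTOFF = 1024;

static void qsort_task(int* begin, int* end)
{
    if (end - begin < 2)
        return;
    if (end - begin <= CUTOFF) {      // throttle: small ranges go serial
        std::sort(begin, end);
        return;
    }
    --end;                            // use the last element as the pivot
    int* middle = std::partition(begin, end,
                                 [end](int x) { return x < *end; });
    std::swap(*end, *middle);         // put the pivot into its final slot
    #pragma omp task                  // left half becomes a logical task
    qsort_task(begin, middle);
    qsort_task(middle + 1, end + 1);  // right half stays on this thread
    #pragma omp taskwait              // join both halves before returning
}

void qsort_omp_task(int* begin, int* end)
{
    #pragma omp parallel              // create the worker team once
    #pragma omp single nowait         // one thread seeds the root task
    qsort_task(begin, end);
}
```

<p>As with <code>cilk_spawn</code>, the two recursive calls are merely declared logically concurrent; the OpenMP runtime decides how they map onto the worker threads, and the cutoff plays the same throttling role as in the first pseudocode above. Without OpenMP enabled the pragmas are ignored and the code still sorts correctly, just serially.</p>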