StackOverflow2013

<p>First of all, your <strong>algorithm is memory-bandwidth bound</strong>. That is, memory loads and stores outweigh any index calculations you do.</p>
<p>Vector operations like <code>SSE</code>/<code>AVX</code> would not help either, because you are not doing any intensive calculations.</p>
<p>Increasing the amount of work per iteration is also useless: both <code>PPL</code> and <code>TBB</code> are smart enough not to create a thread per iteration. They use a sensible partitioning, which additionally tries to preserve locality. For instance, here is a quote from the documentation of <a href="http://threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_for_func.htm" rel="noreferrer"><code>TBB::parallel_for</code></a>:</p>
<blockquote> <p>When worker threads are available, <code>parallel_for</code> executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, <strong>do expect parallel_for to tend towards operating on consecutive runs of values</strong>. </p> </blockquote>
<p>What really matters is reducing memory operations. Any superfluous traversal of the input or output buffer is poison for performance, so you should try to remove your <code>memset</code>, or do it in parallel too.</p>
<p>You are fully traversing the input and output data. Even if you skip some elements in the output, that doesn't matter, because on modern hardware memory operations happen in 64-byte chunks. 
So, calculate the <code>size</code> of your input and output, measure the <code>time</code> your algorithm takes, divide <code>size</code>/<code>time</code>, and compare the result with the maximum memory bandwidth of your system (which you can measure, for instance, with a <a href="http://www.sisoftware.co.uk/" rel="noreferrer">benchmark</a>).</p>
<p>I ran a test with <code>Microsoft PPL</code>, <code>Intel TBB</code>, <code>OpenMP</code> and a <code>Native for</code> loop. The results are (I used 8x your height):</p>
<pre><code>Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
</code></pre>
<p>With the <code>memset</code> removed:</p>
<pre><code>Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
</code></pre>
<p>As you can see, <code>memset</code> (which is highly optimized) is responsible for a significant amount of the execution time, which shows how memory bound your algorithm is.</p>
<p><a href="http://ideone.com/PU8mMb" rel="noreferrer"><strong>FULL SOURCE CODE</strong></a>:</p>
<pre><code>#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

// Each functor below runs f(i) for every i in [first, last)
// using a different looping mechanism.
struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;

    // Called once per looping strategy; prints the strategy name and,
    // when progress_timer goes out of scope, the elapsed time.
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    // Repeat the whole comparison four times.
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}
</code></pre>
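<p>The <code>size</code>/<code>time</code> check described above can be sketched roughly as follows. This is a minimal sketch, not the code used for the timings: it uses only the standard library (<code>std::chrono</code> instead of <code>boost::progress_timer</code>), times the serial loop only, and the names <code>effective_bandwidth_gbs</code>, <code>convert_bayer_as_is</code> and <code>measure_conversion_gbs</code> are hypothetical helpers introduced for illustration. It assumes the whole output buffer counts as written, following the point about 64-byte chunks.</p>

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a pass that reads
// bytes_read and writes bytes_written in the given number of seconds.
double effective_bandwidth_gbs(std::size_t bytes_read,
                               std::size_t bytes_written,
                               double seconds)
{
    assert(seconds > 0.0);
    return static_cast<double>(bytes_read + bytes_written) / seconds / 1e9;
}

// Serial version of the conversion loop (corresponds to the Native_For case).
void convert_bayer_as_is(const float* bayer, float* rgb,
                         std::size_t width, std::size_t height)
{
    static const int offsets[] = {2, 1, 1, 0};
    for (std::size_t row = 0; row < height; ++row)
    {
        std::size_t bayer_index = row * width;
        for (std::size_t col = 0; col < width; ++col, ++bayer_index)
        {
            const std::size_t offset = (row % 2) * 2 + (col % 2); // 0...3
            rgb[bayer_index * 3 + offsets[offset]] = bayer[bayer_index];
        }
    }
}

// Times one conversion and returns the effective bandwidth in GB/s.
// Assumption: the full output buffer is counted as written, even though
// only one of every three floats is actually stored.
double measure_conversion_gbs(const std::vector<float>& bayer,
                              std::vector<float>& rgb,
                              std::size_t width, std::size_t height)
{
    assert(bayer.size() == width * height);
    assert(rgb.size() == bayer.size() * 3);
    const auto start = std::chrono::steady_clock::now();
    convert_bayer_as_is(bayer.data(), rgb.data(), width, height);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return effective_bandwidth_gbs(bayer.size() * sizeof(float),
                                   rgb.size() * sizeof(float),
                                   elapsed.count());
}
```

<p>As a worked example of the same arithmetic, assuming 4-byte floats: with <code>Width = 3264</code> and <code>Height = 2540*8</code> the input is about 0.27 GB and the output about 0.80 GB, roughly 1.06 GB in total. By this accounting, 0.09 s corresponds to roughly 11.8 GB/s and 0.15 s to roughly 7.1 GB/s, which are the figures to compare against the measured peak bandwidth of the machine.</p>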
