
<p>I think <a href="https://stackoverflow.com/a/16667228/912144">this answer</a> describes the reason sufficiently, but I'll expand a bit here.</p>
<p>First, however, here's <a href="http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/C-Dialect-Options.html#C-Dialect-Options" rel="nofollow noreferrer">gcc 4.8's documentation on <code>-fopenmp</code></a>:</p>
<blockquote> <p><code>-fopenmp</code><br> Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 <a href="http://www.openmp.org/" rel="nofollow noreferrer">http://www.openmp.org/</a>. This option implies -pthread, and thus is only supported on targets that have support for -pthread.</p> </blockquote>
<p>Note that it doesn't mention disabling any features. Indeed, there is no reason for gcc to disable any optimization.</p>
<p>The reason OpenMP with 1 thread has overhead compared to no OpenMP is that the compiler needs to transform the code, adding functions so that it is ready for the case of OpenMP with n &gt; 1 threads. So let's think of a simple example:</p>
<pre><code>int *b = ...
int *c = ...
int a = 0;
int i;

#pragma omp parallel for reduction(+:a)
for (i = 0; i &lt; 100; ++i)
    a += b[i] + c[i];
</code></pre>
<p>This code should be converted to something like this:</p>
<pre><code>struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d-&gt;a = 0;
    for (i = d-&gt;start; i &lt; d-&gt;end; ++i)
        d-&gt;a += d-&gt;b[i] + d-&gt;c[i];

    return NULL;
}

...
for (t = 1; t &lt; nthreads; ++t) {
    /* create_thread with __omp_func1 function */
}
/* for master thread, don't create a thread */
struct __omp_func1_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&amp;md);
a += md.a;
for (t = 1; t &lt; nthreads; ++t) {
    /* join with thread */
    /* add thread_data-&gt;a to a */
}
</code></pre>
<p>Now if we run this with <code>nthreads==1</code>, the code effectively gets reduced to:</p>
<pre><code>struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d-&gt;a = 0;
    for (i = d-&gt;start; i &lt; d-&gt;end; ++i)
        d-&gt;a += d-&gt;b[i] + d-&gt;c[i];

    return NULL;
}

...
struct __omp_func1_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&amp;md);
a += md.a;
</code></pre>
<p>So what are the differences between the no-OpenMP version and the single-threaded OpenMP version?</p>
<p>One difference is that there is extra glue code. The variables that need to be passed to the function created by OpenMP have to be packed together to form one argument. So there is some overhead in preparing for the function call (and later retrieving the data).</p>
<p>More importantly, however, the code is no longer in one piece. Cross-function optimization is not so advanced yet, and most optimizations are done within each function. Smaller functions mean fewer opportunities to optimize.</p>
<hr>
<p>To finish this answer, I'd like to show you exactly how <code>-fopenmp</code> affects <code>gcc</code>'s options.
(Note: I'm on an old computer now, so I have gcc 4.4.3)</p>
<p>Running <code>gcc -Q -v some_file.c</code> gives this (relevant) output:</p>
<pre><code>GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed: -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fstack-protector
options enabled: -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
</code></pre>
<p>and running <code>gcc -Q -v -fopenmp some_file.c</code> gives this (relevant) output:</p>
<pre><code>GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed: -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fopenmp -fstack-protector
options enabled: -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants 
-fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
</code></pre>
<p>Taking a diff, we can see that the only difference is that with <code>-fopenmp</code>, we have <code>-D_REENTRANT</code> defined (and of course <code>-fopenmp</code> enabled). So, rest assured, gcc doesn't switch off any optimization and so wouldn't produce worse code for that reason. It's just that it needs to add preparation code for when the number of threads is greater than 1, and that has some overhead.</p>
<hr>
<p><strong>Update:</strong> I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the same commands with <code>-O3</code> added show the same difference. So, even with <code>-O3</code>, no optimizations are disabled.</p>
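<p>To make the glue-code point concrete, here is a minimal sketch in plain C (no OpenMP needed) that puts the direct loop next to a hand-written outlined version of the same loop. The names <code>sum_direct</code>, <code>sum_outlined</code>, <code>outlined_data</code>, <code>outlined_body</code> and <code>check_equivalence</code> are made up for illustration; this is not what gcc actually emits, only the same shape as the conversion described above.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: everything is visible to the optimizer in one function. */
int sum_direct(const int *b, const int *c, int n)
{
    int a = 0;
    for (int i = 0; i < n; ++i)
        a += b[i] + c[i];
    return a;
}

/* Hypothetical argument bundle, shaped like __omp_func1_data above. */
struct outlined_data {
    int start;
    int end;
    const int *b;
    const int *c;
    int a;
};

/* Hypothetical outlined loop body with a thread-entry style signature. */
void *outlined_body(void *data)
{
    struct outlined_data *d = data;
    d->a = 0;
    for (int i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
    return NULL;
}

/* Single-threaded path: pack the arguments, call, then unpack the result. */
int sum_outlined(const int *b, const int *c, int n)
{
    struct outlined_data md = { .start = 0, .end = n, .b = b, .c = c };
    int a = 0;
    outlined_body(&md);
    a += md.a;
    return a;
}

/* Both versions must agree; only the amount of glue code differs. */
void check_equivalence(const int *b, const int *c, int n)
{
    assert(sum_direct(b, c, n) == sum_outlined(b, c, n));
}
```

<p>Calling <code>check_equivalence(b, c, n)</code> on any two arrays of length <code>n</code> should pass: the results are identical, and the only difference is the packing and unpacking around the call plus the loop body living in a separate function that takes a <code>void *</code>.</p>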