Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>I'm going to go against the general wisdom here that <code>std::copy</code> will have a slight, almost imperceptible performance loss. I just did a test and found that to be untrue: I did notice a performance difference. However, the winner was <code>std::copy</code>.</p> <p>I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That 300 loop counter is enough to completely stabilize my results. I ran the test 5 times each, alternating between the <code>memcpy</code> version and the <code>std::copy</code> version. My code takes advantage of grabbing data in as large of chunks as possible (many other implementations operate with <code>char</code> / <code>char *</code>, whereas I operate with <code>T</code> / <code>T *</code> (where <code>T</code> is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can is central to the performance of my algorithm. These are my results:</p> <p><strong>Time (in seconds) to complete run of SHA-2 tests</strong></p> <pre><code>std::copy memcpy % increase 6.11 6.29 2.86% 6.09 6.28 3.03% 6.10 6.29 3.02% 6.08 6.27 3.03% 6.08 6.27 3.03% </code></pre> <p><strong>Total average increase in speed of std::copy over memcpy: 2.99%</strong></p> <p>My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are <code>-Ofast -march=native -funsafe-loop-optimizations</code>.</p> <p><a href="https://bitbucket.org/davidstone/sha-2/" rel="noreferrer">Code for my SHA-2 implementations.</a></p> <p>I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I'm guessing there was some sort of OS activity going on. I decided to start over.</p> <p>Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.</p> <p>These are my final 10 results:</p> <p><strong>Time (in seconds) to complete run of MD5 tests</strong></p> <pre><code>std::copy memcpy % difference 5.52 5.56 +0.72% 5.56 5.55 -0.18% 5.57 5.53 -0.72% 5.57 5.52 -0.91% 5.56 5.57 +0.18% 5.56 5.57 +0.18% 5.56 5.53 -0.54% 5.53 5.57 +0.72% 5.59 5.57 -0.36% 5.57 5.56 -0.18% </code></pre> <p><strong>Total average decrease in speed of std::copy over memcpy: 0.11%</strong></p> <p><a href="https://bitbucket.org/davidstone/md5" rel="noreferrer">Code for my MD5 implementation</a></p> <p>These results suggest that there is some optimization that std::copy used in my SHA-2 tests that <code>std::copy</code> could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called <code>std::copy</code> / <code>memcpy</code>. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.</p> <p>I did a little bit more testing to see what I could do to make <code>std::copy</code> faster again. The answer turned out to be simple: turn on link time optimization. These are my results with LTO turned on (option -flto in gcc):</p> <p><strong>Time (in seconds) to complete run of MD5 tests with -flto</strong></p> <pre><code>std::copy memcpy % difference 5.54 5.57 +0.54% 5.50 5.53 +0.54% 5.54 5.58 +0.72% 5.50 5.57 +1.26% 5.54 5.58 +0.72% 5.54 5.57 +0.54% 5.54 5.56 +0.36% 5.54 5.58 +0.72% 5.51 5.58 +1.25% 5.54 5.57 +0.54% </code></pre> <p><strong>Total average increase in speed of std::copy over memcpy: 0.72%</strong></p> <p>In summary, there does not appear to be a performance penalty for using <code>std::copy</code>. In fact, there appears to be a performance gain.</p> <p><strong>Explanation of results</strong></p> <p>So why might <code>std::copy</code> give a performance boost?</p> <p>First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. <code>std::copy</code> can (and I suspect all real world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when <code>memcpy</code> is legal, <code>std::copy</code> should perform no worse. The trivial implementation of <code>std::copy</code> that defers to <code>memcpy</code> should meet your compiler's criteria of "always inline this when optimizing for speed or size".</p> <p>However, <code>std::copy</code> also keeps more of its information. When you call <code>std::copy</code>, the function keeps the types intact. <code>memcpy</code> operates on <code>void *</code>, which discards almost all useful information. For instance, if I pass in an array of <code>std::uint64_t</code>, the compiler or library implementer may be able to take advantage of 64-bit alignment with <code>std::copy</code>, but it may be more difficult to do so with <code>memcpy</code>. Many implementations of algorithms like this work by first working on the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If it is all guaranteed to be aligned, then the code becomes simpler and faster, and easier for the branch predictor in your processor to get correct.</p> <p><strong>Premature optimization?</strong></p> <p><code>std::copy</code> is in an interesting position. I expect it to never be slower than <code>memcpy</code> and sometimes faster with any modern optimizing compiler. Moreover, anything that you can <code>memcpy</code>, you can <code>std::copy</code>. <code>memcpy</code> does not allow any overlap in the buffers, whereas <code>std::copy</code> supports overlap in one direction (with <code>std::copy_backward</code> for the other direction of overlap). <code>memcpy</code> only works on pointers, <code>std::copy</code> works on any iterators (<code>std::map</code>, <code>std::vector</code>, <code>std::deque</code>, or my own custom type). In other words, you should just use <code>std::copy</code> when you need to copy chunks of data around.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload