Why is reading lines from stdin much slower in C++ than Python?
<p>I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.</p>
<hr>
<p>(TL;DR answer: include the statement <code>cin.sync_with_stdio(false)</code>, or just use <code>fgets</code> instead.)</p>
<p>(TL;DR results: scroll all the way down to the bottom of my question and look at the table.)</p>
<hr>
<p><strong>C++ code:</strong></p>
<pre><code>#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;time.h&gt;

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    }

    sec = (int) time(NULL) - start;
    cerr &lt;&lt; "Read " &lt;&lt; line_count &lt;&lt; " lines in " &lt;&lt; sec &lt;&lt; " seconds.";
    if (sec &gt; 0) {
        lps = line_count / sec;
        cerr &lt;&lt; " LPS: " &lt;&lt; lps &lt;&lt; endl;
    } else
        cerr &lt;&lt; endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp
</code></pre>
<p><strong>Python equivalent:</strong></p>
<pre><code>#!/usr/bin/env python
import time
import sys

count = 0
start_time = time.time()

for line in sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
# Guard against division by zero when the run takes under a second.
if delta_sec &gt; 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec, lines_per_sec))
</code></pre>
<p><strong>Here are my results:</strong></p>
<pre><code>$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$ cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000
</code></pre>
<p><strong>Edit:</strong> <em>I should note that I tried this both under Mac&nbsp;OS&nbsp;X&nbsp;v10.6.8 (Snow&nbsp;Leopard) and Linux 2.6.32 (Red Hat Linux 6.2).
The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.</em></p>
<p><strong>Edit 2:</strong> <em>(Removed this edit, as no longer applicable)</em></p>
<pre><code>$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP: Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
</code></pre>
<p><strong>Edit 3:</strong></p>
<p>Okay, I tried J.N.'s suggestion of having Python store the line read, but it made no difference to Python's speed.</p>
<p>I also tried J.N.'s suggestion of using <code>scanf</code> into a <code>char</code> array instead of <code>getline</code> into a <code>std::string</code>. Bingo! This resulted in equivalent performance for both Python and C++. (3,333,333 LPS with my input data, which by the way are just short lines of three fields each, usually about 20 characters wide, though sometimes more.)</p>
<p>Code:</p>
<pre><code>char input_a[512];
char input_b[32];
char input_c[512];
while(scanf("%s %s %s\n", input_a, input_b, input_c) != EOF) {
    line_count++;
};
</code></pre>
<p>Speed:</p>
<pre><code>$ cat test_lines | ./readline_test_cpp2
Read 10000000 lines in 3 seconds. LPS: 3333333
$ cat test_lines | ./readline_test2.py
Read 10000000 lines in 3 seconds. LPS: 3333333
</code></pre>
<p>(Yes, I ran it several times.) So, I guess I will now use <code>scanf</code> instead of <code>getline</code>. But I'm still curious whether people think this performance hit from <code>std::string</code>/<code>getline</code> is typical and reasonable.</p>
<p><strong>Edit 4 (was: Final Edit / Solution):</strong></p>
<p>Adding:</p>
<pre><code>cin.sync_with_stdio(false);
</code></pre>
<p>immediately above my original while loop results in code that runs faster than Python.</p>
<p><strong>New performance comparison</strong> (this is on my 2011 MacBook Pro), using the original code, the original with the sync disabled, and the original Python code, respectively, on a file with 20M lines of text. Yes, I ran it several times to eliminate the disk-caching confound.</p>
<pre><code>$ /usr/bin/time cat test_lines_double | ./readline_test_cpp
33.30 real 0.04 user 0.74 sys
Read 20000001 lines in 33 seconds. LPS: 606060
$ /usr/bin/time cat test_lines_double | ./readline_test_cpp1b
3.79 real 0.01 user 0.50 sys
Read 20000000 lines in 4 seconds. LPS: 5000000
$ /usr/bin/time cat test_lines_double | ./readline_test.py
6.88 real 0.01 user 0.38 sys
Read 20000000 lines in 6 seconds. LPS: 3333333
</code></pre>
<p>Thanks to @Vaughn Cato for his answer! <strong><em>Any elaboration people can make or good references people can point to as to why this synchronisation happens, what it means, when it's useful, and when it's okay to disable would be greatly appreciated by posterity.</em></strong> :-)</p>
<p><strong>Edit 5 / Better Solution:</strong></p>
<p>As suggested by Gandalf The Gray below, <code>gets</code> is even faster than <code>scanf</code> or the unsynchronized <code>cin</code> approach.
I also learned that <a href="http://c-faq.com/stdio/scanfprobs.html" rel="noreferrer"><code>scanf</code></a> and <a href="http://c-faq.com/stdio/getsvsfgets.html" rel="noreferrer"><code>gets</code></a> are both UNSAFE and should NOT BE USED due to the potential for buffer overflow. So, I wrote this iteration using <code>fgets</code>, the safer alternative to <code>gets</code>. Here are the pertinent lines for my fellow noobs:</p>
<pre><code>char input_line[MAX_LINE];
char *result;

//&lt;snip&gt;

while((result = fgets(input_line, MAX_LINE, stdin )) != NULL)
    line_count++;
if (ferror(stdin))
    perror("Error reading stdin.");
</code></pre>
<p>Now, here are the results using an even larger file (100M lines; ~3.4&nbsp;GB) on a fast server with very fast disk, comparing the Python code, the unsynchronised <code>cin</code>, and the <code>fgets</code> approaches, as well as comparing with the wc utility. [The <code>scanf</code> version segmentation faulted and I don't feel like troubleshooting it.]:</p>
<pre><code>$ /usr/bin/time cat temp_big_file | readline_test.py
0.03user 2.04system 0:28.06elapsed 7%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 28 seconds. LPS: 3571428

$ /usr/bin/time cat temp_big_file | readline_test_unsync_cin
0.03user 1.64system 0:08.10elapsed 20%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 8 seconds. LPS: 12500000

$ /usr/bin/time cat temp_big_file | readline_test_fgets
0.00user 0.93system 0:07.01elapsed 13%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 7 seconds. LPS: 14285714

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
100000000

Recap (lines per second):
python:         3,571,428
cin (no sync): 12,500,000
fgets:         14,285,714
wc:            54,644,808
</code></pre>
<p>As you can see, <code>fgets</code> is better, but still pretty far from wc performance; I'm pretty sure this is due to the fact that wc examines each character without any memory copying. I suspect that, at this point, other parts of the code will become the bottleneck, so I don't think optimizing to that level would even be worthwhile, even if possible (since, after all, I actually need to store the read lines in memory).</p>
<p>Also note that a small tradeoff with using a <code>char *</code> buffer and <code>fgets</code> vs. unsynchronised <code>cin</code> to string is that the latter can read lines of any length, while the former requires limiting line length to some fixed maximum. In practice, this is probably a non-issue for reading most line-based input files, as the buffer can be set to a very large value that would not be exceeded by valid input.</p>
<p>This has been educational. Thanks to all for your comments and suggestions.</p>
<p><strong>Edit 6:</strong></p>
<p>As suggested by J.F. Sebastian in the comments below, the GNU wc utility uses plain C <code>read()</code> (within the safe-read.c wrapper) to read chunks (of 16k bytes) at a time and count newlines.
Here's a Python equivalent based on J.F.'s code (just showing the relevant snippet that replaces the Python <code>for</code> loop):</p>
<pre><code>from functools import partial

BUFFER_SIZE = 16384
count = sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, BUFFER_SIZE), ''))
</code></pre>
<p>The performance of this version is quite fast (though still a bit slower than the raw C wc utility, of course):</p>
<pre><code>$ /usr/bin/time cat temp_big_file | readline_test3.py
0.01user 1.16system 0:04.74elapsed 24%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 4.7275 seconds. LPS: 21152829
</code></pre>
<p>Again, it's a bit silly for me to compare C++ <code>fgets</code>/<code>cin</code> and the first Python code on the one hand to <code>wc -l</code> and this last Python snippet on the other, as the latter two don't actually store the read lines, but merely count newlines. Still, it's interesting to explore all the different implementations and think about the performance implications. Thanks again!</p>
<p><strong>Edit 7: Tiny benchmark addendum and recap</strong></p>
<p>For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the complete table now:</p>
<pre><code>Implementation              Lines per second
python (default)                   3,571,428
cin (default/naive)                  819,672
cin (no sync)                     12,500,000
fgets                             14,285,714
wc (not fair comparison)          54,644,808
</code></pre>
 
