StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POWhy is splitting a string slower in C++ than Python?
text
Body
copied!<p>I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see <a href="https://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python">this</a>). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).</p> <p><strong>Python Code:</strong></p> <pre><code>#!/usr/bin/env python from __future__ import print_function import time import sys count = 0 start_time = time.time() dummy = None for line in sys.stdin: dummy = line.split() count += 1 delta_sec = int(time.time() - start_time) print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='') if delta_sec > 0: lps = int(count/delta_sec) print(" Crunch Speed: {0}".format(lps)) else: print('') </code></pre> <p><strong>C++ Code:</strong></p> <pre><code>#include <iostream> #include <string> #include <sstream> #include <time.h> #include <vector> using namespace std; void split1(vector<string> &tokens, const string &str, const string &delimiters = " ") { // Skip delimiters at beginning string::size_type lastPos = str.find_first_not_of(delimiters, 0); // Find first non-delimiter string::size_type pos = str.find_first_of(delimiters, lastPos); while (string::npos != pos || string::npos != lastPos) { // Found a token, add it to the vector tokens.push_back(str.substr(lastPos, pos - lastPos)); // Skip delimiters lastPos = str.find_first_not_of(delimiters, pos); // Find next non-delimiter pos = str.find_first_of(delimiters, lastPos); } } void split2(vector<string> &tokens, const string &str, char delim=' ') { stringstream ss(str); //convert string to stream string item; while(getline(ss, item, delim)) { tokens.push_back(item); //add token to vector } } int main() { string input_line; vector<string> spline; long count = 0; int sec, lps; time_t start = time(NULL); cin.sync_with_stdio(false); //disable synchronous IO while(cin) { getline(cin, input_line); spline.clear(); //empty the vector for the next line to parse //I'm trying one of the two implementations, per compilation, obviously: // split1(spline, input_line); split2(spline, input_line); count++; }; count--; //subtract for final over-read sec = (int) time(NULL) - start; cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ; if (sec > 0) { lps = count / sec; cerr << " Crunch speed: " << lps << endl; } else cerr << endl; return 0; //compiled with: g++ -Wall -O3 -o split1 split_1.cpp </code></pre> <p>Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from <a href="http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html" rel="noreferrer">here</a>). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).</p> <p>I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"</p> <p><strong>Results:</strong></p> <pre><code>$ /usr/bin/time cat test_lines_double | ./split.py 15.61 real 0.01 user 0.38 sys Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333 $ /usr/bin/time cat test_lines_double | ./split1 23.50 real 0.01 user 0.46 sys C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565 $ /usr/bin/time cat test_lines_double | ./split2 44.69 real 0.02 user 0.62 sys C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444 </code></pre> <p>What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?</p> <p><strong>Edit 1 / Partial Solution?:</strong></p> <p>I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:</p> <pre><code>for line in sys.stdin: dummy = [] dummy += line.split() count += 1 </code></pre> <p>The performance of python is now about the same as the split1 C++ implementation. </p> <pre><code>/usr/bin/time cat test_lines_double | ./split5.py 22.61 real 0.01 user 0.40 sys Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090 </code></pre> <p>I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.) </p> <p>Thanks to all for your help.</p> <p><strong>Final Edit/Solution:</strong></p> <p>Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):</p> <pre><code>$ /usr/bin/time cat test_lines_double | ./split6 15.09 real 0.01 user 0.45 sys C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333 </code></pre> <p>My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case. </p> <p>One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education. </p> <p>Thanks again to all for your suggestions!</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload