The best choice depends on a few factors. If you only need to scan the tokens once, then `boost::tokenizer` is a good choice in both runtime and space performance (those vectors of tokens can take up a lot of space, depending on the input data).

If you're going to scan the tokens often, or need a vector with efficient random access, then `boost::split` into a vector may be the better option.

For example, in your "A^B^C^...^Z" input string where each token is 1 byte long, the `boost::split`/`vector<string>` method will consume *at least* 2*N-1 bytes. Given the way strings are stored in most STL implementations, you can expect it to take more than 8x that count. Storing these strings in a vector is costly in terms of both memory and time.

I ran a quick test on my machine, and a similar pattern with 10 million tokens looked like this:

- `boost::split` = **2.5s** and **~620MB**
- `boost::tokenizer` = **0.9s** and **0MB**

If you're just doing a one-time scan of the tokens, then the tokenizer is clearly better. But if you're shredding into a structure that you want to reuse during the lifetime of your application, then a vector of tokens may be preferable.

If you want to go the vector route, I'd recommend not a `vector<string>`, but a vector of `string` iterator pairs instead. Just shred into pairs of iterators and keep your big string of tokens around for reference. For example:

```cpp
#include <boost/algorithm/string.hpp>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// `s` is the original "^"-delimited input string; the iterator pairs refer
// back into it, so it must outlive the `tokens` vector.
vector<pair<string::const_iterator, string::const_iterator>> tokens;
boost::split(tokens, s, boost::is_any_of("^"));
for (auto beg = tokens.begin(); beg != tokens.end(); ++beg) {
    cout << string(beg->first, beg->second) << endl;
}
```

This improved version takes **1.6s** and **390MB** on the same server and test. Best of all, the memory overhead of this vector grows linearly with the number of tokens and does not depend in any way on the length of the tokens, whereas a `std::vector<string>` stores a copy of each token.
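For comparison, here is a minimal sketch of the one-pass `boost::tokenizer` approach discussed above; the short input string and the "^" separator are stand-ins for the real data:

```cpp
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main() {
    // Stand-in for the real "^"-delimited input from the question.
    const std::string s = "A^B^C^D";

    // char_separator splits on "^" without building a container of tokens;
    // each token is materialized only while the iterator points at it.
    boost::char_separator<char> sep("^");
    boost::tokenizer<boost::char_separator<char>> tok(s, sep);

    for (const auto& token : tok) {
        std::cout << token << '\n';
    }
}
```

Because the tokenizer produces tokens lazily as you iterate, it never holds storage proportional to the number of tokens, which is consistent with the ~0MB figure in the benchmark above.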
 
