Why are most string manipulations in Java based on regexp?
<p>Java has a number of methods for manipulating Strings. The simplest example is the String.split("something") method.</p>
<p>Many of these methods are defined to take a regular expression as their input parameter(s), which makes them all very powerful building blocks.</p>
<p>There are two effects you'll see with many of those methods:</p>
<ol>
<li>They recompile the expression each time the method is invoked, which costs performance.</li>
<li>In most "real-life" situations I've seen, these methods are called with "fixed" texts. The most common usage of the split method is even worse: it's usually called with a single char (usually a ' ', a ';' or a '&amp;') to split by.</li>
</ol>
<p>So the default methods are not only powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. At home I wrote a test to see how much faster I could make it if the separator was known to be a single char. Both are significantly faster than the "standard" split method.</p>
<p>So I was wondering: why was the Java API designed the way it is now? What was the good reason to go for this instead of having something like split(char), split(String) and splitRegex(String)?</p>
<hr>
<p>Update: I slapped together a few calls to see how much time the various ways of splitting a string would take. 
</p>
<p>Short summary: It makes a <strong>big</strong> difference!</p>
<p>I did 10,000,000 iterations for each test case, always using the input</p>
<pre><code>"aap,noot,mies,wim,zus,jet,teun"
</code></pre>
<p>and always using ',' or "," as the split argument.</p>
<p>This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):</p>
<pre><code>fastSplit STRING
Test  1 : 11405 milliseconds: Split in several pieces
Test  2 :  3018 milliseconds: Split in 2 pieces
Test  3 :  4396 milliseconds: Split in 3 pieces

homegrown fast splitter based on char
Test  4 :  9076 milliseconds: Split in several pieces
Test  5 :  2024 milliseconds: Split in 2 pieces
Test  6 :  2924 milliseconds: Split in 3 pieces

homegrown splitter based on char that always splits in 2 pieces
Test  7 :  1230 milliseconds: Split in 2 pieces

String.split(regex)
Test  8 : 32913 milliseconds: Split in several pieces
Test  9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces

String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces

StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces
</code></pre>
<p>As you can see, it makes a big difference if you have a lot of "fixed char" splits to do.</p>
<p>To give you some context: I'm currently working in the Apache logfiles and Hadoop arena with the data of a <em>big</em> website. So to me this stuff really matters :)</p>
<p>Something I haven't factored in here is the garbage collector. As far as I can tell, compiling a regular expression into a Pattern/Matcher/.. allocates a lot of objects that need to be collected at some point. So perhaps in the long run the difference between these versions is even bigger .... 
or smaller.</p>
<p>My conclusions so far:</p>
<ul>
<li>Only optimize this if you have a LOT of strings to split.</li>
<li>If you use the regex methods, always precompile the pattern when you use the same one repeatedly.</li>
<li>Forget the (obsolete) StringTokenizer.</li>
<li>If you want to split on a single char, use a custom method, especially if you only need to split into a specific number of pieces (like ... 2).</li>
</ul>
<p>P.S. I'm giving you all my homegrown split-by-char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them .. yet. Have fun.</p>
<pre><code>private static String[] stringSplitChar(final String input, final char separator) {
    int pieces = 0;

    // First we count how many pieces we will need to store ( = separators + 1 ).
    // Start at -1 so that a separator at index 0 is counted too.
    int position = -1;
    do {
        pieces++;
        position = input.indexOf(separator, position + 1);
    } while (position != -1);

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (piece &lt; lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    return result;
}

private static String[] stringSplitChar(final String input, final char separator, final int maxpieces) {
    if (maxpieces &lt;= 0) {
        return stringSplitChar(input, separator);
    }
    int pieces = maxpieces;

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (currentposition != -1 &amp;&amp; piece &lt; lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    // All remaining array elements are uninitialized and assumed to be null
    return result;
}

private static String[] stringChop(final String input, final char separator) {
    String[] result;
    // Find the separator.
    final int separatorIndex = input.indexOf(separator);
    if (separatorIndex == -1) {
        result = new String[1];
        result[0] = input;
    } else {
        result = new String[2];
        result[0] = input.substring(0, separatorIndex);
        result[1] = input.substring(separatorIndex + 1);
    }
    return result;
}
</code></pre>
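<p>A minimal sketch of the "precompile the pattern" conclusion above, using only the standard java.util.regex.Pattern API. The class name PrecompiledSplitSketch and its helper methods are hypothetical names chosen for illustration; they are not part of the benchmark code, and no timings are implied:</p>

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
final class PrecompiledSplitSketch {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits on a comma using the precompiled pattern.
    static String[] splitOnComma(final String input) {
        return COMMA.split(input);
    }

    // Builds a reusable pattern that treats the separator as literal text,
    // so characters such as '|' or '.' lose their regex meaning.
    static Pattern literalSeparator(final String separator) {
        return Pattern.compile(Pattern.quote(separator));
    }

    public static void main(final String[] args) {
        final String input = "aap,noot,mies,wim,zus,jet,teun";

        // Split into all pieces.
        System.out.println(Arrays.toString(splitOnComma(input)));

        // Limit of 2: the first piece plus the untouched remainder.
        System.out.println(Arrays.toString(COMMA.split(input, 2)));

        // A separator that would be a metacharacter if left unquoted.
        final Pattern pipe = literalSeparator("|");
        System.out.println(Arrays.toString(pipe.split("a|b|c")));
    }
}
```

<p>Pattern.quote helps when the separator comes from configuration and might contain regex metacharacters. The split still goes through the regex engine, so this is not a substitute for the char-based splitters above in hot loops.</p>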