StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Consider the following (quick and dirty) test:</p> <pre><code>import java.util.ArrayList; import java.util.List; import java.util.Random; import java.util.regex.Matcher; import java.util.regex.Pattern; public class Test3 { // time that tick() was called static long tickTime; // called at start of operation, for timing static void tick () { tickTime = System.nanoTime(); } // called at end of operation, prints message and time since tick(). static void tock (String action) { long mstime = (System.nanoTime() - tickTime) / 1000000; System.out.println(action + ": " + mstime + "ms"); } // generate random strings of form AAAABBBCCCCC; a random // number of characters each randomly repeated. static List<String> generateData (int itemCount) { Random random = new Random(); List<String> items = new ArrayList<String>(); long mean = 0; for (int n = 0; n < itemCount; ++ n) { StringBuilder s = new StringBuilder(); int characters = random.nextInt(7) + 1; for (int k = 0; k < characters; ++ k) { char c = (char)(random.nextInt('Z' - 'A') + 'A'); int rep = random.nextInt(95) + 5; for (int j = 0; j < rep; ++ j) s.append(c); mean += rep; } items.add(s.toString()); } mean /= itemCount; System.out.println("generated data, average length: " + mean); return items; } // match all strings in items to regexStr, do not precompile. static void regexTestUncompiled (List<String> items, String regexStr) { tick(); int matched = 0, unmatched = 0; for (String item:items) { if (item.matches(regexStr)) ++ matched; else ++ unmatched; } tock("uncompiled: regex=" + regexStr + " matched=" + matched + " unmatched=" + unmatched); } // match all strings in items to regexStr, precompile. static void regexTestCompiled (List<String> items, String regexStr) { tick(); Matcher matcher = Pattern.compile(regexStr).matcher(""); int matched = 0, unmatched = 0; for (String item:items) { if (matcher.reset(item).matches()) ++ matched; else ++ unmatched; } tock("compiled: regex=" + regexStr + " matched=" + matched + " unmatched=" + unmatched); } // test all strings in items against regexStr. static void regexTest (List<String> items, String regexStr) { regexTestUncompiled(items, regexStr); regexTestCompiled(items, regexStr); } // generate data and run some basic tests public static void main (String[] args) { List<String> items = generateData(1000000); regexTest(items, "A*"); regexTest(items, "A*B*C*"); regexTest(items, "E*C*W*F*"); } } </code></pre> <p>Strings are random sequences of 1-8 characters with each character occurring 5-100 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.</p> <p>Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generates strings of average length 208 each on my modest 2.3 GHz dual-core i5 was:</p> <pre><code>Regex Uncompiled Precompiled A* 0.564 sec 0.126 sec A*B*C* 1.768 sec 0.238 sec E*C*W*F* 0.795 sec 0.275 sec </code></pre> <p>Actual output:</p> <pre><code>generated data, average length: 208 uncompiled: regex=A* matched=6004 unmatched=993996: 564ms compiled: regex=A* matched=6004 unmatched=993996: 126ms uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms </code></pre> <p>Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.</p> <p>After switching to precompiled expressions, if you still are not meeting your <strong>actual performance requirements</strong>, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.</p> <p>For example, assuming your data set is like my test set above: If your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "A<em>B</em>C*" (presuming your regex's are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload