StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POStrategies for concurrent pipelines in Java
text
Body
copied!<p>Consider the following shell script:</p> <pre><code>gzip -dc in.gz | sed -e 's/@/_at_/g' | gzip -c > out.gz </code></pre> <p>This has three processes working in parallel to decompress a stream, modify it, and re-compress it. Running <code>time</code> I can see my user time is about twice that of my real time, which indicates the program is effectively working in parallel. </p> <p>I've attempted to create the same program in Java by placing each task in it's own thread. Unfortunately, the multithreaded Java program is only about <a href="http://bitbucket.org/brianegge/java-concurrent/src/tip/README" rel="noreferrer">30% faster</a> than the <a href="http://bitbucket.org/brianegge/java-concurrent/src/tip/src/org/egge/concurrent/SerialTest.java" rel="noreferrer">single threaded</a> version for the above sample. I've tried using both an <a href="http://bitbucket.org/brianegge/java-concurrent/src/tip/src/org/egge/concurrent/ParallelExchangerTest.java" rel="noreferrer">Exchanger</a> and a <a href="http://bitbucket.org/brianegge/java-concurrent/src/tip/src/org/egge/concurrent/ConcurrentQueueTest.java" rel="noreferrer">ConcurrentLinkedQueue</a>. The ConcurrentLinkedQueue linked queue causes a lot of contention, although all three threads are generally kept busy. The Exchanger has lower contention, but is more complicated, and the doesn't seem to keep the slowest worker running 100% of the time. </p> <p>I'm trying to figure out a pure Java solution to this problem without looking at one of the byte code weaving frameworks or a JNI based MPI. </p> <p>Most of the concurrency research and APIs concern themselves with <a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm" rel="noreferrer">divide-and-conquer</a> algorithms, giving each node work which is orthogonal and non-dependent on prior calculations. Another approach to concurrency is the pipeline approach, where each worker does some work and passes the data onto the next worker. </p> <p>I'm not trying to find the most efficient way to sed a gzip'd file, but rather I'm looking at how to efficiently break down tasks in a pipeline, in order to reduce the runtime to that of the slowest task.</p> <p>Current timings for a 10m line file are as follows:</p> <pre><code>Testing via shell real 0m31.848s user 0m58.946s sys 0m1.694s Testing SerialTest real 0m59.997s user 0m59.263s sys 0m1.121s Testing ParallelExchangerTest real 0m41.573s user 1m3.436s sys 0m1.830s Testing ConcurrentQueueTest real 0m44.626s user 1m24.231s sys 0m10.856s </code></pre> <p>I'm offering a bounty for a 10% improvement in Java, as measured by real time on a four core system with 10m rows of test data. Current sources are available on <a href="http://bitbucket.org/brianegge/java-concurrent/src/tip/" rel="noreferrer">Bitbucket</a>.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload