Java CPU-intensive application stalls/hangs when increasing the number of workers. Where is the bottleneck, and how can I deduce/monitor it on an Ubuntu server?
I'm running a nightly CPU-intensive Java application on an EC2 server (c1.xlarge), which has eight cores and 7.5 GB RAM, running Linux / [Ubuntu 9.10](https://en.wikipedia.org/wiki/List_of_Ubuntu_releases#Ubuntu_9.10_.28Karmic_Koala.29) (Karmic Koala) 64 bit.

The application is architected in such a way that a variable number of workers are constructed (each in its own thread) and fetch messages from a queue to process them.

Throughput is the main concern here, and performance is measured in processed messages per second. The application is not RAM-bound and, as far as I can tell, not I/O-bound either (I'm no Linux expert, but I'm using dstat to check I/O load, which is pretty low, and CPU wait signals, which are almost non-existent).

I'm seeing the following when spawning different numbers of workers (worker threads):

1. 1 worker: throughput ~1.3 messages/sec/worker
2. 2 workers: throughput ~0.8 messages/sec/worker
3. 3 workers: throughput ~0.5 messages/sec/worker
4. 4 workers: throughput ~0.05 messages/sec/worker

I was expecting a near-linear increase in throughput, but reality proves otherwise.

Three questions:

1. What might be causing the sub-linear scaling when going from one worker to two, and from two to three?
2. What might be causing the (almost) complete halt when going from three workers to four? It looks like some kind of deadlock situation. (Can this happen due to heavy context switching?)
3. How would I start measuring where the problems occur? My development box has two CPUs and runs Windows. I normally attach a GUI profiler and check for threading issues, but the problem only really manifests itself with more than two threads. (See the monitoring sketch at the end of this post for one direction I'm considering.)

Some more background information:

- Workers are spawned using `Executors.newScheduledThreadPool`.

- A worker thread does calculations based on the message (CPU-intensive). Each worker thread contains a separate `persistQueue` used for offloading disk writes (and thus making use of CPU / I/O concurrency):

  ```java
  persistQueue = new ThreadPoolExecutor(1, 1, 100, TimeUnit.MILLISECONDS,
          new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize),
          new ThreadPoolExecutor.AbortPolicy());
  ```

The flow (per worker) goes like this:

1. The worker thread puts the result of a message in `persistQueue` and gets on with processing the next message.
2. The `ThreadPoolExecutor` (of which there is one per worker thread) contains only one thread, which processes all incoming data (waiting in `persistQueue`) and writes it to disk ([Berkeley DB](https://en.wikipedia.org/wiki/Berkeley_DB) + Apache [Lucene](http://en.wikipedia.org/wiki/Lucene)).
3. The idea is that 1. and 2. can run concurrently for the most part, since 1. is CPU-heavy and 2. is I/O-heavy. (A sketch of this per-worker pattern is at the end of this post.)
4. It's possible for `persistQueue` to become full. The queue is bounded because otherwise a slow I/O system might flood the queues and cause [OOM](https://en.wikipedia.org/wiki/Out_of_memory) errors (yes, it's a lot of data). In that case the worker thread pauses until it can write its content to `persistQueue`. A full queue hasn't happened yet on this setup (which is another reason I think the application is definitely not I/O-bound).

The last bits of information:

- Workers are isolated from each other as far as their data is concerned, except:
  - They share some heavily used `static final` maps (used as caches; the maps are memory-intensive, so I can't keep them local to a worker even if I wanted to). The operations workers perform on these caches are iteration, lookup and contains (no writes, deletes, etc.).
  - These shared maps are accessed without synchronization (no need, right?).
  - Workers populate their local data by selecting data from MySQL (based on keys in the received message), so this is a potential bottleneck. However, most of the data are reads, the queried tables are optimized with indexes, and again it's not I/O-bound.
  - I have to admit that I haven't done much MySQL server optimizing yet (in terms of config params), but I just don't think that is the problem.
- Output is written to:
  - Berkeley DB (using a memcached(b) client). All workers share one server.
  - Lucene (using a home-grown low-level indexer). Each worker has a separate indexer.
- Even when output writing is disabled, the problems occur.

This is a huge post, I realize that, but I hope you can give me some pointers as to what this might be, or how to start monitoring / deducing where the problem lies.
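To make the setup above concrete, here is a minimal sketch of the per-worker pattern described: CPU-heavy processing on the worker thread, with disk writes handed off to the single-threaded `persistQueue` executor. The class name, the `String` message type, the queue size, and the back-off on a full queue are illustrative assumptions, not the actual production code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: one worker doing CPU-heavy work on its own
// thread and offloading disk writes to a single-threaded executor.
public class Worker implements Runnable {

    // One writer thread per worker; the bounded queue keeps a slow disk
    // from flooding memory (AbortPolicy throws when the queue is full).
    private final ThreadPoolExecutor persistQueue = new ThreadPoolExecutor(
            1, 1, 100, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(1000),   // maxAsyncQueueSize assumed
            new ThreadPoolExecutor.AbortPolicy());

    private final BlockingQueue<String> messages;      // message type assumed

    public Worker(BlockingQueue<String> messages) {
        this.messages = messages;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String message = messages.take();        // fetch the next message
                final String result = process(message);  // CPU-intensive part
                persist(result);                         // hand off the I/O part
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private String process(String message) {
        return message.toUpperCase();                    // placeholder for the real work
    }

    // Mirrors "the worker pauses until it can write": retry while the
    // bounded queue is full instead of dropping the result.
    private void persist(final String result) throws InterruptedException {
        while (true) {
            try {
                persistQueue.execute(new Runnable() {
                    public void run() {
                        // Berkeley DB / Lucene writes go here in the real application.
                    }
                });
                return;
            } catch (RejectedExecutionException queueFull) {
                Thread.sleep(50);                        // back off until the writer catches up
            }
        }
    }
}
```

The workers themselves are then submitted to the shared pool, e.g. `Executors.newScheduledThreadPool(numWorkers).execute(new Worker(queue));`.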
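As a starting point for question 3, one way to see which threads actually accumulate CPU time and which ones sit blocked, without attaching a GUI profiler to the server, is the standard `java.lang.management` API. This is a minimal sketch; in the real application it could be called periodically from a housekeeping thread, and `jstack <pid>` or `top -H` on the server gives a similar picture from the outside.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: print per-thread CPU time and contention counters.
// Stalled workers show up with little CPU time and high blocked counts.
public class ThreadProbe {

    public static void dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx.isThreadCpuTimeSupported()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        if (mx.isThreadContentionMonitoringSupported()) {
            mx.setThreadContentionMonitoringEnabled(true);
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) {
                continue;                                // thread died in the meantime
            }
            long cpuNanos = mx.getThreadCpuTime(id);     // -1 if measurement is unavailable
            System.out.printf("%-35s state=%-13s cpu=%6d ms blocked=%6d waited=%6d%n",
                    info.getThreadName(),
                    info.getThreadState(),
                    cpuNanos < 0 ? -1 : cpuNanos / 1000000L,
                    info.getBlockedCount(),
                    info.getWaitedCount());
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
```

The idea is simply to compare these counters with one, two, three and four workers and see which threads stop accumulating CPU time and where the blocked/waited counts start to climb.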