StackOverflow2013

<p>The idea of Hadoop is to help you parallelize code execution across many machines. If you have only one machine, I don't think Hadoop is suitable for you. Since you have 2 cores, you can take advantage of Java threads instead.</p>
<p>Another limiting factor is memory. Basically, if you can fetch all records into memory, do that before you start the computation. If you cannot (it seems that the database size exceeds your RAM size), a helper thread can fetch additional records from the database into memory as soon as the computation threads have finished with some records. Below is a sketch of the algorithm:</p>
<ul>
<li>Two Worker threads work in parallel (number of threads = number of CPUs, because the task is compute-intensive).</li>
<li><p>FirstArray: load the 10,000 records into an array or ArrayList. A plain, non-concurrent structure is enough, because both threads read this array but never mutate it. SecondArray is delivered by the DB Thread (see the DB Thread items below). FirstArray is the same for both threads; SecondArray is different for each. You would have nested loops:</p>
<pre><code>for (elem1 : FirstArray) {
    for (elem2 : SecondArray) {
        computeSmth(elem1, elem2);
        if (bestSoFar()) store();
    }
}
</code></pre></li>
</ul>
<p>As soon as a Worker thread is done, it asks the BlockingQueue for the next portion of data, that is, a new SecondArray.</p>
<ul>
<li>The DB Thread (actually a third thread) is responsible for fetching data from the database in batches and populating the arrays that are then processed by the Worker threads.</li>
<li>Suppose 400,000 elements from the second table fit into memory. Let's split that budget into 4 regions.
<ul>
<li>Region 1 holds the elements being processed by the first Worker thread,</li>
<li>Region 2 holds the elements being processed by the second Worker thread,</li>
<li>Region 3 is an array waiting in the BlockingQueue (with a capacity of 1) to be taken by one of the threads,</li>
<li>Region 4 holds the data that has been fetched from the database but cannot be put into the queue yet, because the array already in the queue has not been taken by a Worker thread. The DB Thread blocks until some thread takes the next array from the queue. When a Worker takes a new array, it is done with its previous one, so the previous array can be garbage-collected and you will not run out of memory.</li>
</ul></li>
<li>The queue size can be tuned based on the maximum MySQL batch size, the MySQL retrieval time, and the time a Worker thread spends processing one batch.</li>
<li>The logic of bestSoFar() should be thought out carefully to minimize thread synchronization.</li>
<li>Overall, the algorithm should scale well (each additional CPU should give a near-linear improvement).</li>
</ul>
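<p>The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names <code>BatchedPairSearch</code>, <code>findBest</code>, <code>score</code> and <code>END</code> are hypothetical, an in-memory list of <code>int[]</code> batches stands in for the MySQL fetches, and the absolute difference stands in for <code>computeSmth</code>. Each worker keeps its own local best and the results are merged once at the end, which is one possible way to keep synchronization in bestSoFar() to a minimum.</p>

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchedPairSearch {
    // Hypothetical end-of-data marker; compared by identity, never by content.
    private static final int[] END = new int[0];

    // Stand-in for computeSmth(elem1, elem2): lower is better (assumption).
    static long score(int a, int b) {
        return Math.abs((long) a - b);
    }

    // 'batches' simulates what the DB thread would fetch from the database.
    static long findBest(int[] first, List<int[]> batches, int workers)
            throws InterruptedException {
        // Capacity 1: at most one prepared batch waits while the workers are busy.
        BlockingQueue<int[]> queue = new ArrayBlockingQueue<>(1);

        Thread db = new Thread(() -> {
            try {
                for (int[] batch : batches) {
                    queue.put(batch);           // blocks until a worker takes the previous batch
                }
                for (int i = 0; i < workers; i++) {
                    queue.put(END);             // one end marker per worker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        long[] localBest = new long[workers];   // one slot per worker, no sharing
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            pool[w] = new Thread(() -> {
                long best = Long.MAX_VALUE;
                try {
                    for (int[] second = queue.take(); second != END; second = queue.take()) {
                        for (int e1 : first) {          // read-only, shared by all workers
                            for (int e2 : second) {     // private to this worker
                                best = Math.min(best, score(e1, e2));
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                localBest[id] = best;
            });
        }

        db.start();
        for (Thread t : pool) t.start();
        db.join();
        for (Thread t : pool) t.join();         // join makes the localBest writes visible

        long best = Long.MAX_VALUE;
        for (long b : localBest) best = Math.min(best, b);
        return best;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] first = {1, 50, 100};
        List<int[]> batches = List.of(new int[]{7, 60}, new int[]{98, 200}, new int[]{30});
        System.out.println(findBest(first, batches, 2)); // prints 2 (from the pair 100 and 98)
    }
}
```

<p>In a real setup the DB thread would page through the second table instead of iterating over a list, and the queue capacity would be one of the knobs to tune.</p>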
