StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POEfficient solution for grouping same values in a large dataset
text
Body
copied!<p>At my job I was to develop and implement a solution for the following problem:</p> <p>Given a dataset of 30M records extract (key, value) tuples from the particular dataset field, group them by key and value storing the number of same values for each key. Write top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in a form of serialized XML.</p> <p>I came up with the solution like this (using <a href="http://static.springsource.org/spring-batch/" rel="nofollow noreferrer">Spring-Batch</a>):</p> <p>Batch job steps:</p> <p><strong>Step 1.</strong> Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples dump them on disk. Each tuple goes to a file with the name pattern '/chunk-', thus all values for a specified key are stored in one directory. Within one file values are stored sorted.</p> <p><strong>Step 2.</strong> Iterate over all '' directories and merge their chunk files into one grouping same values. Since the values are stored sorted, it is trivial to merge them for O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.</p> <p><strong>Step 3.</strong> For each merged file (in other words for each key) sequentially read its values using <a href="http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html" rel="nofollow noreferrer">PriorityQueue</a> to maintain top 5000 values without loading all the values into memory. Write queue content to the database.</p> <p>I spent about a week on this task, mainly because I have not worked with Spring-Batch previously and because I tried to make emphasis on scalability that requires accurate implementation of the multi-threading part.</p> <p><strong>The problem is that my manager consider this task way too easy to spend that much time on it.</strong></p> <p>And the question is - <em>do you know more efficient solution or may be less efficient that would be easier to implement? And how much time would you need to implement my solution?</em></p> <p>I am aware about MapReduce-like frameworks, but I can't use them because the application is supposed to be run on a simple PC with 3 cores and 1GB for Java heap.</p> <p>Thank you in advance!</p> <p>UPD: I think I did not stated my question clearly. Let me ask in other way:</p> <p><strong>Given the problem and being the project manager or at least the task reviewer would you accept my solution? And how much time would you dedicate to this task?</strong></p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload