A `HashMap` will be your best bet. In a single, constant-time operation, you can both check for duplication and fetch the appropriate aggregation structure (a `Set` in my code). This means that you can traverse the entire file in O(n). Here's some example code:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public void aggregate() throws Exception {
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));

    // Note the parameter for initial capacity. Use something that is
    // large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);

    String line;
    while ((line = bigFile.readLine()) != null) {
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);

        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);

        // If set is null, then the map hasn't seen these columns before
        if (set == null) {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }

        // At this point we either found the set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();

    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet()) {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");

        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns) {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
```

A few points:

- The initial-capacity parameter for the `HashMap` is important. If the number of entries exceeds the capacity (times the default load factor of 0.75), the structure is rehashed, which is very slow. The default initial capacity is 16, which would cause many rehashes for you. Pick a value that you know is greater than the number of unique combinations of the first four columns.
- If ordered output in the aggregation is important, you can switch the `HashSet` for a `TreeSet` (see the sketch after this list).
- This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the JVM. You can add the JVM arg `-Xmx4096m` to increase the maximum heap size to 4GB. If you don't have at least 4GB, this probably won't work for you.
- This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
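To make the `TreeSet` point concrete, here is a minimal, self-contained sketch of that variant; the class name, keys, and values are made-up placeholders, but the aggregation logic mirrors the method above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class TreeSetAggregationDemo {
    public static void main(String[] args) {
        // Same aggregation as above, but a TreeSet keeps each key's
        // values both de-duplicated and sorted.
        Map<String, TreeSet<String>> map = new HashMap<String, TreeSet<String>>();

        // Hypothetical rows: key = first four columns, value = last column
        add(map, "a\tb\tc\td", "zebra");
        add(map, "a\tb\tc\td", "apple");
        add(map, "a\tb\tc\td", "apple"); // duplicate, silently dropped

        for (Map.Entry<String, TreeSet<String>> entry : map.entrySet()) {
            // Prints the key followed by "apple,zebra" -- sorted order
            System.out.println(entry.getKey() + "=" + String.join(",", entry.getValue()));
        }
    }

    private static void add(Map<String, TreeSet<String>> map, String key, String value) {
        TreeSet<String> set = map.get(key);
        if (set == null) {
            set = new TreeSet<String>(); // sorted set
            map.put(key, set);
        }
        set.add(value);
    }
}
```

The trade-off: `TreeSet` insertion is O(log n) per element instead of the `HashSet`'s O(1), so expect a modest slowdown in exchange for sorted output.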