I think I understand your problem now. For the sake of simplicity, consider an example where we simply add up 1000 GB worth of integers (as you said, addition roughly equates to your fold() operation).

Map/Reduce has two phases: the Map phase operates on a single data item (a key/value pair) and feeds into a Reduce phase where aggregation can occur. Since your whole operation is one huge aggregation, you can use Hadoop's identity [Mapper](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html) together with a [Combiner](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/JobConf.html#setCombinerClass%28java.lang.Class%29) to break the aggregation into several smaller ones.

A Combiner is essentially a Reducer that runs right after the Mapper of your job. Its purpose is to aggregate the data leaving a map node, combining whatever it can so that less data has to be sent over the network to the Reduce nodes.

Here's an example of an addition combiner:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all integer values seen for a key on the map side, before the shuffle.
public class AddCombiner extends Reducer<SomeKey, IntWritable, SomeKey, IntWritable> {

    @Override
    public void reduce(SomeKey key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        // Emit one partial sum per key instead of every individual value.
        context.write(key, new IntWritable(total));
    }
}
```

So you can run a Map/Reduce job on your 1000 GB of input, have Combiners do the first level of aggregation after the Map tasks, and then have a single Reducer that takes the pre-aggregated data from the Combiners and performs one final aggregation into your answer.
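To tie the pieces together, here is a minimal driver sketch that wires up the identity Mapper, the combiner, and a single final Reducer. It is an illustration rather than part of the original answer: the class name AggregationJob is made up, AddCombiner is reused as the final Reducer since the summation logic is identical, SomeKey is the same placeholder key type as above, and it assumes the input format already yields SomeKey/IntWritable pairs (otherwise a small custom Mapper would replace the identity one).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregationJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "1000GB addition");
        job.setJarByClass(AggregationJob.class);

        // Hadoop's base Mapper is the identity mapper: every (key, value)
        // pair is passed through to the combiner/reducer unchanged.
        job.setMapperClass(Mapper.class);

        // The combiner pre-sums values on each map node, shrinking the
        // amount of data shuffled across the network.
        job.setCombinerClass(AddCombiner.class);

        // A single reducer (reusing the same summation logic) takes the
        // partial sums and produces the final answer.
        job.setReducerClass(AddCombiner.class);
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(SomeKey.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would then submit it with input and output paths as arguments, and the combiner takes care of collapsing each map task's output to a handful of partial sums before the lone reducer ever sees it.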
 
