In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
    <p>I am using Hadoop to analyze a very uneven distribution of data. Some keys have thousands of values, but most have only one. For example, network traffic associated with IP addresses would have many packets associated with a few talkative IPs and just a few with most IPs. Another way of saying this is that the <a href="http://en.wikipedia.org/wiki/Gini_index" rel="nofollow">Gini index</a> is very high.</p>
    <p>To process this efficiently, each reducer should either get a few high-volume keys or a lot of low-volume keys, in such a way as to get a roughly even load. I know how I would do this if I were writing the partition process: I would take the sorted list of <code>keys</code> (including all duplicate keys) that was produced by the mappers as well as the number of reducers <code>N</code> and put splits at</p>
    <pre><code>split[i] = keys[floor(i*len(keys)/N)]</code></pre>
    <p>Reducer <code>i</code> would get keys <code>k</code> such that <code>split[i] &lt;= k &lt; split[i+1]</code> for <code>0 &lt;= i &lt; N-1</code> and <code>split[i] &lt;= k</code> for <code>i == N-1</code>.</p>
    <p>I'm willing to write my own partitioner in Java, but the <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/Partitioner.html" rel="nofollow">Partitioner&lt;KEY,VALUE&gt;</a> class only seems to have access to one key-value record at a time, not the whole list. I know that Hadoop sorts the records that were produced by the mappers, so this list must exist somewhere. It might be distributed among several partitioner nodes, in which case I would do the splitting procedure on one of the sublists and somehow communicate the result to all other partitioner nodes. (Assuming that the chosen partitioner node sees a randomized subset, the result would still be approximately load-balanced.) <strong>Does anyone know where the sorted list of keys is stored, and how to access it?</strong></p>
    <p>I don't want to write two map-reduce jobs, one to find the splits and another to actually use them, because that seems wasteful. (The mappers would have to do the same job twice.) This seems like a general problem: uneven distributions are pretty common.</p>
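<p>To make the split rule above concrete, here is a minimal plain-Java sketch of it, run on an in-memory sorted key list. It does not use any Hadoop API. The class name <code>SkewSplits</code>, the method names <code>computeSplits</code> and <code>reducerFor</code>, and the sample IP addresses are all hypothetical and chosen only for illustration. It assumes the full sorted list (with duplicates) is available in one place, which is exactly what the question is unsure about.</p>

```java
import java.util.Arrays;

/** Illustration of the split rule from the question; hypothetical names, no Hadoop API. */
class SkewSplits {

    /** split[i] = keys[floor(i * len(keys) / N)], taken from the sorted key list. */
    static String[] computeSplits(String[] sortedKeys, int numReducers) {
        String[] splits = new String[numReducers];
        for (int i = 0; i < numReducers; i++) {
            // long arithmetic avoids int overflow for very long key lists
            int idx = (int) ((long) i * sortedKeys.length / numReducers);
            splits[i] = sortedKeys[idx];
        }
        return splits;
    }

    /** Largest i with splits[i] <= key, i.e. split[i] <= k < split[i+1]. */
    static int reducerFor(String key, String[] splits) {
        int lo = 0, hi = splits.length - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (splits[mid].compareTo(key) <= 0) {
                ans = mid;      // candidate; look for a later split that still fits
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return ans;
    }

    public static void main(String[] args) {
        // One talkative IP (three records) and six quiet ones (one record each).
        String[] keys = {
            "10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3",
            "10.0.0.4", "10.0.0.5", "10.0.0.6", "10.0.0.7"
        };
        Arrays.sort(keys);
        String[] splits = computeSplits(keys, 3);

        int[] load = new int[3];
        for (String k : keys) {
            load[reducerFor(k, splits)]++;
        }
        System.out.println(Arrays.toString(splits));
        System.out.println(Arrays.toString(load)); // prints [3, 3, 3]
    }
}
```

<p>One property of this rule is worth noting: if a single key accounts for more than <code>1/N</code> of all records, adjacent splits coincide, so the lookup sends that key to the last of the matching reducers and the earlier ones receive nothing. All records for one key still go to a single reducer.</p>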