StackOverflow2013


plurals

PO In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
primarykey
Id
12117088
data
AcceptedAnswerId
0
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2012-08-24T21:47:45.833
FavoriteCount
1
LastActivityDate
2014-03-26T09:47:49.753
LastEditDate
2012-10-10T22:27:38.723
LastEditorUserId
1442874
OwnerUserId
1623645
ParentId
0
PostTypeId
1
Score
7
ViewCount
328
LastEditorDisplayName
text
Body
I am using Hadoop to analyze a very uneven distribution of data. Some keys have thousands of values, but most have only one. For example, network traffic associated with IP addresses would have many packets associated with a few talkative IPs and just a few with most IPs. Another way of saying this is that the <a href="http://en.wikipedia.org/wiki/Gini_index" rel="nofollow">Gini index</a> is very high.
To process this efficiently, each reducer should either get a few high-volume keys or a lot of low-volume keys, in such a way as to get a roughly even load. I know how I would do this if I were writing the partition process: I would take the sorted list of <code>keys</code> (including all duplicate keys) that was produced by the mappers as well as the number of reducers <code>N</code> and put splits at
<pre><code>split[i] = keys[floor(i*len(keys)/N)]</code></pre>
Reducer <code>i</code> would get keys <code>k</code> such that <code>split[i] <= k < split[i+1]</code> for <code>0 <= i < N-1</code> and <code>split[i] <= k</code> for <code>i == N-1</code>.
I'm willing to write my own partitioner in Java, but the <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/Partitioner.html" rel="nofollow">Partitioner<KEY,VALUE></a> class only seems to have access to one key-value record at a time, not the whole list. I know that Hadoop sorts the records that were produced by the mappers, so this list must exist somewhere. It might be distributed among several partitioner nodes, in which case I would do the splitting procedure on one of the sublists and somehow communicate the result to all other partitioner nodes. (Assuming that the chosen partitioner node sees a randomized subset, the result would still be approximately load-balanced.)
Does anyone know where the sorted list of keys is stored, and how to access it? I don't want to write two map-reduce jobs, one to find the splits and another to actually use them, because that seems wasteful. (The mappers would have to do the same job twice.) This seems like a general problem: uneven distributions are pretty common.
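The splitting rule described in the question body can be sketched in plain Java as follows. This is only an illustration of the rule, not a Hadoop <code>Partitioner</code>: the names <code>SplitSketch</code>, <code>computeSplits</code> and <code>reducerFor</code> are hypothetical, keys are assumed to be strings, and the whole sorted key list (duplicates included) is assumed to fit in memory, which is exactly what the real partitioner does not offer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the split rule; not part of any Hadoop API.
class SplitSketch {

    // split[i] = keys[floor(i * len(keys) / N)], where sortedKeys is sorted
    // ascending and still contains every duplicate key.
    static String[] computeSplits(List<String> sortedKeys, int numReducers) {
        int len = sortedKeys.size();
        String[] splits = new String[numReducers];
        for (int i = 0; i < numReducers; i++) {
            // Integer division is floor for non-negative operands;
            // the long cast guards against int overflow on large inputs.
            int idx = (int) ((long) i * len / numReducers);
            splits[i] = sortedKeys.get(idx);
        }
        return splits;
    }

    // Returns the largest i with splits[i] <= key, i.e. the reducer i for which
    // split[i] <= key < split[i+1] (or split[i] <= key when i == N-1).
    // Keys smaller than splits[0] fall back to reducer 0.
    static int reducerFor(String[] splits, String key) {
        int lo = 0, hi = splits.length - 1, answer = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (splits[mid].compareTo(key) <= 0) {
                answer = mid;   // candidate; look for a later split that still fits
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return answer;
    }

    public static void main(String[] args) {
        // One "talkative" key ("a") and several rare ones, already sorted.
        List<String> keys = Arrays.asList("a", "a", "a", "a", "b", "c", "d", "e");
        String[] splits = computeSplits(keys, 4);
        System.out.println("splits = " + Arrays.toString(splits));
        for (String k : new String[] {"a", "b", "c", "d", "e"}) {
            System.out.println(k + " -> reducer " + reducerFor(splits, k));
        }
    }
}
```

One consequence worth noting: when a single key occupies more than <code>len(keys)/N</code> positions, adjacent split points coincide, so under the rule as written some reducer receives an empty range. In the sample data above the heavy key <code>"a"</code> produces two equal split points, and reducer 0 gets nothing.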
Tags
<java><hadoop><mapreduce><partitioning><partitioner>
Title
In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PT Question
UserLastEditorUserId
1. US Chris Gerken
UserOwnerUserId
1. US Jim Pivarski
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PT Answer
2. PO
 singulars
 PostTypePostTypeId
 PT Answer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
3. VO
 singulars
 PostPostId
 PO In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
CommentsPostId
1. This table or related slice is empty.
