
MongoDB unique value aggregation via map reduce
<p>I see plenty of questions on SO about aggregation in MongoDB; however, I have not yet found a complete solution to mine.</p>
<p>Here's an example of my data:</p>
<pre class="lang-js prettyprint-override"><code>{ "fruits" : { "apple" : "red", "orange" : "orange", "plum" : "purple" } }
{ "fruits" : { "apple" : "green", "plum" : "purple" } }
{ "fruits" : { "apple" : "red", "orange" : "yellow", "plum" : "purple" } }
</code></pre>
<p>Now, my goal is to determine the popularity of each color for each fruit, so something like this would be the output collection:</p>
<pre class="lang-js prettyprint-override"><code>{ "_id" : "apple", "values" : { "red" : 2, "green" : 1 } }
{ "_id" : "orange", "values" : { "orange" : 1, "yellow" : 1 } }
{ "_id" : "plum", "values" : { "purple" : 3 } }
</code></pre>
<p>I have tried various M/R functions, and in the end they either don't work or they take exponentially long. In the context of the example (fruit), I have about 1,000 different fruits and 100,000 colors over about 10,000,000 total documents.
My current working M/R is this:</p>
<pre class="lang-js prettyprint-override"><code>map = function() {
    if (!this.fruits) return;
    for (var fruit in this.fruits) {
        emit(fruit, {
            val_array: [
                {value: this.fruits[fruit], count: 1}
            ]
        });
    }
};

reduce = function(key, values) {
    var collection = {
        val_array: []
    };
    var found = false;
    values.forEach(function(map_obj) {
        map_obj.val_array.forEach(function(value_obj) {
            found = false;
            // if the value exists in collection, increment it, else add it
            collection.val_array.forEach(function(coll_obj) {
                if (coll_obj.value == value_obj.value) {
                    // the collection already has this object, increment it
                    coll_obj.count += value_obj.count;
                    found = true;
                    return;
                }
            });
            if (!found) {
                // the collection doesn't have this obj yet, push it
                collection.val_array.push(value_obj);
            }
        });
    });
    return collection;
};
</code></pre>
<p>Now, this does work, and for 100 records it takes just a second or so, but the time increases non-linearly, so 100M records would take a <em>very</em> long time. The problem is that I'm doing a poor man's sub-aggregation in the reduce function with the <code>collection</code> array, which requires me to iterate over both <code>collection</code> and the values from my map function. Now I just need to figure out how to do this efficiently (even if it requires multiple reductions). Any suggestions are welcome!</p>
<p><hr/> <strong>EDIT</strong> For lack of a better place to post it, here's my solution.
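The root of the slowdown above is that the reduce rescans <code>collection.val_array</code> for every incoming value, which is quadratic in the number of distinct colors per fruit. Merging plain <code>{color: count}</code> objects is linear instead. A rough sketch of that merge in standalone JavaScript (the <code>mergeCounts</code> name and sample data are mine, and this runs in node rather than the mongo shell):

```javascript
// Sketch only: merge {color: count} objects linearly, keyed by color.
// mergeCounts and the sample values are illustrative, not part of the post.
function mergeCounts(key, values) {
    var out = {};
    values.forEach(function (v) {
        for (var color in v) {
            out[color] = (out[color] || 0) + v[color];
        }
    });
    return out;
}

// Emulated map output for "apple" across the three example documents:
var mapped = [{ red: 1 }, { green: 1 }, { red: 1 }];
console.log(mergeCounts("apple", mapped)); // { red: 2, green: 1 }
```

Each incoming value is visited exactly once, and the object lookup replaces the inner array scan.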
<hr/> First, I created a file called <code>mr.js</code>:</p>
<pre class="lang-js prettyprint-override"><code>map = function() {
    if (!this.fruits) return;
    var skip_fruits = {
        'Watermelon': 1,
        'Grapefruit': 1,
        'Tomato': 1 // yes, a tomato is a fruit
    };
    for (var fruit in this.fruits) {
        if (skip_fruits[fruit]) continue;
        var obj = {};
        obj[this.fruits[fruit]] = 1;
        emit(fruit, obj);
    }
};

reduce = function(key, values) {
    var out_values = {};
    values.forEach(function(v) {
        for (var k in v) { // iterate values
            if (!out_values[k]) {
                out_values[k] = v[k]; // init missing counter
            } else {
                out_values[k] += v[k];
            }
        }
    });
    return out_values;
};

var in_coll = "fruit_repo";
var out_coll = "fruit_agg_so";

var total_docs = db[in_coll].count();
var page_size = 100000;
var pages = Math.floor(total_docs / page_size);

print('Starting incremental MR job with ' + pages + ' pages');
db[out_coll].drop();

for (var i = 0; i &lt; pages; i++) {
    var skip = page_size * i;
    print("Calculating page limits for " + skip + " - " + (skip + page_size - 1) + "...");
    var start_date = db[in_coll].find({}, {date: 1}).sort({date: 1}).skip(skip).limit(1)[0].date;
    var end_date = db[in_coll].find({}, {date: 1}).sort({date: 1}).skip(skip + page_size - 1).limit(1)[0].date;

    var mr_command = {
        mapreduce: in_coll,
        map: map,
        reduce: reduce,
        out: {reduce: out_coll},
        sort: {date: 1},
        query: {
            date: {
                $gte: start_date,
                $lt: end_date
            }
        },
        limit: (page_size - 1)
    };

    print("Running mapreduce for " + skip + " - " + (skip + page_size - 1));
    db[in_coll].runCommand(mr_command);
}
</code></pre>
<p>That file iterates over my entire collection, incrementally map/reducing 100k docs at a time (sorted by <code>date</code>, which MUST have an index!) and reducing them into a single output collection. It's used like this: <code>mongo db_name mr.js</code>.</p>
<p>Then, after a couple hours, I've got a collection with all the info.
To figure out which fruits have the most colors, I use this from the mongo shell to print out the top 20:</p>
<pre class="lang-js prettyprint-override"><code>// Show the number of possible values per key
var keys = [];
for (var c = db.fruit_agg_so.find(); c.hasNext();) {
    var obj = c.next();
    if (!obj.value) break;
    var len = 0;
    for (var l in obj.value) { len++; }
    keys.push({key: obj['_id'], value: len});
}
keys.sort(function(a, b) {
    if (a.value == b.value) return 0;
    return (a.value &gt; b.value) ? -1 : 1;
});
for (var i = 0; i &lt; 20; i++) {
    print(keys[i].key + ':' + keys[i].value);
}
</code></pre>
<p>The really cool thing about this approach is that since it's incremental, I can work with the output data while the mapreduce is running.</p>
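The shell loop above boils down to "count the keys of each <code>value</code> object, then sort descending". The same logic in standalone JavaScript, run against a hypothetical sample of the output collection (the sample documents are mine, mirroring the fruit example):

```javascript
// Hypothetical sample of documents from the output collection:
var docs = [
    { _id: "apple",  value: { red: 2, green: 1 } },
    { _id: "orange", value: { orange: 1, yellow: 1 } },
    { _id: "plum",   value: { purple: 3 } }
];

// Count distinct colors per fruit, then sort by that count, descending.
var keys = docs.map(function (d) {
    return { key: d._id, value: Object.keys(d.value).length };
});
keys.sort(function (a, b) { return b.value - a.value; });

keys.forEach(function (k) {
    console.log(k.key + ':' + k.value); // apple:2, orange:2, plum:1
});
```

<code>Object.keys(...).length</code> replaces the manual <code>for...in</code> counter, and the numeric comparator replaces the three-way compare; both behave the same on this data.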