
MongoDB unique value aggregation via map reduce
<p>I see plenty of questions on SO about aggregation in MongoDB; however, I have not yet found a complete solution to mine.</p>
<p>Here's an example of my data:</p>
<pre class="lang-js prettyprint-override"><code>{ "fruits" : { "apple" : "red", "orange" : "orange", "plum" : "purple" } }
{ "fruits" : { "apple" : "green", "plum" : "purple" } }
{ "fruits" : { "apple" : "red", "orange" : "yellow", "plum" : "purple" } }
</code></pre>
<p>Now, my goal is to determine the popularity of each color for each fruit, so something like this would be the output collection:</p>
<pre class="lang-js prettyprint-override"><code>{ "_id" : "apple", "values" : { "red" : 2, "green" : 1 } }
{ "_id" : "orange", "values" : { "orange" : 1, "yellow" : 1 } }
{ "_id" : "plum", "values" : { "purple" : 3 } }
</code></pre>
<p>I have tried various M/R functions, and in the end they either don't work or they take exponentially long. In the context of the example (fruit), I have about 1,000 different fruits and 100,000 colors over about 10,000,000 total documents.
My current working M/R is this:</p>
<pre class="lang-js prettyprint-override"><code>map = function() {
    if (!this.fruits) return;
    for (var fruit in this.fruits) {
        emit(fruit, {
            val_array: [
                {value: this.fruits[fruit], count: 1}
            ]
        });
    }
};

reduce = function(key, values) {
    var collection = {
        val_array: []
    };
    var found = false;
    values.forEach(function(map_obj) {
        map_obj.val_array.forEach(function(value_obj) {
            found = false;
            // if the value exists in collection, increment it, else add it
            collection.val_array.forEach(function(coll_obj) {
                if (coll_obj.value == value_obj.value) {
                    // the collection already has this object, increment it
                    coll_obj.count += value_obj.count;
                    found = true;
                    return;
                }
            });
            if (!found) {
                // the collection doesn't have this obj yet, push it
                collection.val_array.push(value_obj);
            }
        });
    });
    return collection;
};
</code></pre>
<p>Now, this does work, and for 100 records it takes just a second or so, but the time increases non-linearly, so 100M records would take a <em>very</em> long time. The problem is that I'm doing a poor man's sub-aggregation in the reduce function with the <code>collection</code> array, which requires me to iterate over both <code>collection</code> and the values from my map function. Now I just need to figure out how to do this efficiently (even if it requires multiple reductions). Any suggestions are welcome!</p>
<p><hr/> <strong>EDIT</strong> For lack of a better place to post it, here's my solution.
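The root of the slowdown above is that the reduce rescans <code>collection.val_array</code> for every incoming value, which is quadratic in the number of distinct colors per fruit. Merging plain <code>{color: count}</code> objects is linear instead. A rough sketch of that merge in standalone JavaScript (the <code>mergeCounts</code> name and sample data are mine, and this runs in node rather than the mongo shell):

```javascript
// Sketch only: merge {color: count} objects linearly, keyed by color.
// mergeCounts and the sample values are illustrative, not part of the post.
function mergeCounts(key, values) {
    var out = {};
    values.forEach(function (v) {
        for (var color in v) {
            out[color] = (out[color] || 0) + v[color];
        }
    });
    return out;
}

// Emulated map output for "apple" across the three example documents:
var mapped = [{ red: 1 }, { green: 1 }, { red: 1 }];
console.log(mergeCounts("apple", mapped)); // { red: 2, green: 1 }
```

Each incoming value is visited exactly once, and the object lookup replaces the inner array scan.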
<hr/> First, I created a file called <code>mr.js</code>:</p>
<pre class="lang-js prettyprint-override"><code>map = function() {
    if (!this.fruits) return;
    var skip_fruits = {
        'Watermelon': 1,
        'Grapefruit': 1,
        'Tomato': 1 // yes, a tomato is a fruit
    };
    for (var fruit in this.fruits) {
        if (skip_fruits[fruit]) continue;
        var obj = {};
        obj[this.fruits[fruit]] = 1;
        emit(fruit, obj);
    }
};

reduce = function(key, values) {
    var out_values = {};
    values.forEach(function(v) {
        for (var k in v) { // iterate values
            if (!out_values[k]) {
                out_values[k] = v[k]; // init missing counter
            } else {
                out_values[k] += v[k];
            }
        }
    });
    return out_values;
};

var in_coll = "fruit_repo";
var out_coll = "fruit_agg_so";

var total_docs = db[in_coll].count();
var page_size = 100000;
var pages = Math.floor(total_docs / page_size);

print('Starting incremental MR job with ' + pages + ' pages');
db[out_coll].drop();

for (var i = 0; i &lt; pages; i++) {
    var skip = page_size * i;
    print("Calculating page limits for " + skip + " - " + (skip + page_size - 1) + "...");
    var start_date = db[in_coll].find({}, {date: 1}).sort({date: 1}).skip(skip).limit(1)[0].date;
    var end_date = db[in_coll].find({}, {date: 1}).sort({date: 1}).skip(skip + page_size - 1).limit(1)[0].date;

    var mr_command = {
        mapreduce: in_coll,
        map: map,
        reduce: reduce,
        out: {reduce: out_coll},
        sort: {date: 1},
        query: {
            date: {
                $gte: start_date,
                $lt: end_date
            }
        },
        limit: (page_size - 1)
    };

    print("Running mapreduce for " + skip + " - " + (skip + page_size - 1));
    db[in_coll].runCommand(mr_command);
}
</code></pre>
<p>That file iterates over my entire collection, incrementally map/reducing 100k docs at a time (sorted by <code>date</code>, which MUST have an index!) and reducing them into a single output collection. It's used like this: <code>mongo db_name mr.js</code>.</p>
<p>Then, after a couple hours, I've got a collection with all the info.
To figure out which fruits have the most colors, I use this from the mongo shell to print out the top 20:</p>
<pre class="lang-js prettyprint-override"><code>// Show the number of possible values per key
var keys = [];
for (var c = db.fruit_agg_so.find(); c.hasNext();) {
    var obj = c.next();
    if (!obj.value) break;
    var len = 0;
    for (var l in obj.value) { len++; }
    keys.push({key: obj['_id'], value: len});
}
keys.sort(function(a, b) {
    if (a.value == b.value) return 0;
    return (a.value &gt; b.value) ? -1 : 1;
});
for (var i = 0; i &lt; 20; i++) {
    print(keys[i].key + ':' + keys[i].value);
}
</code></pre>
<p>The really cool thing about this approach is that since it's incremental, I can work with the output data while the mapreduce is running.</p>
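The shell loop above boils down to "count the keys of each <code>value</code> object, then sort descending". The same logic in standalone JavaScript, run against a hypothetical sample of the output collection (the sample documents are mine, mirroring the fruit example):

```javascript
// Hypothetical sample of documents from the output collection:
var docs = [
    { _id: "apple",  value: { red: 2, green: 1 } },
    { _id: "orange", value: { orange: 1, yellow: 1 } },
    { _id: "plum",   value: { purple: 3 } }
];

// Count distinct colors per fruit, then sort by that count, descending.
var keys = docs.map(function (d) {
    return { key: d._id, value: Object.keys(d.value).length };
});
keys.sort(function (a, b) { return b.value - a.value; });

keys.forEach(function (k) {
    console.log(k.key + ':' + k.value); // apple:2, orange:2, plum:1
});
```

<code>Object.keys(...).length</code> replaces the manual <code>for...in</code> counter, and the numeric comparator replaces the three-way compare; both behave the same on this data.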