This looks like a job for MapReduce...but I just can't figure it out
<p>I've been battling with this for about 2 days now, and any help would be tremendously appreciated. I currently have a very large MongoDB collection (over 100M documents) in the following format:</p>
<pre><code>[_id]
[date]
[score]
[meta1]
[text1]
[text2]
[text3]
[text4]
[meta2]
</code></pre>
<p>This isn't the exact data; I've obfuscated it a little for the purpose of this post, but the schema is identical. And no, the format of the data cannot be changed.</p>
<p>There are a TON of duplicate entries in there: a job runs once a day, adding millions of entries to the database that may have the same data in the text fields but different values for the score, meta1, and meta2 fields. So I need to eliminate the duplicates and shoehorn everything into one collection without duplicate texts:</p>
<p>First, I'm going to concatenate the text fields and hash the result, so that no two documents contain the same text fields (this part is easy and already works).
</p>
<p>Here's where I'm struggling: each document in the resulting collection will hold an array keyed by each unique meta1 value, and each of those entries will in turn hold the dates and scores that match it.</p>
<p>So if I have the following three documents in my collection now:</p>
<pre><code>[_id] =&gt; random mongoid
[date] =&gt; 12092010
[score] =&gt; 3
[meta1] =&gt; somemetadatahere
[text1] =&gt; foo
[text2] =&gt; bar
[text3] =&gt; foo2
[text4] =&gt; bar2
[meta2] =&gt; uniquemeta2data

[_id] =&gt; random mongoid
[date] =&gt; 12092010
[score] =&gt; 5
[meta1] =&gt; othermetadata
[text1] =&gt; foo
[text2] =&gt; bar
[text3] =&gt; foo2
[text4] =&gt; bar2
[meta2] =&gt; uniquemeta2data1

[_id] =&gt; random mongoid
[date] =&gt; 12102010
[score] =&gt; 7
[meta1] =&gt; somemetadatahere (same meta1 as the first document)
[text1] =&gt; foo
[text2] =&gt; bar
[text3] =&gt; foo2
[text4] =&gt; bar2
[meta2] =&gt; uniquemeta2data
</code></pre>
<p>They should be reduced to this collection (indents are nested documents/arrays). The keys in the datas array come from the values of the meta1 field in the original collection:</p>
<pre><code>[_id] =&gt; (md5 hash of all the text fields)
[text1] =&gt; foo
[text2] =&gt; bar
[text3] =&gt; foo2
[text4] =&gt; bar2
[datas]
    [somemetadatahere]
        [meta2] =&gt; uniquemeta2data
        [scores]
            [12092010] =&gt; 3
            [12102010] =&gt; 7
    [othermetadata]
        [meta2] =&gt; uniquemeta2data1
        [scores]
            [12092010] =&gt; 5
</code></pre>
<p>This seems like a perfect use case for a MapReduce job, but I'm having trouble wrapping my head around exactly how to do this.</p>
<p>Is anyone up for the challenge of helping me figure this out?</p>
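<p>To make the target shape concrete, here is a minimal in-memory sketch in Python of the grouping described above. It is not MongoDB's mapReduce API, and it would not scale to 100M documents as written; it only illustrates the key (hash of the text fields) and the merge step a reduce function would have to perform. The names <code>text_hash</code>, <code>merge_documents</code>, and <code>SAMPLE_DOCS</code> are hypothetical, and three choices are assumptions rather than anything stated in the question: a separator is placed between the text fields before hashing, the first meta2 seen for a given meta1 is kept, and a later score for the same meta1 and date overwrites an earlier one.</p>

```python
import hashlib

TEXT_FIELDS = ("text1", "text2", "text3", "text4")

# The three example documents from the question (dates kept as strings,
# since they end up as keys of a nested document).
SAMPLE_DOCS = [
    {"date": "12092010", "score": 3, "meta1": "somemetadatahere",
     "text1": "foo", "text2": "bar", "text3": "foo2", "text4": "bar2",
     "meta2": "uniquemeta2data"},
    {"date": "12092010", "score": 5, "meta1": "othermetadata",
     "text1": "foo", "text2": "bar", "text3": "foo2", "text4": "bar2",
     "meta2": "uniquemeta2data1"},
    {"date": "12102010", "score": 7, "meta1": "somemetadatahere",
     "text1": "foo", "text2": "bar", "text3": "foo2", "text4": "bar2",
     "meta2": "uniquemeta2data"},
]


def text_hash(doc):
    """Return the MD5 hex digest of the document's text fields."""
    # Assumption: a unit-separator character is inserted between fields so
    # that ("ab", "c") and ("a", "bc") do not hash to the same value.
    joined = "\x1f".join(str(doc[field]) for field in TEXT_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def merge_documents(docs):
    """Group documents by text hash and nest scores under meta1, then date."""
    merged = {}
    for doc in docs:
        key = text_hash(doc)
        if key not in merged:
            # First document with this text: create the merged skeleton.
            target = {"_id": key}
            for field in TEXT_FIELDS:
                target[field] = doc[field]
            target["datas"] = {}
            merged[key] = target
        datas = merged[key]["datas"]
        # Assumption: the first meta2 seen for a given meta1 is kept.
        entry = datas.setdefault(
            doc["meta1"], {"meta2": doc["meta2"], "scores": {}}
        )
        # Assumption: a repeated (meta1, date) pair keeps the latest score.
        entry["scores"][doc["date"]] = doc["score"]
    return merged
```

<p>Calling <code>merge_documents(SAMPLE_DOCS)</code> yields a single merged document whose <code>datas</code> entry has one key per meta1 value. In a real MapReduce job the per-document part of the loop would roughly correspond to the map step (emit the hash as the key and a one-entry <code>datas</code> structure as the value), and the merging of <code>datas</code> entries would correspond to the reduce step.</p>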