Note that there are some explanatory texts on larger screens.

plurals
  1. PODistinct sorting and grouping with MongoDB aggregation framework
    primarykey
    data
    text
    <p>I've been toying with MongoDB's aggregation framework quite a bit lately and thought it would be a good solution to a problem I've been trying to wrap my head around.</p> <p>So, say I'm writing discussion board software and I have the following document structure for posts:</p> <pre><code>{ '_id': ObjectId, 'created_at': datetime, 'poster_id': ObjectId, 'discussion_id': ObjectId, 'body': string } </code></pre> <p>And I have the following (simplified) sample documents stored within the <code>posts</code> collection:</p> <pre><code>{ '_id': 1, 'created_at': '2013-08-18 12:00:00', 'poster_id': 1, 'discussion_id': 1, 'body': 'imma potato' } { '_id': 2, 'created_at': '2013-08-18 13:00:00', 'poster_id': 1, 'discussion_id': 1, 'body': 'im still a potato' } { '_id': 3, 'created_at': '2013-08-18 14:00:00', 'poster_id': 2, 'discussion_id': 1, 'body': 'you are definitely a potato' } { '_id': 4, 'created_at': '2013-08-18 15:00:00', 'poster_id': 3, 'discussion_id': 1, 'body': 'Wait... he is potato?' } { '_id': 5, 'created_at': '2013-08-18 16:00:00', 'poster_id': 2, 'discussion_id': 1, 'body': 'Yes! He is potato.' } { '_id': 6, 'created_at': '2013-08-18 16:01:00', 'poster_id': 3, 'discussion_id': 1, 'body': 'IF HE IS POTATO... THEN WHO WAS PHONE!?' } </code></pre> <p>What I am trying to do is return a distinct map of <code>poster_id</code>s with their latest post <code>_id</code> sorted by the latest post in descending order. So, in the end, given the above sample code, the mapping would look very similar to:</p> <pre><code>{ 3:6, 2:5, 1:2 } </code></pre> <p>Here is an example of a method I wrote in Python using pymongo's implementation of the MongoDB aggregation framework:</p> <pre><code>def get_posters_with_latest_post_by_discussion_ids(self, discussion_ids, start=None, end=None, skip=None, limit=None, order=-1): '''Returns a mapping of poster ids to their latest post associated with the given list of discussion_ids. A date range, ordering and paging properties can be applied. ''' pipelines = [] if order: pipelines.append({ '$sort': { 'created_at': order } }) if skip: pipelines.append({ '$skip': skip }) if limit: pipelines.append({ '$limit': limit }) match = { 'discussion_id': { '$in': discussion_ids } } if start and end: match['created_at'] = { '$gte': start, '$lt': end } pipelines.append({ '$match': match }) pipelines.append({ '$project': { 'poster_id': '$poster_id' } }) pipelines.append({ '$group': { '_id': '$poster_id', 'post_id': { '$first': '$_id' } } }) results = self.db.posts.aggregate(pipelines) poster_to_post_map = {} for result in results['result']: poster_to_post_map[result['_id']] = result['post_id'] return poster_to_post_map </code></pre> <p>Now that I have the mapping, I can query the <code>posters</code> and <code>posts</code> collections seperately for the full documents and then mung them together for display. </p> <p>Now, the problem isn't that it doesn't work, it does... kind of. Say I have a much higher volume of posts and I want to page through a list of posters with their latest post. If my page limit is "10 posters per page" and within the resulting 10 documents there exists a single poster with 2, or more, posts, I actually get back fewer than 10 items in my map.</p> <p>For example, I have 10 posts, 1 poster has 3 posts within the initial result. The aggregation framework will then discard the other 2 posts and associate the latest with that user, resulting in a map containing 8 entries, not 10. </p> <p>This is extremely frustrating as I cannot reliably paginate through the results. Nor can I accurately determine whether or not I'm on the last page of results as a set of results may, or may not, return 0 or more matches.</p> <p><strong>What, if anything, am I doing wrong here?</strong></p> <p>What I am trying to accomplish is simple enough and the aggregation framework seems like a perfect fit for my problem.</p> <p>This would be simple enough if it were a stored proc on a traditional relational database, but that's what we sacrifice when we move to schemaless document stores; relationships are managed outside of the context of the database.</p> <p>Anyhow, the code should be pretty easy to follow and I'll answer any questions you might have.</p> <p>Either way, thanks for taking the time to read. :)</p> <p><strong>Edit: SOLVED</strong></p> <p>Here is a gist of the solution for future viewers: <a href="https://gist.github.com/wilhelm-murdoch/6260469" rel="nofollow">https://gist.github.com/wilhelm-murdoch/6260469</a></p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload