StackOverflow2013

plurals

PODistinct sorting and grouping with MongoDB aggregation framework
primarykey
Id
18296156
data
AcceptedAnswerId
18296718
AnswerCount
1
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-08-18T05:55:20.820
FavoriteCount
0
LastActivityDate
2014-05-08T23:21:20.857
LastEditDate
2013-08-18T07:54:22.783
LastEditorUserId
77457
OwnerUserId
77457
ParentId
0
PostTypeId
1
Score
2
ViewCount
2265
LastEditorDisplayName
text
Body
I've been toying with MongoDB's aggregation framework quite a bit lately and thought it would be a good solution to a problem I've been trying to wrap my head around. So, say I'm writing discussion board software and I have the following document structure for posts:

<pre><code>{
    '_id': ObjectId,
    'created_at': datetime,
    'poster_id': ObjectId,
    'discussion_id': ObjectId,
    'body': string
}
</code></pre>

And I have the following (simplified) sample documents stored within the <code>posts</code> collection:

<pre><code>{ '_id': 1, 'created_at': '2013-08-18 12:00:00', 'poster_id': 1, 'discussion_id': 1, 'body': 'imma potato' }
{ '_id': 2, 'created_at': '2013-08-18 13:00:00', 'poster_id': 1, 'discussion_id': 1, 'body': 'im still a potato' }
{ '_id': 3, 'created_at': '2013-08-18 14:00:00', 'poster_id': 2, 'discussion_id': 1, 'body': 'you are definitely a potato' }
{ '_id': 4, 'created_at': '2013-08-18 15:00:00', 'poster_id': 3, 'discussion_id': 1, 'body': 'Wait... he is potato?' }
{ '_id': 5, 'created_at': '2013-08-18 16:00:00', 'poster_id': 2, 'discussion_id': 1, 'body': 'Yes! He is potato.' }
{ '_id': 6, 'created_at': '2013-08-18 16:01:00', 'poster_id': 3, 'discussion_id': 1, 'body': 'IF HE IS POTATO... THEN WHO WAS PHONE!?' }
</code></pre>

What I am trying to do is return a distinct map of <code>poster_id</code>s to their latest post <code>_id</code>, sorted by the latest post in descending order. So, in the end, given the above sample documents, the mapping would look very similar to:

<pre><code>{
    3:6,
    2:5,
    1:2
}
</code></pre>

Here is an example of a method I wrote in Python using pymongo's implementation of the MongoDB aggregation framework:

<pre><code>def get_posters_with_latest_post_by_discussion_ids(self, discussion_ids, start=None, end=None, skip=None, limit=None, order=-1):
    '''Returns a mapping of poster ids to their latest post associated with
    the given list of discussion_ids. A date range, ordering and paging
    properties can be applied.
    '''
    pipelines = []

    if order:
        pipelines.append({ '$sort': { 'created_at': order } })

    if skip:
        pipelines.append({ '$skip': skip })

    if limit:
        pipelines.append({ '$limit': limit })

    match = {
        'discussion_id': {
            '$in': discussion_ids
        }
    }

    if start and end:
        match['created_at'] = {
            '$gte': start,
            '$lt': end
        }

    pipelines.append({ '$match': match })
    pipelines.append({ '$project': { 'poster_id': '$poster_id' } })
    pipelines.append({ '$group': { '_id': '$poster_id', 'post_id': { '$first': '$_id' } } })

    results = self.db.posts.aggregate(pipelines)

    poster_to_post_map = {}
    for result in results['result']:
        poster_to_post_map[result['_id']] = result['post_id']

    return poster_to_post_map
</code></pre>

Now that I have the mapping, I can query the <code>posters</code> and <code>posts</code> collections separately for the full documents and then mung them together for display.

Now, the problem isn't that it doesn't work, it does... kind of. Say I have a much higher volume of posts and I want to page through a list of posters with their latest post. If my page limit is "10 posters per page" and within the resulting 10 documents there exists a single poster with 2 or more posts, I actually get back fewer than 10 items in my map. For example, I have 10 posts, and 1 poster has 3 posts within the initial result. The aggregation framework will then discard the other 2 posts and associate the latest with that user, resulting in a map containing 8 entries, not 10.

This is extremely frustrating as I cannot reliably paginate through the results. Nor can I accurately determine whether or not I'm on the last page of results, as any given page may contain anywhere from 0 entries up to the page limit.

What, if anything, am I doing wrong here? What I am trying to accomplish is simple enough and the aggregation framework seems like a perfect fit for my problem.

This would be simple enough if it were a stored proc on a traditional relational database, but that's what we sacrifice when we move to schemaless document stores; relationships are managed outside of the context of the database.

Anyhow, the code should be pretty easy to follow and I'll answer any questions you might have. Either way, thanks for taking the time to read. :)

Edit: SOLVED

Here is a gist of the solution for future viewers: <a href="https://gist.github.com/wilhelm-murdoch/6260469" rel="nofollow">https://gist.github.com/wilhelm-murdoch/6260469</a>
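The short page described above most likely comes from stage order: <code>$skip</code> and <code>$limit</code> run before <code>$group</code>, so the page is cut from posts rather than from distinct posters. A minimal sketch of one possible reordering follows. It is not necessarily what the linked gist does. The function name <code>build_latest_post_pipeline</code> and the extra <code>latest_at</code> field are assumptions made for illustration; the collection fields are the ones from the question.

```python
def build_latest_post_pipeline(discussion_ids, start=None, end=None,
                               skip=None, limit=None, order=-1):
    """Build a pipeline that pages over distinct posters, not posts.

    Hypothetical helper: the name and the 'latest_at' field are
    illustrative and do not appear in the original code.
    """
    match = {'discussion_id': {'$in': discussion_ids}}
    if start and end:
        match['created_at'] = {'$gte': start, '$lt': end}

    pipeline = [
        # Filter first so later stages only see the relevant posts.
        {'$match': match},
        # Newest posts first, so $first picks each poster's latest post.
        {'$sort': {'created_at': -1}},
        # Collapse to one document per poster.
        {'$group': {
            '_id': '$poster_id',
            'post_id': {'$first': '$_id'},
            'latest_at': {'$first': '$created_at'},
        }},
        # Order the posters by the date of their latest post.
        {'$sort': {'latest_at': order}},
    ]

    # Page only after grouping, so a page holds up to `limit` posters.
    if skip:
        pipeline.append({'$skip': skip})
    if limit:
        pipeline.append({'$limit': limit})

    return pipeline
```

The result can be passed to <code>self.db.posts.aggregate(...)</code> as before. With paging applied after grouping, a page shorter than <code>limit</code> should only occur on the last page. Note that the <code>results['result']</code> access in the original method matches older pymongo releases; pymongo 3.0 and later return a cursor from <code>aggregate</code> that is iterated directly.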
Tags
<python><mongodb><distinct><aggregation-framework>
Title
Distinct sorting and grouping with MongoDB aggregation framework
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USWilhelm Murdoch
UserOwnerUserId
1. USWilhelm Murdoch
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PODistinct sorting and grouping with MongoDB aggregation framework
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PODistinct sorting and grouping with MongoDB aggregation framework
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
