MongoDB aggregation $group followed by $limit for pagination

I have solved the problem without needing to maintain another collection and without $group traversing the whole collection, so I am posting my own answer.

As others have pointed:

  1. $group doesn't retain order, so early sorting is not of much help.
  2. $group is not optimized away by a following $limit; it still runs over the entire collection.

My use case has the following unique features, which helped me solve it:

  1. There will be a maximum of 10 records per student (and a minimum of 1).
  2. I am not very particular about page size; the front-end is capable of handling varying page sizes.

The following is the aggregation command I have used:

    db.classtest.aggregate(
    [
        {$sort: {name: 1}},
        {$limit: 5 * 10},
        {$group: {_id: '$name',
            total: {$sum: '$marks'}}},
        {$sort: {_id: 1}}
    ])
    

Explaining the above:

  1. If $sort immediately precedes $limit, the framework optimizes the amount of data to be sent to the next stage. Refer here.
  2. To get a minimum of 5 records (the page size), I need to pass at least 5 (page size) * 10 (max records per student) = 50 records to the $group stage. With this, the size of the final result may be anywhere between 0 and 50.
  3. If the result has fewer than 5 records, no further pagination is required.
  4. If the result size is greater than 5, there is a chance that the last student's records were not completely processed (i.e., not all of that student's records were grouped), so I discard the last record from the result.
  5. The name in the last record (among the retained results) is then used as the $match criterion in the subsequent page request, as shown below.

    db.classtest.aggregate(
    [
        {$match: {name: {$gt: lastRecordName}}},
        {$sort: {name: 1}},
        {$limit: 5 * 10},
        {$group: {_id: '$name',
            total: {$sum: '$marks'}}},
        {$sort: {_id: 1}}
    ])
    

In the above, the framework will still optimize $match, $sort and $limit together as a single operation, which I have confirmed through the explain plan.
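The client-side paging logic around these two commands can be sketched in plain JavaScript, runnable in Node without a MongoDB connection. The in-memory filter/slice/Map below merely stands in for what the server's $match, $sort, $limit and $group stages do, and the names getPage, PAGE_SIZE and MAX_PER_STUDENT are illustrative, not part of any API:

```javascript
const PAGE_SIZE = 5;          // desired minimum page size
const MAX_PER_STUDENT = 10;   // known upper bound on records per student

function getPage(records, lastRecordName) {
  // $match + $sort + $limit: `records` is assumed already sorted by name
  const window = records
    .filter(r => lastRecordName === undefined || r.name > lastRecordName)
    .slice(0, PAGE_SIZE * MAX_PER_STUDENT);

  // $group: sum marks per name (a Map preserves the sorted insertion order)
  const totals = new Map();
  for (const r of window) {
    totals.set(r.name, (totals.get(r.name) || 0) + r.marks);
  }
  let page = [...totals].map(([name, total]) => ({ _id: name, total }));

  // If the window was full and produced more than a page of groups, the last
  // student may be only partially summed, so discard that group.
  if (window.length === PAGE_SIZE * MAX_PER_STUDENT && page.length > PAGE_SIZE) {
    page = page.slice(0, -1);
  }

  // The last retained name becomes the cursor for the next page request.
  return { page, lastRecordName: page.length ? page[page.length - 1]._id : null };
}
```

Each call returns one page of totals plus the cursor to feed into the next call's $gt match.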


The first thing to consider here is that the aggregation framework works with a "pipeline" of stages applied in order to produce a result. If you are familiar with processing things on the "command line" or "shell" of your operating system, then you might have some experience with the "pipe" or | operator.

Here is a common unix idiom:

ps -ef | grep mongod | tee "out.txt"

In this case the output of the first command, ps -ef, is "piped" to the next command, grep mongod, which in turn has its output "piped" to tee out.txt, which writes both to the terminal and to the specified file. This is a "pipeline" where each stage "feeds" the next, in the order in which they are written.

The same is true of the aggregation pipeline. A "pipeline" here is in fact an "array", which is an ordered set of instructions to be passed in processing the data to a result.

db.classtest.aggregate([
    { "$group": {
      "_id": "$name",
      "total": { "$sum": "$marks"}
    }},
    { "$sort": { "_id": 1 } },
    { "$limit": 5 }
])  

So what happens here is that all of the items in the collection are first processed by $group to get their totals. There is no specified "order" to grouping, so there is not much sense in pre-ordering the data; nor is there any point in doing so, because you have yet to reach your later stages.

Then you would $sort the results and also $limit as required.
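What that pipeline does can be pictured in plain JavaScript. This is an illustrative, in-memory sketch (the real work happens server-side); note how the grouping step must visit every document before the sort and limit can run:

```javascript
// Conceptual model of the group -> sort -> limit pipeline.
function firstPage(records, pageSize) {
  const totals = {};
  for (const r of records) {                    // $group over the whole collection
    totals[r.name] = (totals[r.name] || 0) + r.marks;
  }
  return Object.entries(totals)
    .map(([name, total]) => ({ _id: name, total }))
    .sort((a, b) => a._id.localeCompare(b._id)) // $sort on the grouped key
    .slice(0, pageSize);                        // $limit
}
```

However small pageSize is, the loop still touches every record, which is exactly the cost the pipeline above incurs.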

For your next "page" of data you will ideally want to $match on the last unique name found, like so:

db.classtest.aggregate([
    { "$match": { "name": { "$gt": lastNameFound } }},
    { "$group": {
      "_id": "$name",
      "total": { "$sum": "$marks"}
    }},
    { "$sort": { "_id": 1 } },
    { "$limit": 5 }
])  

It's not the best solution, but there really aren't alternatives for this type of grouping. It will, however, get notably "faster" with each iteration towards the end. Alternately, storing all the unique names (or reading them out of another collection) and "paging" through that list with a "range query" on each aggregation statement may be a viable option, if your data permits it.

Something like:

db.classtest.aggregate([
    { "$match": { "name": { "$gte": "Allan", "$lte": "David" } }},
    { "$group": {
      "_id": "$name",
      "total": { "$sum": "$marks"}
    }},
    { "$sort": { "_id": 1 } }
])  
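If you do keep a separate sorted list of the unique names, deriving the $gte/$lte bounds for each page is straightforward. A sketch with a made-up helper name (`nameRanges` is not any library function):

```javascript
// Slice a sorted list of unique names into page-sized ranges, each giving the
// $gte/$lte bounds for one aggregation call.
function nameRanges(sortedNames, pageSize) {
  const ranges = [];
  for (let i = 0; i < sortedNames.length; i += pageSize) {
    const chunk = sortedNames.slice(i, i + pageSize);
    ranges.push({ $gte: chunk[0], $lte: chunk[chunk.length - 1] });
  }
  return ranges;
}

// nameRanges(["Allan", "Bert", "Carl", "David", "Eve"], 4)
//   -> [{ $gte: "Allan", $lte: "David" }, { $gte: "Eve", $lte: "Eve" }]
```

Each range then drops into the $match stage shown above, so every aggregation only groups one page's worth of students.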

Unfortunately there is no "limit grouping up until x results" option, so unless you can work with another list, you are basically grouping everything (and possibly a gradually smaller set each time) with each aggregation query you send.