Elasticsearch: paginating a sorted, aggregated result

Field collapsing is the answer.

Field collapsing is used when we want to group the hits on a specific field (as in group by agg_field), keeping only the top-sorted hit for each distinct value.

Before Elasticsearch 6, the way to group hits was to use an aggregation. That approach lacked a way to do efficient paging.

Now that Elasticsearch provides field collapsing out of the box, it is pretty easy.

Below is a sample query that uses field collapsing, taken from the Elasticsearch documentation.

GET /twitter/_search
{
  "query": {
      "match": {
          "message": "elasticsearch"
      }
  },
  "collapse" : {
      "field" : "user", 
      "inner_hits": {
          "name": "last_tweets", 
          "size": 5, 
          "sort": [{ "date": "asc" }] 
      },
      "max_concurrent_group_searches": 4 
  },
  "sort": ["likes"]
}
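
To actually page through the collapsed hits, the usual from and size parameters can be added to the same request. Here is a minimal Python sketch that builds the request body for a given page. The function name collapsed_page_body and its page and page_size arguments are made up for illustration, and inner_hits is left out to keep it short:

```python
import json


def collapsed_page_body(page, page_size=10):
    """Build a search body for one page of collapsed hits (page is 0-based).

    Hypothetical helper: only the query, collapse and sort parts come
    from the sample request above.
    """
    return {
        "query": {"match": {"message": "elasticsearch"}},
        "collapse": {"field": "user"},
        "sort": ["likes"],
        # With collapse, from/size step through the collapsed hits
        # (one top hit per user), not through the raw documents.
        "from": page * page_size,
        "size": page_size,
    }


# Third page of 10 collapsed hits, ready to send to GET /twitter/_search
print(json.dumps(collapsed_page_body(page=2, page_size=10), indent=2))
```

Keep in mind that from + size is still capped by index.max_result_window (10,000 by default), and that the total hit count in the response counts matching documents, not distinct groups.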


The composite aggregation might help here, as it allows you to group by multiple fields and then paginate over the results. The only thing it doesn't let you do is jump to a given offset, but you can emulate that by iterating from your client code if necessary.

So here is a sample query to do that:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

In the response you'll see an after_key structure:

  "after_key": {
    "store": "http://google.com1087",
    "status": "OK1087",
    "title": "Titanic1087"
  },

It's a kind of cursor: you pass it as the after parameter of the composite aggregation in your subsequent queries, like this:

{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

This will give you the next 100 buckets. Repeat with each new after_key until the response comes back with no buckets. Hopefully this helps.
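
If you'd rather not copy the cursor by hand, the loop can be sketched in Python roughly like this. The search argument is a hypothetical callable that takes a request body (dict) and returns the parsed JSON response (dict); wire it to whatever client you use. The function name iter_composite_buckets is also just an illustration:

```python
import copy


def iter_composite_buckets(search, body, agg_name="my_buckets"):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is a hypothetical callable: request body (dict) in,
    parsed JSON response (dict) out.
    """
    body = copy.deepcopy(body)  # don't mutate the caller's request
    composite = body["aggs"][agg_name]["composite"]
    while True:
        agg = search(body)["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            return  # an empty page means we are past the last bucket
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
        # Feed the cursor back in to request the next page
        composite["after"] = after_key
```

Since this is a generator, emulating an offset comes down to skipping buckets on the client, for instance with itertools.islice(iter_composite_buckets(search, body), offset, offset + 100). The skipped pages are still fetched from the cluster, so this only makes sense for moderate offsets.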

UPDATE:

If you want to know how many buckets there are going to be in total, the composite aggregation won't give you that number. However, since the composite buckets are drawn from the cartesian product of the values of all the fields in its sources, you can get a rough approximation of that total by also returning the [cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together. Strictly speaking this is an upper bound, because only the combinations that actually occur in documents produce a bucket.

  "aggs": {
    "my_buckets": {
      "composite": {
        ...
      }
    },
    "store_cardinality": {
      "cardinality": {
        "field": "store_url"
      }
    },
    "status_cardinality": {
      "cardinality": {
        "field": "status"
      }
    },
    "title_cardinality": {
      "cardinality": {
        "field": "title"
      }
    }
  }

We can then estimate the total number of buckets by multiplying together the figures we get in store_cardinality, status_cardinality and title_cardinality. This is only an approximation: the cardinality aggregation is itself approximate, so the estimate works pretty well on low-cardinality fields but not so well on high-cardinality ones.
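
On the client side, the multiplication could look like the Python sketch below. The function name estimate_total_buckets is hypothetical and the numbers in the response fragment are made up for illustration; the only thing it relies on is that each cardinality aggregation comes back with a value field:

```python
import math


def estimate_total_buckets(aggregations, cardinality_names):
    """Multiply the `value` of each cardinality aggregation in the response."""
    return math.prod(aggregations[name]["value"] for name in cardinality_names)


# Made-up "aggregations" fragment of a search response, for illustration only
aggregations = {
    "store_cardinality": {"value": 12},
    "status_cardinality": {"value": 3},
    "title_cardinality": {"value": 250},
}
names = ["store_cardinality", "status_cardinality", "title_cardinality"]

print(estimate_total_buckets(aggregations, names))  # 12 * 3 * 250 = 9000
```

Dividing that figure by the composite size (100 in the queries above) gives a rough idea of how many pages to expect.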