ElasticSearch group by multiple fields

The aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:

{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }          
        }
      }
    }
  }
}

Of course this can go on for as many fields as you'd like.

Update:
For completeness, here is how the output of the above query looks. Also below is python code for generating the aggregation query and flattening the result into a list of dictionaries.

{
  "aggregations": {
    "agg1": {
      "buckets": [{
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            },
            {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }, ...
          ]
        },
        {
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            },
            {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }, ...
          ]
        }, ...
      ]
    }
  }
}

The following python code performs the group-by given the list of fields. I you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch thanks to this)

def group_by(es, fields, include_missing):
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result

I think some developers will be definitely looking same implementation in Spring DATA ES and JAVA ES API.

Please finds :-

List<FieldObject> fieldObjectList = Lists.newArrayList();
    SearchQuery aSearchQuery = new NativeSearchQueryBuilder().withQuery(matchAllQuery()).withIndices(indexName).withTypes(type)
            .addAggregation(
                    terms("ByField1").field("field1").subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                            .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
                    )
            .build();
    Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
        @Override
        public Aggregations extract(SearchResponse aResponse) {
            return aResponse.getAggregations();
        }
    });
    Terms aField1Terms = aField1Aggregations.get("ByField1");
    aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
        String field1Value = aField1Bucket.getKey();
        Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");

        aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
            String field2Value = aField2Bucket.getKey();
            Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");

            aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
                String field3Value = aField3Bucket.getKey();
                Long count = aField3Bucket.getDocCount();

                FieldObject fieldObject = new FieldObject();
                fieldObject.setField1(field1Value);
                fieldObject.setField2(field2Value);
                fieldObject.setField3(field3Value);
                fieldObject.setCount(count);
                fieldObjectList.add(fieldObject);
            });
        });
    });

imports need to be done for same :-

import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.terms; 
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.collect.Lists;
import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.TermFilterBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.ResultsExtractor;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

You can use Composite Aggregation query as follows. This type of query also paginates the results if the number of buckets exceeds from the normal value of ES. By using the field 'after' you can access the rest of buckets:

"aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "field1": {
              "terms": {
                "field": "field1"
              }
            }
          },
          {
            "field2": {
              "terms": {
                "field": "field2"
              }
            }
          },
         {
            "field3": {
              "terms": {
                "field": "field3"
              }
            }
          },
        ]
      }
    }
  }

You can find more detail in ES page bucket-composite-aggregation.


sub-aggregations is what you need .. though this is never explicitly stated in the docs it can be found implicitly by structuring aggregations

It will result the sub-aggregation as if the query was filtered by result of the higher aggregation. It actually looks like as if this is what happens in there.

{
"aggregations": {
    "VALUE1AGG": {
      "terms": {
        "field": "VALUE1",
      },
      "aggregations": {
        "VALUE2AGG": {
           "terms": {
             "field": "VALUE2",
          }
        }
      }
    }
  }
}