Indexing arrays in CosmosDB

"New" index layout

As stated in Index Types

Azure Cosmos containers support a new index layout that no longer uses the Hash index kind. If you specify a Hash index kind on the indexing policy, the CRUD requests on the container will silently ignore the index kind and the response from the container only contains the Range index kind. All new Cosmos containers use the new index layout by default.

The below issue does not apply to the new index layout. There the default indexing policy works fine (and delivers the results in 36.55 RUs). However pre-existing collections may still be using the old layout.

"Old" index layout

I was able to reproduce the issue with ARRAY_CONTAINS that you are asking about.

Setting up a CosmosDB collection with 100,000 posts from the SO data dump (e.g. this question would be represented as below)

{
    "id": "50614926",
    "title": "Indexing arrays in CosmosDB",
     /*Other irrelevant properties omitted */
    "tags": [
        "azure",
        "azure-cosmosdb"
    ]
}

And then performing the following query

SELECT COUNT(1)
FROM t IN c.tags
WHERE t = 'sql-server'

The query took over 2,000 RUs with default indexing policy and 93 with the following addition (as shown in your linked article)

{
    "path": "/tags/[]/?",
    "indexes": [
        {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
        }
    ]
}

However what you are seeing here is not that the array values aren't being indexed by default. It is just that the default range index is not useful for your query.

The range index uses keys based on partial forward paths. So will contain paths such as the following.

  • tags/0/azure
  • tags/0/c#
  • tags/0/oracle
  • tags/0/sql-server
  • tags/1/azure-cosmosdb
  • tags/1/c#
  • tags/1/sql-server

With this index structure it starts at tags/0/sql-server and then reads all of the remaining tags/0/ entries and the entirety of the entries for tags/n/ where n is an integer greater than 0. Each distinct document mapping to any of these needs to be retrieved and evaluated.

By contrast the hash index uses reverse paths (more details - PDF)

StackOverflow theoretically allows a maximum of 5 tags per question to be added by the UI so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are

  • sql-server/0/tags
  • sql-server/1/tags
  • sql-server/2/tags
  • sql-server/3/tags
  • sql-server/4/tags

With the reverse path structure finding all paths with leaf nodes of value sql-server is straight forward.

In this specific use case as the arrays are bounded to a maximum of 5 possible values it is also possible to use the original range index efficiently by looking at just those specific paths.

The following query took 97 RUs with default indexing policy in my test collection.

SELECT COUNT(1)
FROM c
WHERE  'sql-server' IN (c.tags[0], c.tags[1], c.tags[2], c.tags[3], c.tags[4])