Elasticsearch: Working with frequently updated documents

It seems to me that if you're using ES, you should just update all the data in the index and query against it. If you split the text (as far as I understand, you store topics in ES for text search) and the "digital" (numeric) data between datastores, you'll take a bigger performance hit than you would from reindexing docs in ES.

The only things ES can do with documents in an index are indexing and deleting. So there are two ways to speed up reindexing:

  • speed up the "payload" - reduce the time taken to remove a document and index it again. This can be achieved by moving the ES index to memory, to leverage Lucene's RamIndexStore

  • reduce network overhead - perform the operations on the ES side with scripts
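
As a rough illustration of the second option, the sketch below builds the request for a scripted update, so the increment runs on the ES side in a single round trip. It assumes a recent Elasticsearch version (the `POST /<index>/_update/<id>` endpoint and Painless scripts; older releases used a different URL layout and script syntax). The index name `topics`, the field `views`, and the helper `build_increment_request` are hypothetical.

```python
import json


def build_increment_request(index, doc_id, field, amount=1):
    """Return (method, path, body) for a scripted counter increment.

    Hypothetical helper: it only builds the request and does not send it.
    """
    body = {
        "script": {
            "lang": "painless",
            # The field name is embedded in the script source; the amount is
            # passed as a parameter so the same script text is reused.
            "source": f"ctx._source.{field} += params.amount",
            "params": {"amount": amount},
        }
    }
    return "POST", f"/{index}/_update/{doc_id}", json.dumps(body)


method, path, body = build_increment_request("topics", "42", "views")
print(method, path)
print(body)
```

The body can then be sent with whatever HTTP client or ES client you already use; only the counter change crosses the network, not the whole document.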

Btw, are you already experiencing performance issues?


Regarding partial updates to documents, it is important to recognize that while the API lets you perform a partial update, behind the scenes it performs a full update by retrieving the document, changing it, and reindexing it. The following is from the Elasticsearch website:

Partial Updates to Documents

In Updating a Whole Document, we said that the way to update a document is to retrieve it, change it, and then reindex the whole document. This is true. However, using the update API, we can make partial updates like incrementing a counter in a single request.

We also said that documents are immutable: they cannot be changed, only replaced. The update API must obey the same rules. Externally, it appears as though we are partially updating a document in place. Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described. The difference is that this process happens within a shard, thus avoiding the network overhead of multiple requests. By reducing the time between the retrieve and reindex steps, we also reduce the likelihood of there being conflicting changes from other processes.
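
To make the retrieve-change-reindex cycle described above concrete, here is a toy in-memory model. It is only a sketch of the idea and not how Elasticsearch is implemented; `ToyShard`, `VersionConflict`, and `bump_views` are made-up names. The `retries` argument plays a role similar to the update API's `retry_on_conflict` parameter.

```python
class VersionConflict(Exception):
    """Raised when a document changed between retrieve and reindex."""


class ToyShard:
    """In-memory stand-in for a shard; not how Elasticsearch is implemented."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source dict)

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(doc_id)
        # Store a fresh copy: the old document is replaced, never edited.
        self._docs[doc_id] = (current_version + 1, dict(source))
        return current_version + 1

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def update(self, doc_id, change, retries=3):
        """Retrieve, change, and reindex; retry if another writer got in first."""
        for _ in range(retries + 1):
            version, source = self.get(doc_id)
            change(source)
            try:
                return self.index(doc_id, source, expected_version=version)
            except VersionConflict:
                continue
        raise VersionConflict(doc_id)


def bump_views(source):
    source["views"] += 1


shard = ToyShard()
shard.index("1", {"title": "Frequently updated documents", "views": 0})
shard.update("1", bump_views)
print(shard.get("1"))
```

Even though the caller only asked to bump one field, the whole document is written again under a new version, which is why a "partial" update costs about as much as a full reindex of that document.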

To store the full-text data in Elasticsearch and also have frequently changing fields that don't force a reindex of the entire document, you will need to store those fields elsewhere. This can be a metadata / counter store in another Elasticsearch index or in another system.

For common use cases, you could run the same query against both and merge the results. These are most likely simple filters and sorts on fields that don't change, such as subject, creation time, and author.
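
A minimal sketch of the merge step, assuming the topic hits and the counters have already been fetched from their respective stores. The data shapes, the `views` field, and the function name `merge_hits_with_counters` are assumptions for illustration.

```python
def merge_hits_with_counters(hits, counters, default_count=0):
    """Attach counters from a separate store to search hits, keeping hit order."""
    merged = []
    for hit in hits:
        row = dict(hit)  # copy so the original hit is left untouched
        row["views"] = counters.get(hit["id"], default_count)
        merged.append(row)
    return merged


# Made-up sample data standing in for the two query results.
topic_hits = [
    {"id": "t1", "subject": "Partial updates"},
    {"id": "t2", "subject": "Reindexing"},
]
view_counters = {"t1": 120}

for row in merge_hits_with_counters(topic_hits, view_counters):
    print(row)
```

Topics with no entry in the counter store fall back to `default_count`, so a missing counter does not drop the hit from the results.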

For searches that can't be run against both stores, such as full-text queries, you can either (a) not display that data, or (b) use an eventually consistent approach where you periodically update the Elasticsearch topic store with the updated counts. Many systems that don't have high consistency requirements can use the eventually consistent approach, including Stack Overflow and Netflix. For example, on some sites you'll get one count on one page / widget and another count on another page / widget due to the eventually consistent design.
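
For option (b), the periodic sync could push the accumulated counts in one bulk request. The sketch below only builds the newline-delimited body for the bulk API and assumes a recent Elasticsearch version where the action line needs just `_index` and `_id`. The index name `topics`, the field `views`, and the helper `build_bulk_count_sync` are hypothetical.

```python
import json


def build_bulk_count_sync(index, counts):
    """Build an NDJSON body for the bulk API from a {doc_id: count} mapping."""
    lines = []
    for doc_id, count in sorted(counts.items()):
        # Action line, followed by the partial document to merge in.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": {"views": count}}))
    # The bulk API expects every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


pending_counts = {"t2": 7, "t1": 125}
print(build_bulk_count_sync("topics", pending_counts), end="")
```

Each of these updates still reindexes its document internally, so the win comes from doing it once per sync interval rather than once per view.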