How does Lucene (Solr/ElasticSearch) so quickly do filtered term counts?

You may already know this but it won't hurt to say that Lucene uses inverted indexing. In this indexing technique a dictionary of each word occurring in all documents is made and against each word information about that words occurrences is stored. Something like this image enter image description here

To achieve this Lucene stores documents, indexes and their Metadata in different files formats. Follow this link for file details http://lucene.apache.org/core/3_0_3/fileformats.html#Overview

If you read the document numbers section, each document is given an internal ID so when documents are found with word 'consign' the lucene engine has the reference to the metadata of it. Refer to the overview section to see what data gets saved in different lucene indexes. Now that we have a pointer to the stored documents Lucene may be getting it in one of the following ways

  1. Really count the number of words if the document is stored
  2. Use Term Dictionary, frequency, and proximity data to get the count.

Finally, which API are you using to "quickly determine counts for all words"

Image credit to http://leanjavaengineering.wordpress.com/

Check about index file format here http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description


there are no bitsets involved: its an inverted index. Each term maps to a list of documents. In lucene the algorithms work on iterators of these "lists", so items from the iterator are read on-demand, not all at once.

this diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html

Behind the scenes, it is much like this diagram in lucene. Our lists are delta-encoded and bitpacked, and augmented with a skiplist which allows us to intersect more efficiently (via the additional advance() operation) than the above algorithm though.

DocIDSetIterator is this "enumerator" in lucene. it has the two main methods, next() and advance(). And yes, it is true you can decide to read the entire list + convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.