Lucene 4 Pagination

I've been using Lucene 4.8 to build a REST interface that includes pagination. My solution is to use a TopScoreDocCollector and call its topDocs(int startIndex, int numberOfHits) method. The start index is the zero-based page number multiplied by the number of hits per page.

...
DirectoryReader reader = DirectoryReader.open(MMapDirectory.open(new java.io.File(indexFile)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_RESULTS, true);  // MAX_RESULTS is just an int limiting the total number of hits
int startIndex = (page - 1) * hitsPerPage;  // our page is 1-based, so we need to convert to zero-based
Query query = new QueryParser(Version.LUCENE_48, "All", analyzer).parse(searchQuery);
searcher.search(query, collector);
TopDocs hits = collector.topDocs(startIndex, hitsPerPage);
...

My REST interface accepts the page number and the number of hits per page as parameters, so going forward or back is as simple as submitting a new request with the appropriate page value.
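The page arithmetic above can be sketched as a small helper. This is only an illustration: `PageWindow` is a hypothetical class, not part of Lucene, and the choice to reject page numbers below 1 and to clamp the window to MAX_RESULTS is an assumption about how the REST layer should behave.

```java
// Hypothetical helper, not part of Lucene: converts a 1-based page request
// into the (start, howMany) pair that is passed to collector.topDocs(start, howMany).
final class PageWindow {
    final int start;    // zero-based index of the first hit on the page
    final int howMany;  // number of hits to request for the page

    private PageWindow(int start, int howMany) {
        this.start = start;
        this.howMany = howMany;
    }

    /**
     * @param page        1-based page number from the request
     * @param hitsPerPage requested page size
     * @param maxResults  collector capacity (MAX_RESULTS in the snippet above)
     */
    static PageWindow of(int page, int hitsPerPage, int maxResults) {
        if (page < 1 || hitsPerPage < 1 || maxResults < 1) {
            throw new IllegalArgumentException("page, hitsPerPage and maxResults must be >= 1");
        }
        // Use long arithmetic so a very large page number cannot overflow int.
        long start = (long) (page - 1) * hitsPerPage;
        if (start >= maxResults) {
            // The page lies beyond what the collector retains: request nothing.
            return new PageWindow(maxResults, 0);
        }
        // Shrink the last reachable page so it does not run past the capacity.
        int howMany = (int) Math.min(hitsPerPage, maxResults - start);
        return new PageWindow((int) start, howMany);
    }
}
```

With this sketch, `PageWindow.of(3, 10, 25)` yields start 20 and howMany 5, and the two fields can then be handed to `collector.topDocs(w.start, w.howMany)`.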


I agree with the solution explained by Jaimie. But I want to point out another aspect you have to be aware of, one that also helps in understanding how a search engine works in general.

With a TopScoreDocCollector you define the maximum number of hits to keep for your search query. The pages you return are cut from those hits, sorted by score (TopFieldCollector does the same for other sort criteria).

See the following example:

TopScoreDocCollector collector = TopScoreDocCollector.create(9999, true);
searcher.search(parser.parse("Clone Warrior"), collector);
// get first page
TopDocs topDocs = collector.topDocs(0, 10);
int resultSize = topDocs.scoreDocs.length; // 10 or less
int totalHits = topDocs.totalHits; // number of all matching documents, can be more than 9999

Here we tell Lucene to keep at most the 9999 best-scoring documents that match the query 'Clone Warrior'. The collector still visits every matching document and counts it in totalHits, but if the index contains more than 9999 matches, only the 9999 with the highest scores are retained. Any page that starts beyond position 9999 comes back empty.

This means that MAX_RESULTS limits how deep a user can page, not how good the first pages are. It is only relevant if you expect a large number of hits. On the other hand, if you search for "luke skywalker" and expect only one hit, then MAX_RESULTS can also be set to 1.

So changing MAX_RESULTS changes which scoreDocs can be returned, because pages are cut from the retained hits only. In practice, set MAX_RESULTS large enough that a human user cannot complain about a specific document being unreachable. This differs from a SQL database, which by default returns the complete result set.
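To make the role of MAX_RESULTS concrete, here is a simplified model of a top-N collector as I understand the Lucene 4.x behavior: every match is counted, but only the N best scores are kept in a bounded priority queue. `TopNModel` is a hypothetical class written for this explanation and works on bare scores; it is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Simplified, hypothetical model of a top-N collector; not Lucene code.
final class TopNModel {
    private final int capacity;
    // Min-heap: the weakest retained score sits on top and is evicted first.
    private final PriorityQueue<Float> queue = new PriorityQueue<>();
    private int totalHits;

    TopNModel(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /** Called once per matching document. */
    void collect(float score) {
        totalHits++;                       // every match is counted
        if (queue.size() < capacity) {
            queue.add(score);              // still room: keep it
        } else if (score > queue.peek()) {
            queue.poll();                  // full: replace the weakest hit
            queue.add(score);
        }
    }

    int totalHits() {
        return totalHits;
    }

    /** Returns the scores of one page, best first; empty if start is out of range. */
    List<Float> page(int start, int howMany) {
        List<Float> sorted = new ArrayList<>(queue);
        Collections.sort(sorted, Collections.reverseOrder());
        int from = Math.min(Math.max(start, 0), sorted.size());
        int to = (int) Math.min((long) from + Math.max(howMany, 0), sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

In this model a capacity of 3 fed with five scores reports five total hits but can only serve pages from the three best, which mirrors why a page starting beyond MAX_RESULTS has nothing to show.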

Lucene also supports another mechanism. Instead of limiting the number of hits, you can limit how long the collector is allowed to run, for example by wrapping it in a TimeLimitingCollector so that collection always stops after 300 ms. This is a good way to protect your application against performance problems. But if you want to be sure that all relevant documents are taken into account, you have to set MAX_RESULTS and the maximum wait time high enough that neither limit is ever reached.
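The trade-off of a time budget can be illustrated with a small model. `TimeBudgetModel` and its `Clock` interface are hypothetical names introduced here so the example can run deterministically; the model returns a flag on timeout, whereas Lucene's TimeLimitingCollector, as far as I know, signals the timeout with a TimeExceededException and leaves the hits gathered so far in the wrapped collector.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of time-limited collection; not Lucene code.
final class TimeBudgetModel {

    /** Hypothetical clock abstraction so the time source can be replaced in tests. */
    interface Clock {
        long nowMillis();
    }

    /** Outcome of one collection run. */
    static final class Result {
        final List<Integer> collected; // document ids gathered before the deadline
        final boolean timedOut;        // true if the budget ran out before all matches were seen

        Result(List<Integer> collected, boolean timedOut) {
            this.collected = collected;
            this.timedOut = timedOut;
        }
    }

    /** Collects matching document ids until the list is exhausted or the budget is spent. */
    static Result collect(List<Integer> matches, long budgetMillis, Clock clock) {
        long deadline = clock.nowMillis() + budgetMillis;
        List<Integer> collected = new ArrayList<>();
        for (Integer docId : matches) {
            if (clock.nowMillis() > deadline) {
                // Budget exceeded: return the partial result and say so.
                return new Result(collected, true);
            }
            collected.add(docId);
        }
        return new Result(collected, false);
    }
}
```

A partial result like this is incomplete in an arbitrary way: the documents that were skipped are simply the ones not reached in time, not the ones with the lowest scores. That is the price of the protection a time limit gives.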