Algorithm to find top 10 search terms

Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.

Useful book in this area: Muthukrishnan - "Data Streams: Algorithms and Applications"

Closely related reference to the problem at hand which I picked from the above: Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]

By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.

Heh, nice interview question.


Frequency Estimation Overview

There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it find all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.

The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.

Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).

Instead, these frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).

The Frequent Algorithm

Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.

Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.

  1. Look at each item in the stream.
  2. If the term exists in the map, increment the associated counter.
  3. Otherwise, if the map less than k - 1 entries, add the term to the map with a counter of one.
  4. However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.

Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.

Counting Searches

In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe its okay to to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.

Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.