Algorithm to find the smallest snippet from searching a document?

I already posted a rather straightforward algorithm that solves exactly that problem in this answer:

Google search results: How to find the minimum window that contains all the search keywords?

However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.

In your case the input is represented slightly differently: as a set of vectors, one per word, each holding that word's positions in sorted order. This representation is easily transformed into what the above algorithm needs by merging all these vectors into a single vector of (position, word) pairs ordered by position. The merge can be done literally, or it can be done "virtually", by placing the original vectors into a priority queue ordered by their first elements. Popping an element from the queue then means popping the first element of the vector at the head of the queue and, if necessary, sinking that vector to its new place in the queue according to its new first element.
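To illustrate the "virtual" merge, here is a minimal Python sketch. It is not the code from the linked answer. The names (`tagged`, `smallest_window`, `positions_by_word`) are made up for illustration, and it assumes each word maps to a sorted list of start positions, with the window measured from the earliest of the chosen positions to the latest. The standard library's `heapq.merge` plays the role of the priority queue of vectors.

```python
import heapq


def tagged(word, positions):
    """Yield (position, word) pairs for one word's sorted position list."""
    for pos in positions:
        yield pos, word


def smallest_window(positions_by_word):
    """Return (start, end) of the smallest window holding every word, or None.

    Assumption: positions_by_word maps each word to a sorted list of positions.
    """
    # heapq.merge lazily merges the sorted streams, i.e. the "virtual" merge.
    merged = heapq.merge(*(tagged(w, p) for w, p in positions_by_word.items()))
    needed = len(positions_by_word)
    last_seen = {}  # most recent position of each word seen so far
    best = None
    for pos, word in merged:
        last_seen[word] = pos
        if len(last_seen) == needed:
            # The best window ending here starts at the oldest "last seen" position.
            start = min(last_seen.values())
            if best is None or pos - start < best[1] - best[0]:
                best = (start, pos)
    return best
```

Taking the minimum of `last_seen` costs O(k) per step for k words, which is a constant for a fixed number of words; for large k a different bookkeeping structure would be needed.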

Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you an O(N) algorithm, where N is the total length of all arrays.
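For the three-word case, one possible way to write this down is the sketch below. It is only an illustration under my own assumptions: the function name `smallest_window_three` is hypothetical, each argument is a sorted list of positions for one word, and the window is measured from the smallest to the largest of the three current positions.

```python
def smallest_window_three(a, b, c):
    """Return (start, end) of the smallest window covering one position
    from each of the three sorted lists, or None if any list is empty."""
    i = j = k = 0
    best = None
    while i < len(a) and j < len(b) and k < len(c):
        lo = min(a[i], b[j], c[k])
        hi = max(a[i], b[j], c[k])
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
        # "Pop" the smallest front element: no window that starts at lo
        # can get shorter by advancing one of the other two lists.
        if a[i] == lo:
            i += 1
        elif b[j] == lo:
            j += 1
        else:
            k += 1
    return best
```

Each iteration advances exactly one index, so the loop runs at most N times in total.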

Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.


Unless I've overlooked something, here's a simple, O(n) algorithm:

  1. We'll represent the snippet by (x, y) where x and y are where the snippet begins and ends respectively.
  2. A snippet is feasible if it contains all 3 search words.
  3. We will start with the infeasible snippet (0,0).
  4. Repeat the following until y has reached end-of-string and the current snippet is infeasible:
    1. If the current snippet (x, y) is feasible, proceed to the snippet (x+1, y).
      Otherwise (the current snippet is infeasible), proceed to the snippet (x, y+1).
  5. Choose the shortest snippet among all feasible snippets we went through.

Running time - in each iteration either x or y is increased by 1. Clearly x can't exceed y, and y can't exceed the string length, so the total number of iterations is O(n). Also, feasibility can be checked in O(1) in this case, since we can track how many occurrences of each word are within the current snippet. We can update this count in O(1) with each increase of x or y by 1.

Correctness - for each x that starts a feasible snippet, we visit the minimal feasible snippet (x, ?). Thus we must go over the overall minimal snippet. Also, if y is the smallest y such that (x, y) is feasible, then any feasible snippet (x+1, y') has y' >= y, so y never needs to move backwards (this bit is why this algorithm is linear and the others aren't).
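The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the text has already been split into a list of tokens, it uses inclusive token indices for (x, y) rather than character offsets, and the names `shortest_snippet`, `tokens` and `keywords` are made up for the example.

```python
from collections import Counter


def shortest_snippet(tokens, keywords):
    """Return inclusive token indices (x, y) of the shortest feasible
    snippet, or None if some keyword never occurs."""
    need = set(keywords)
    counts = Counter()  # occurrences of each keyword inside tokens[x..y]
    covered = 0         # number of distinct keywords currently inside
    best = None
    x = 0
    for y, token in enumerate(tokens):   # infeasible: extend the right end
        if token in need:
            counts[token] += 1
            if counts[token] == 1:
                covered += 1
        while covered == len(need):      # feasible: record, then shrink from the left
            if best is None or y - x < best[1] - best[0]:
                best = (x, y)
            left = tokens[x]
            if left in need:
                counts[left] -= 1
                if counts[left] == 0:
                    covered -= 1
            x += 1
    return best
```

For the tokens of "the quick brown fox jumps over the lazy dog the fox" and the keywords the, fox, dog, this returns (8, 10), i.e. "dog the fox". Both x and y only ever move forward, which is where the O(n) bound comes from.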
