How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.

See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub.

Note: Actually, sorting is not required in itself. What is required is that distinct values are contiguous (so there is no need for ordering to be defined) - we get this from sorting but perhaps there is a way of doing this more efficiently.

You can sort the data file using (external) merge sort in roughly O(n log n): split the input data file into smaller files that fit into your memory, sort each one and write it out as a sorted file, then merge the sorted files.
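The split/sort/merge idea can be sketched as below. This is only an illustration under assumptions: the sorted "files" are stood in for by in-memory long[] runs, and the names RunMerger, sortedRuns and merge are made up for this example. A real external sort would read and write files instead of arrays.

```java
import java.util.*;

class RunMerger {
    // Splits data into chunks of at most chunkSize values and sorts each one,
    // standing in for "sort a piece that fits in memory and write it to a file".
    static List<long[]> sortedRuns(long[] data, int chunkSize) {
        List<long[]> runs = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            long[] run = Arrays.copyOfRange(data, start, end);
            Arrays.sort(run);
            runs.add(run);
        }
        return runs;
    }

    // k-way merge: the heap holds one cursor {runIndex, position} per run,
    // ordered by the value that cursor currently points at.
    static long[] merge(List<long[]> runs) {
        int total = 0;
        for (long[] run : runs) {
            total += run.length;
        }
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparingLong((int[] c) -> runs.get(c[0])[c[1]]));
        for (int i = 0; i < runs.size(); i++) {
            if (runs.get(i).length > 0) {
                heap.add(new int[] {i, 0});
            }
        }
        long[] out = new long[total];
        int n = 0;
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            long[] run = runs.get(cursor[0]);
            out[n++] = run[cursor[1]];
            // Advance this run's cursor if it still has values left.
            if (cursor[1] + 1 < run.length) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return out;
    }
}
```

Calling RunMerger.merge(RunMerger.sortedRuns(data, chunkSize)) returns the data fully sorted. With k runs, each output value costs O(log k) heap work, which is where the roughly O(n log n) total comes from.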



About this code sample:

  • Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.

  • The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code only ensures that the result is a set of top N values in no particular order. It does not imply that there aren't other values with the same frequency.

import java.util.*;
import java.util.Map.Entry;

class TopN {
    private final int maxSize;
    private final Map<Long, Long> countMap; // value -> frequency, at most maxSize entries

    public TopN(int maxSize) {
        this.maxSize = maxSize;
        this.countMap = new HashMap<>(maxSize);
    }

    private void addOrReplace(long value, long count) {
        if (countMap.size() < maxSize) {
            countMap.put(value, count);
        } else {
            Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
            Entry<Long, Long> minEntry = opt.get();
            if (minEntry.getValue() < count) {
                countMap.remove(minEntry.getKey());
                countMap.put(value, count);
            }
        }
    }

    public Set<Long> get() {
        return countMap.keySet();
    }

    public void process(long[] data) {
        if (data.length == 0) {
            return; // nothing to count
        }
        long value = data[0];
        long count = 0;

        for (long current : data) {
            if (current == value) {
                ++count;
            } else {
                addOrReplace(value, count);
                value = current;
                count = 1;
            }
        }
        addOrReplace(value, count);
    }

    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        TopN topMap = new TopN(2);

        topMap.process(data);
        System.out.println(topMap.get()); // [5, 6]
    }
}


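Integers are signed 32 bits, so if only non-negative integers occur, there are at most 2^31 distinct values. An array of 2^31 bytes (2 GiB) is right at the edge of the max array size: a single Java array holds at most 2^31 - 1 elements, so in Java the counters would have to be split across two arrays.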

But that can't hold frequencies higher than 255, you would say? Yes, you're right.

So we add a hashmap for all entries that exceed the max value possible in your array (255; if the byte is signed, just start counting at -128). There are at most about 16 million entries in this hashmap (4 billion divided by 255), which should be manageable.


We have two data structures:

  • a large array of bytes, indexed by the number read (0..2^31 - 1)
  • a hashmap of (number read, frequency)

Algorithm:

 while reading next number 'x'
 {
   if (hashmap.contains(x))
   {
     hashmap[x]++;
   }
   else
   {
     bigarray[x]++;
     if (bigarray[x] > 250)
     {
       hashmap[x] = bigarray[x];
     }
   }
 }

 // when done:
 // Look up top-100 in hashmap
 // if not 100 yet, add more from bigarray, skipping those already taken from the hashmap
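
For readers who want Java, here is a rough sketch of the pseudocode above, not a tuned implementation. The names ByteCounter, PROMOTE_AT, add, count and top are invented for this example, and the maxValue constructor parameter is an assumption so that it runs with a small array; for the full range it would need the 2 GiB of counters discussed earlier. It assumes non-negative input.

```java
import java.util.*;

class ByteCounter {
    private static final int PROMOTE_AT = 250;  // same threshold as the pseudocode
    private final byte[] small;                 // one byte counter per possible value
    private final Map<Integer, Long> large = new HashMap<>(); // counts that outgrew a byte

    // maxValue is a hypothetical parameter: largest value expected in the input.
    ByteCounter(int maxValue) {
        small = new byte[maxValue + 1];
    }

    void add(int x) {
        Long big = large.get(x);
        if (big != null) {
            large.put(x, big + 1);
            return;
        }
        int count = (small[x] & 0xFF) + 1;      // read the byte as unsigned 0..255
        small[x] = (byte) count;
        if (count > PROMOTE_AT) {
            large.put(x, (long) count);         // from now on x is counted in the map
        }
    }

    long count(int x) {
        Long big = large.get(x);
        return big != null ? big : (small[x] & 0xFF);
    }

    // Up to n values with the highest counts, most frequent first.
    List<Integer> top(int n) {
        Comparator<Integer> byCount = Comparator.comparingLong(this::count);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byCount); // min-heap on count
        for (int x : large.keySet()) {
            offer(heap, x, n);
        }
        // Only fall back to the array if the map did not supply n values.
        if (heap.size() < n) {
            for (int x = 0; x < small.length; x++) {
                if (small[x] != 0 && !large.containsKey(x)) {
                    offer(heap, x, n);
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    private void offer(PriorityQueue<Integer> heap, int x, int n) {
        heap.add(x);
        if (heap.size() > n) {
            heap.poll();                        // drop the current least frequent
        }
    }
}
```

Skipping the array scan when the map already holds n values is safe because every count in the map is above 250 and every count left in the array is at most 250.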

I'm not fluent in Java, so can't give a better code example.


Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.

All it does is assume an upper bound on the numbers read. It should work if the inputs are non-negative Integers, which have a maximum of 2^31 - 1. The sample input satisfies that constraint.


The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.


In pseudocode:

  1. Perform an external sort
  2. Do a pass to collect the top 100 frequencies (not which values have them)
  3. Do another pass to collect the values that have those frequencies

Assumption: There are clear winners - no ties (outside the top 100).

Time complexity: O(n log n) (approx) due to sort. Space complexity: Available memory, again due to sort.

Steps 2 and 3 are both O(n) time and O(1) space.
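
Steps 2 and 3 might look like the sketch below, assuming step 1 has already produced sorted data. A long[] stands in for the sorted file, and the names TwoPassTop, topFrequencies, valuesWithAtLeast and top are invented for this illustration.

```java
import java.util.*;

class TwoPassTop {
    // Step 2: the n highest run lengths in the sorted data, kept in a min-heap
    // so the smallest kept frequency is always at the head.
    static PriorityQueue<Long> topFrequencies(long[] sorted, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>();
        int i = 0;
        while (i < sorted.length) {
            int j = i;
            while (j < sorted.length && sorted[j] == sorted[i]) {
                j++;
            }
            heap.add((long) (j - i));
            if (heap.size() > n) {
                heap.poll();
            }
            i = j;
        }
        return heap;
    }

    // Step 3: every value whose run length is at least the given threshold.
    static List<Long> valuesWithAtLeast(long[] sorted, long threshold) {
        List<Long> result = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int j = i;
            while (j < sorted.length && sorted[j] == sorted[i]) {
                j++;
            }
            if (j - i >= threshold) {
                result.add(sorted[i]);
            }
            i = j;
        }
        return result;
    }

    static List<Long> top(long[] sorted, int n) {
        PriorityQueue<Long> heap = topFrequencies(sorted, n);
        if (heap.isEmpty()) {
            return new ArrayList<>();
        }
        return valuesWithAtLeast(sorted, heap.peek());
    }
}
```

If several values tie at the lowest kept frequency, top returns more than n values, which is why the no-ties assumption matters. The heap never holds more than n frequencies, so the extra space is constant for a fixed n.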


If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.

If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.
