compression algorithm for sorted integers

You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer, you can use arithmetic coding or other entropy coding techniques to compress to the theoretical minimum size.
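
For reference, the bound in question is the Shannon entropy: if value x occurs with probability p(x), no lossless code can average fewer than

    H = -sum over x of p(x) · log2 p(x)

bits per value, and arithmetic coding approaches this bound arbitrarily closely.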

The trick is in predicting accurately.

First, you should probably compress the distances between consecutive numbers, because gap values repeat and therefore allow you to make statistical statements. If you were to compress the numbers directly you'd have a hard time modelling them, because each value occurs only once.
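
A minimal sketch of that transform (ToGaps is my own name for it), assuming the input is sorted and taking the deltas from an initial value of 0:

#include <vector>

std::vector<unsigned long> ToGaps(const std::vector<unsigned long>& sorted) {
  std::vector<unsigned long> gaps;
  gaps.reserve(sorted.size());
  unsigned long prev = 0;
  for (unsigned long v : sorted) {
    gaps.push_back(v - prev);  // non-negative because the input is sorted
    prev = v;
  }
  return gaps;
}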

Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.

You probably need to account for missing values (you clearly can't assign them a probability of 0, because a zero-probability symbol cannot be encoded at all), but you can use heuristics for that, like encoding the next distance bit by bit and predicting each bit individually. You will pay almost nothing for the high-order bits because they are almost always 0 and entropy encoding optimizes them away.
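
To get a feel for how much such a histogram model could buy, here is a sketch (GapEntropyBits is my own helper) that measures the empirical entropy of the gaps; an ideal entropy coder driven by those frequencies would approach this many bits per gap:

#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Empirical entropy of the gap distribution, in bits per gap. This is
// an estimate of the achievable size, not an encoder.
double GapEntropyBits(const std::vector<unsigned long>& gaps) {
  if (gaps.empty()) return 0.0;
  std::unordered_map<unsigned long, std::size_t> freq;
  for (unsigned long g : gaps) ++freq[g];
  double h = 0.0;
  const double n = static_cast<double>(gaps.size());
  for (const auto& entry : freq) {
    const double p = entry.second / n;
    h -= p * std::log2(p);
  }
  return h;
}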

All of this is much simpler if you know the distribution. Example: if you are compressing a list of all prime numbers, you know the theoretical distribution of the gaps, because there are formulae for that (by the prime number theorem, the average gap near x is about ln x). So you already have a perfect model.


There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.

It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either

  • a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or

  • a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.

For example, if k is 4, the sequence:

00011 1 1 00000 1 00001

codes three numbers. The first code (a 0-bit followed by the four-bit delta 0011, i.e. 3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, total 17, and the third number is 35+17=52.

The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits).

Another, possibly better, way of thinking of this is that each delta is coded as a variable-length bit sequence: 1^i 0 (1|0)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the size analysis below.

Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1), where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.
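
To see where that estimate comes from: treating k as continuous, the total size in bits is roughly

    f(k) = m/2^k + n·(k + 1)

and setting the derivative f'(k) = -ln(2)·m/2^k + n to zero gives

    2^k = ln(2)·m/n,   i.e.   k = log2(m/n) + log2(ln 2) ≈ log2(m/n) - 0.53

so the continuous optimum sits about half a bit below log2(m/n), which is consistent with the rounding behaviour observed in the experiment below.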

I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000, with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31,000,000) to 9.45 (n=16,400,000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31,000,000) to 7.93 (n=16,400,000). I also tried 6-bit and 8-bit suffixes; except in the case of n=31,000,000, where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)), but it's not far off.

Compression code:

#include <ostream>
#include <vector>

// BitOut/BitIn are assumed to be simple bit-stream helper classes (not
// shown here): put(v, n) writes the low n bits of v, put(b) a single bit.
void Compress(std::ostream& os,
              const std::vector<unsigned long>& v,
              unsigned long k = 0) {
  BitOut out(os);
  out.put(v.size(), 64);
  if (v.size()) {
    unsigned long twok;
    if (k == 0) {
      // Auto-select k near log2(m/n): double twok until it exceeds ratio/2.
      unsigned long ratio = v.back() / v.size();
      for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
    } else {
      twok = 1UL << k;  // 1UL, not 1, to avoid overflow for k >= 31
    }
    out.put(k, 32);

    unsigned long prev = 0;
    for (unsigned long val : v) {
      // Emit one bare 1-bit per full 2^k step of the delta...
      while (val - prev >= twok) { out.put(1); prev += twok; }
      // ...then a 0-bit followed by the k-bit remainder.
      out.put(0);
      out.put(val - prev, k);
      prev = val;
    }
  }
  out.flush(1);  // pad the final byte with 1-bits
}

Decompression:

#include <istream>
#include <vector>

std::vector<unsigned long> Decompress(std::istream& is) {
  BitIn in(is);
  std::vector<unsigned long> v;  // declared here so the return below sees it
  unsigned long size = in.get(64);
  if (size) {
    unsigned long k = in.get(32);
    unsigned long twok = 1UL << k;

    v.reserve(size);
    unsigned long prev = 0;
    for (; size; --size) {
      while (in.get()) prev += twok;  // accumulate 2^k per bare 1-bit
      prev += in.get(k);              // then add the k-bit suffix
      v.push_back(prev);
    }
  }
  return v;
}
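
A quick round trip through an in-memory stream (still assuming the BitOut/BitIn helpers) reproduces the worked example above; for {3, 35, 52} the auto-selected k is 4 and the payload bits come out as 00011 1 1 00000 1 00001:

#include <cassert>
#include <sstream>
#include <vector>

int main() {
  std::vector<unsigned long> v = {3, 35, 52};
  std::stringstream buf;
  Compress(buf, v);  // ratio = 52/3 = 17, so k is auto-selected as 4
  std::vector<unsigned long> round = Decompress(buf);
  assert(round == v);
}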

It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.
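
A sketch of that layout for k = 8 (the struct and field names are my own):

#include <cstdint>
#include <vector>

// flags holds the leading bit of every code in order: true is a bare
// 1-bit (an increment of 2^8), false announces the next byte in
// suffixes. Decoding walks flags and consumes one suffix per false.
struct PackedCodes {
  std::vector<bool> flags;        // one entry per code
  std::vector<uint8_t> suffixes;  // one byte per 0-code when k = 8
};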

A variant, which results in slightly longer files but is a bit easier to build indexes for, is to only use the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1-bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding to the next 0 in the bit vector).



I want to add another answer with the simplest possible solution:

  1. Convert the numbers to deltas as discussed previously
  2. Run them through the 7-zip LZMA2 algorithm; it is even multi-core ready

I think this will give almost perfect results in your case, because the distances have a simple distribution; 7-zip will be able to pick it up.
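
If you want a concrete format for step 1's output, one option (WriteVarints is my own helper, not anything 7-zip-specific) is to pack each delta as a LEB128-style varint, so small gaps take one byte each before LZMA2 sees them:

#include <fstream>
#include <string>
#include <vector>

// Write each delta as a varint: 7 payload bits per byte, high bit set
// while more bytes follow. Sorted data with small gaps becomes a
// stream of mostly single bytes with a simple distribution.
void WriteVarints(const std::vector<unsigned long>& deltas,
                  const std::string& path) {
  std::ofstream os(path, std::ios::binary);
  for (unsigned long d : deltas) {
    while (d >= 0x80) {
      os.put(static_cast<char>((d & 0x7F) | 0x80));
      d >>= 7;
    }
    os.put(static_cast<char>(d));
  }
}

The result can then be compressed with something like

    7z a -m0=LZMA2 -mmt=on deltas.7z deltas.bin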


One option that worked well for me in the past was to store a list of 64-bit integers as 8 separate lists of 8-bit values: first the high 8 bits of every number, then the next 8 bits, and so on. For example (scaled down here to 32-bit numbers and 4 lists), say you have the following numbers:

0x12345678
0x12349785
0x13111111
0x13444444

The data stored would be (in hex):

12,12,13,13
34,34,11,44
56,97,11,44
78,85,11,44

I then ran that through the deflate compressor.
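
A sketch of that transform for the 32-bit example (ToBytePlanes is my naming):

#include <array>
#include <cstdint>
#include <vector>

// Split each 32-bit value into four planes: plane 0 holds every
// value's high byte, plane 3 every value's low byte. On sorted input
// each plane changes slowly, which is what deflate then exploits.
std::array<std::vector<uint8_t>, 4> ToBytePlanes(
    const std::vector<uint32_t>& values) {
  std::array<std::vector<uint8_t>, 4> planes;
  for (uint32_t v : values) {
    for (int i = 0; i < 4; ++i) {
      planes[i].push_back(static_cast<uint8_t>(v >> (8 * (3 - i))));
    }
  }
  return planes;
}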

I don't recall what compression ratios I was able to achieve with this, but it was significantly better than compressing the numbers themselves.