Compression algorithms for numbers only

This is one of those places where a little theory helps. You need to think about several things:

  • what is the resolution of your measurements: 0.1° or 0.001°? One second or one microsecond?
  • are the measurements associated and in some order, or tossed together randomly?

Let's say, just for example, that the resolution is 0.01°. Then you know that your longitudes range from -180° to +180°, or about 36,000 distinct values. lg(36000) ≈ 15.1, so you need 16 bits; latitudes (-90° to +90°) need 15. Clearly, if you're storing this kind of value as a 32-bit float, you can compress the data by half immediately.
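A quick sanity check on that arithmetic, as a few lines of Python (the function name is mine; the ranges and resolution are the ones from the example):

```python
import math

def bits_needed(lo_deg, hi_deg, resolution):
    """Bits required to hold every distinct value in [lo, hi] at the given step."""
    distinct = round((hi_deg - lo_deg) / resolution) + 1
    return math.ceil(math.log2(distinct))

print(bits_needed(-180, 180, 0.01))  # 16 -- longitude
print(bits_needed(-90, 90, 0.01))    # 15 -- latitude
```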

Similarly with date/time: what's the range you need to cover, and how many bits does that take?
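For instance, at one-second resolution over a 30-year window (both numbers are assumptions for illustration), a timestamp needs far fewer than 64 bits:

```python
import math
from datetime import datetime

# Assumed range: 2000 through 2030, at one-second resolution.
span_s = int((datetime(2030, 1, 1) - datetime(2000, 1, 1)).total_seconds())
print(span_s, "seconds ->", math.ceil(math.log2(span_s)), "bits")  # ~947 million -> 30
```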

Now, if the data is in some order (say, samples taken sequentially aboard a single ship), then all you need is a start value and a delta; that can make a big difference. A ship traveling at 30 knots changes position by at most about 0.5 degrees an hour (one nautical mile is one minute of arc), or about 0.00014 degrees a second. Those deltas are going to be very small values, so you can store them in very few bits.
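Back-of-the-envelope, assuming one position fix per minute (the sample interval is my assumption; everything else comes from the numbers above):

```python
import math

SPEED_KNOTS = 30
DEG_PER_NM = 1 / 60        # one nautical mile is one minute of arc
RESOLUTION_DEG = 0.01      # from the example above
INTERVAL_S = 60            # one position fix per minute (assumption)

max_delta_deg = SPEED_KNOTS * DEG_PER_NM * INTERVAL_S / 3600
max_delta = math.ceil(max_delta_deg / RESOLUTION_DEG)    # in 0.01 degree steps
bits = math.ceil(math.log2(2 * max_delta + 1))           # signed delta range
print(max_delta, "step ->", bits, "bits per sample")     # 1 step -> 2 bits, not 16
```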

The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.


Update: Oh, wait, fixed-point strings?!

Okay, this is (relatively) easy. To start with, yes, you want to convert your strings into some binary representation. Making up a data item, you might have

040.00105.0020090518212100Z

which you could convert to

| 4000            | short int, 16 bits |
| 10500           | short int, 16 bits |
| 20090518212100Z | 64 bits            |

So that's 96 bits, or 12 bytes, versus 27 bytes for the string.
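A sketch of the conversion in Python; the field positions are my assumption, read off the made-up record above:

```python
import struct

record = "040.00105.0020090518212100Z"

# Assumed layout: 6 chars latitude, 6 chars longitude, then the timestamp.
lat_s, lon_s, ts_s = record[0:6], record[6:12], record[12:]

lat = round(float(lat_s) * 100)   # 4000, in 0.01 degree units
lon = round(float(lon_s) * 100)   # 10500
ts = int(ts_s.rstrip("Z"))        # 20090518212100 fits easily in 64 bits

packed = struct.pack(">hhq", lat, lon, ts)  # 2 + 2 + 8 = 12 bytes
print(len(packed), "bytes, versus", len(record))
```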


Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve is higher, since fewer bits are used to store the byte values that appear more frequently (that's how Huffman coding works).

Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually both space- and retrieval-efficient.
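You can see both effects with a quick experiment: the same values stored as text compress at a better ratio, but the binary form starts smaller and (here, at least) stays smaller after compression. zlib is just a stand-in for a generic byte-stream compressor:

```python
import random
import struct
import zlib

random.seed(1)
values = [random.randint(-18000, 18000) for _ in range(10000)]  # lat-like, 0.01 deg units

as_text = ",".join(str(v) for v in values).encode()    # digits and commas: skewed byte values
as_binary = struct.pack(f"<{len(values)}h", *values)   # 2 bytes per value, near-uniform

for label, data in (("text", as_text), ("binary", as_binary)):
    print(label, len(data), "->", len(zlib.compress(data)), "bytes")
```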

I recommend you have a look at The Data Compression Book.


What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things affect how well it compresses, and may let you convert the data into something more easily compressed, or simply smaller right out of the gate.

Data compression works poorly on "random" data. If your data falls within a smaller range, you may well be able to leverage that.

In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.
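Trying the common algorithms is cheap; Python ships three in its standard library, so the comparison is a few lines (the helper name and the input file are mine):

```python
import bz2
import lzma
import zlib

def try_codecs(data: bytes) -> None:
    """Compare stock compressors; whether the result is compressed enough is your call."""
    for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)):
        print(f"{name}: {len(data)} -> {len(compress(data))} bytes")

with open("samples.bin", "rb") as f:   # hypothetical input file
    try_codecs(f.read())
```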

An example: say your data is not just lats and longs, but lats and longs that are known to be "close" to each other. Then you could store an "origin" lat and long, and encode the rest as differences. Perhaps those differences are small enough to fit into a single signed byte, as in the sketch below.
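A minimal sketch of that idea, assuming 0.01° resolution and points within about 1.27° of the origin (the limit a signed byte imposes):

```python
import struct

def pack_track(points):
    """points: list of (lat, lon) in degrees, all near the first point."""
    lat0 = round(points[0][0] * 100)           # origin, in 0.01 degree units
    lon0 = round(points[0][1] * 100)
    out = struct.pack(">hh", lat0, lon0)       # 4-byte origin
    for lat, lon in points[1:]:
        dlat = round(lat * 100) - lat0
        dlon = round(lon * 100) - lon0
        out += struct.pack(">bb", dlat, dlon)  # one signed byte each; raises if out of range
    return out

track = [(40.00, 105.00), (40.01, 105.02), (39.99, 105.05)]
print(len(pack_track(track)))  # 4 + 2 per remaining point = 8 bytes
```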

That's just a simple example of what you can do with knowledge of the data that a generic algorithm can't figure out on its own.