Lossless compression technique for digital signals in an embedded system

Compression is all about finding the redundancies in the data and removing them. Since you don't seem to be able to tell us much about your actual data sets, this answer will have to be very generic.

I gather that "potentiostat" data is continuous, varies slowly in general, but might have small deviations from sample to sample. One good way to encode this in 15-minute (450 sample) blocks would be to fit a first-, second- or third-order (or more, depending on the general nature of your data) polynomial to the block of data to capture its overall shape.1

The block would then be encoded as the parameters to that polynomial (perhaps four 16-bit numbers), plus the individual sample deviations from that polynomial, which presumably can be much smaller numbers — perhaps 450 3- or 4-bit numbers, for a total of 1414 or 1864 bits instead of the original 7200 bits — a compression ratio of roughly 4 or 5 to 1.
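As a rough illustration of the polynomial approach on synthetic data (NumPy used for brevity; a real device would quantize the four coefficients to 16 bits, which slightly changes the residuals, and would likely do the fit in fixed point):

```python
# Sketch: fit a cubic to a 450-sample block, store the 4 coefficients
# plus small per-sample residuals. The data here is synthetic; the
# degree and block size are the ones suggested above.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(450)
# Slowly varying 16-bit samples with a little ADC noise (+/-3 counts).
samples = np.round(1000 + 0.5 * t - 0.0005 * t**2
                   + rng.integers(-3, 4, size=450)).astype(int)

tn = t / 450.0                                   # normalized time axis
coeffs = np.polyfit(tn, samples, deg=3)          # 4 polynomial parameters
prediction = np.round(np.polyval(coeffs, tn)).astype(int)
residuals = samples - prediction                 # what actually gets stored

bits_per_residual = 4                            # signed range -8..7
assert residuals.min() >= -8 and residuals.max() <= 7
compressed_bits = 4 * 16 + 450 * bits_per_residual
print(compressed_bits, 450 * 16)                 # 1864 vs 7200 bits
```

Reconstruction is exact by construction: `prediction + residuals` gives back the original samples, so the scheme is lossless as long as the residuals fit in the chosen bit width.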

If you find that you can't use a fixed width for the sample deltas, consider using Huffman encoding to represent them — small values get short codes, while the presumably rarer large values get longer codes. You should still be able to get a compression ratio around 3 to 1, again, depending on your data.
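Building a Huffman table for the deltas is only a few lines; a minimal sketch using Python's `heapq` (the example delta distribution is made up, and an embedded version would use a precomputed static table in C):

```python
# Minimal Huffman code construction: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_codes(symbols):
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol case
        return {next(iter(freq)): '0'}
    # Heap of (frequency, tiebreak id, {symbol: code-so-far}).
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # merge the two rarest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]

# Mostly-small deltas: 0 dominates, large values are rare.
deltas = [0] * 50 + [1] * 20 + [-1] * 20 + [5] * 5 + [-7] * 5
codes = huffman_codes(deltas)
total_bits = sum(len(codes[d]) for d in deltas)
print(codes, total_bits)   # 0 gets a 1-bit code; well under 16 bits/sample
```

The resulting code is prefix-free, so the encoded bitstream can be decoded unambiguously without separators.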


1 If it turns out that the data has a cyclic component, it might be more useful to use autoregression or Fourier analysis to identify the key periodic components (frequency, phase and amplitude), and then record the individual sample deltas from the function defined by those parameters.


Your data will likely have two components to it:

  • Low speed changes of the actual voltage
  • Random variation due to ADC noise

It is likely that the random variation will account for most of the entropy in the data. You can reduce it by oversampling: for example, taking 4 samples at a time and averaging them (summing and dividing by 4) cuts the noise amplitude in half, since uncorrelated noise falls as the square root of the number of samples averaged.
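A quick numeric check of the oversampling claim, using Gaussian noise as a stand-in for ADC noise:

```python
# Averaging groups of 4 samples reduces the RMS noise by ~sqrt(4) = 2.
import numpy as np

rng = np.random.default_rng(1)
true_value = 512.0
raw = true_value + rng.normal(0, 4.0, size=40000)   # noisy reads, sigma = 4
averaged = raw.reshape(-1, 4).mean(axis=1)           # 4x oversampling

print(raw.std(), averaged.std())   # ~4.0 vs ~2.0
```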

There is no widely popular compression format for lossless compression of ADC samples. Free Lossless Audio Codec (FLAC) is sometimes used, but it is quite computationally heavy and may not be suitable for your system. Generic lossless compression formats such as DEFLATE and LZ4 take less computational resources, but the compression ratio will not be very good unless you add custom preprocessing steps. So it may be best to design a custom compression scheme.

The generic structure for compression methods usually consists of two parts:

  1. Preprocessing to improve compressibility: for example, taking the difference of two consecutive values, or applying polynomial prediction.
  2. Entropy coding to represent each symbol with the least amount of bits needed. One common method is Huffman codes, using either constant or dynamically built tables.

One custom format I've used with success is to compute the delta between consecutive samples and encode the difference using Elias gamma coding. It has the benefit of being simple to implement, and it achieves less than one byte per sample for slowly changing signals.
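A minimal sketch of that scheme in Python, with bits kept as a '0'/'1' string for clarity (a real implementation would pack them into bytes). Elias gamma only codes integers >= 1, so signed deltas are first mapped to positive integers with a zigzag step; that mapping, and storing the first sample raw, are my assumptions rather than part of the format described above:

```python
# Delta + Elias gamma coding sketch.

def elias_gamma(n: int) -> str:
    assert n >= 1
    b = bin(n)[2:]                  # binary, no '0b' prefix
    return '0' * (len(b) - 1) + b   # N zeros, then the (N+1)-bit value

def zigzag(d: int) -> int:
    # Maps 0,1,-1,2,-2,... to 1,3,2,5,4,... (my assumption; gamma needs n >= 1)
    return 2 * d + 1 if d >= 0 else -2 * d

def encode(samples):
    # First sample is assumed stored raw elsewhere; only deltas are coded.
    return ''.join(elias_gamma(zigzag(cur - prev))
                   for prev, cur in zip(samples, samples[1:]))

def decode(bits, first, count):
    vals = [first]
    i = 0
    for _ in range(count):
        n = 0
        while bits[i] == '0':       # count the leading zeros
            n += 1
            i += 1
        m = int(bits[i:i + n + 1], 2)
        i += n + 1
        d = (m - 1) // 2 if m % 2 else -(m // 2)   # inverse zigzag
        vals.append(vals[-1] + d)
    return vals

encoded = encode([100, 100, 101, 101, 99, 100])
print(len(encoded))   # 13 bits for 5 deltas, vs 80 bits at 16 bits/sample
```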


The best encoding to use is going to depend a lot on the distribution of your samples. You've told us the deltas are mostly quite small, which means your first step will almost certainly be delta-encoding (transforming each value into its difference from the previous value.)

Another constraint will be the system you are doing the encoding on -- you've said it's "embedded", but that spans quite a range of capability. You've said also that SD cards are out of scope, and that you're only buffering 450 samples at a time in RAM, which suggests a very small system indeed. In that case, optimizing for simplicity and conservation of CPU/RAM seems in order.

If the most common delta value is exactly 0 -- that is, lots of samples are the same as the previous sample -- it's probably a good idea to first "run-length encode" those runs of 0 values. (I.e. just storing how many there were in a row.)

The rest depends further on what the distribution of values looks like. I will presume for the sake of the exercise that they are almost all in the range -64 < x < 63 (i.e. a 7-bit signed integer). I'm also assuming it's easiest to work with bytes rather than bits (which is probably true if e.g. you're writing C) -- if that's not true, see the very bottom of the answer for a bit-wise scheme. A very simple byte-wise encoding could look something like this:

0b0xxxxxxx - a literal value (delta) represented as 7-bit signed integer in the "xxxxxxx" part. (Values from -64 to 63.)

0b10xxxxxx - a run of zeros (deltas), with length represented by "xxxxxx" (6 bits unsigned can express up to 63, and if we need more we can just add another entry.)

0b110xxxxx 0byyyyyyyy - a literal value (delta) represented as a 13-bit signed integer in the "xxxxxyyyyyyyy" part.

0b11111111 0bxxxxxxxx 0byyyyyyyy - a literal value (delta) represented as a 16-bit signed integer. This is a very inefficient encoding (obviously) since it turns a 16-bit value into a 3-byte representation. It needlessly wastes space in order to keep the output byte-aligned. This scheme only makes sense if deltas this large are very rare. (Every nontrivial compression scheme will have some inputs for which the resulting output is actually larger; this is a theorem of information theory.)

(The above scheme is slightly inspired by the UTF-8 encoding of Unicode.)
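A sketch of this byte-wise scheme in Python (the embedded version would be straightforward C). Two details are my own choices rather than specified above: the 16-bit escape uses big-endian byte order, and the decoder rejects the unassigned `0b111xxxxx` prefixes other than `0xFF`:

```python
# Byte-wise delta codec: 7-bit literals, zero-run bytes, 13-bit and
# 16-bit escapes, exactly as in the four-entry table above.

def encode_deltas(deltas):
    out = bytearray()
    i = 0
    while i < len(deltas):
        d = deltas[i]
        if d == 0:
            run = 0                               # 10xxxxxx: run of zeros
            while i < len(deltas) and deltas[i] == 0 and run < 63:
                run += 1
                i += 1
            out.append(0b10000000 | run)          # runs > 63 emit more entries
            continue
        if -64 <= d <= 63:                        # 0xxxxxxx: 7-bit literal
            out.append(d & 0x7F)
        elif -4096 <= d <= 4095:                  # 110xxxxx yyyyyyyy: 13-bit
            out.append(0b11000000 | ((d >> 8) & 0x1F))
            out.append(d & 0xFF)
        else:                                     # 0xFF + 16-bit (big-endian)
            out.append(0xFF)
            out += (d & 0xFFFF).to_bytes(2, 'big')
        i += 1
    return bytes(out)

def decode_deltas(data):
    deltas = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                              # 7-bit signed literal
            deltas.append(b - 128 if b & 0x40 else b)
            i += 1
        elif b < 0xC0:                            # run of zero deltas
            deltas += [0] * (b & 0x3F)
            i += 1
        elif b >> 5 == 0b110:                     # 13-bit signed literal
            v = ((b & 0x1F) << 8) | data[i + 1]
            deltas.append(v - 8192 if v & 0x1000 else v)
            i += 2
        elif b == 0xFF:                           # 16-bit signed literal
            v = int.from_bytes(data[i + 1:i + 3], 'big')
            deltas.append(v - 65536 if v & 0x8000 else v)
            i += 3
        else:
            raise ValueError("unassigned prefix")
    return deltas

sample_deltas = [0, 0, 0, 5, -3, 0, 200, -4096, 30000]
assert decode_deltas(encode_deltas(sample_deltas)) == sample_deltas
```

Note the asymmetry in cost: the nine deltas above (three of them zero) encode to twelve bytes, and a run of up to 63 zero deltas costs a single byte.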

Unlike Huffman codes (mentioned in another answer), the assumed distribution of values is fixed in advance. This is a virtue because it keeps things simple, and it avoids adding overhead to the start of every block of samples; it's a vice because a more adaptive scheme would not require hand-tuning to the distribution.

If most deltas are much smaller than the -64 to 63 range a single byte can cover, a better byte-wise encoding than the above will need to process more than one sample at a time, in order to get better than 2:1 compression (that is, more than one sample per output byte.)

If bitwise encoding is ok, then a much simpler-to-describe scheme is as follows: still delta encode first, then encode as follows. A 0 bit is followed by a variable-length positive integer encoding the number of zeros to follow; a 1 bit is followed by a sign bit, then a variable-length positive integer, together encoding the next (delta) value. The variable-length positive integers can be encoded using one of the codes from https://en.wikipedia.org/wiki/Universal_code_(data_compression), such as one of the Elias codes. (Which encoding is best will again depend on the distribution of the data, but probably any of them will do great.)
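For completeness, a sketch of that bit-wise scheme using Elias gamma for the variable-length positive integers (bits kept as a '0'/'1' string for readability; a real encoder would pack them):

```python
# Bit-wise scheme: '0' + gamma(run length of zeros), or
# '1' + sign bit + gamma(|delta|), applied after delta-encoding.

def gamma(n: int) -> str:              # Elias gamma code for n >= 1
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def encode_bits(deltas):
    bits = []
    i = 0
    while i < len(deltas):
        if deltas[i] == 0:
            run = 0
            while i < len(deltas) and deltas[i] == 0:
                run += 1
                i += 1
            bits.append('0' + gamma(run))                   # zero run
        else:
            d = deltas[i]
            i += 1
            bits.append('1' + ('1' if d < 0 else '0') + gamma(abs(d)))
    return ''.join(bits)

def read_gamma(s, i):
    n = 0
    while s[i] == '0':
        n += 1
        i += 1
    return int(s[i:i + n + 1], 2), i + n + 1

def decode_bits(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] == '0':
            run, i = read_gamma(s, i + 1)
            out += [0] * run
        else:
            neg = s[i + 1] == '1'
            mag, i = read_gamma(s, i + 2)
            out.append(-mag if neg else mag)
    return out

d = [0, 0, 0, 1, -2, 0, 5, 0, 0, 0, 0, -1]
assert decode_bits(encode_bits(d)) == d
```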