Why does base64-encoded data compress so poorly?

Most generic compression algorithms work with a one-byte granularity.

Let's consider the following string:

"XXXXYYYYXXXXYYYY"
  • A Run-Length-Encoding algorithm will say: "That's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
  • A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
  • A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."

Now let's encode our string in Base64. Here's what we get:

"WFhYWFlZWVlYWFhYWVlZWQ=="

All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
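
To see this concretely, here's a minimal Python check (using the standard base64 module; nothing here is specific to any particular compressor) of what a byte-oriented algorithm is now looking at:

import base64

raw = b"XXXXYYYYXXXXYYYY"
enc = base64.b64encode(raw)

print(enc)               # b'WFhYWFlZWVlYWFhYWVlZWQ=='
print(len(set(raw)))     # 2 distinct byte values: the structure is trivial to spot
print(len(set(enc)))     # 9 distinct byte values, and the runs/repeats are far less obvious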

As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):

Input bytes    : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc

Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):

0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
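
Here's a rough sketch of that repacking in Python; ALPHABET is the table quoted just above, and encode_group is only an illustrative name, not a library function:

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_group(a, b, c):
    """Repack three 8-bit values into four 6-bit values, then map each to the alphabet."""
    block = (a << 16) | (b << 8) | c                   # 24 bits: aaaaaaaabbbbbbbbcccccccc
    return "".join(ALPHABET[(block >> shift) & 0x3F]   # read them back 6 bits at a time
                   for shift in (18, 12, 6, 0))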

For instance, our example string begins with a group of three bytes, each equal to 0x58 in hexadecimal (the ASCII code of the character "X"), or 01011000 in binary. Let's apply Base64 encoding:

Input bytes      : 0x58     0x58     0x58
As binary        : 01011000 01011000 01011000
6-bit repacking  : 00010110 00000101 00100001 00011000
As decimal       : 22       5        33       24
Base64 characters: 'W'      'F'      'h'      'Y'
Output bytes     : 0x57     0x46     0x68     0x59

Basically, the pattern "3 times the byte 0x58", which was obvious in the original data stream, is no longer obvious in the encoded data stream, because we've broken the bytes into 6-bit packets and mapped them to new bytes that now look random.

Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
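
The worked example above can be double-checked with Python's base64 module (or with the encode_group sketch earlier):

import base64

group = bytes([0x58, 0x58, 0x58])            # "XXX"
encoded = base64.b64encode(group)

print(encoded)                               # b'WFhY'
print([hex(b) for b in encoded])             # ['0x57', '0x46', '0x68', '0x59']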

Whatever compression algorithm is used, the Base64 encoding usually has a severe impact on its performance. That's why you should always compress first and encode second.

This is even more true for encryption: compress first, encrypt second.
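
As a rough illustration of why the order matters, here's a sketch with Python's zlib and base64 modules on some made-up compressible data (exact sizes will vary, but the relative ordering is the point):

import base64, random, zlib

random.seed(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
data = b" ".join(random.choice(words) for _ in range(50_000))   # compressible, non-trivial input

compress_then_encode = base64.b64encode(zlib.compress(data, 9))
encode_then_compress = zlib.compress(base64.b64encode(data), 9)

print(len(data))
print(len(compress_then_encode))   # compressed payload, plus a single 4/3 growth at the very end
print(len(encode_then_compress))   # usually clearly larger: the compressor only ever sees the "mess"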

EDIT - A note about LZMA

As MSalters noticed, LZMA -- which xz uses -- works on bit streams rather than byte streams.

Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:

Input bytes      : 0x58     0x58     0x58
As binary        : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes     : 0x57     0x46     0x68     0x59
As binary        : 01010111 01000110 01101000 01011001

Even when working at the bit level, it is much easier to recognize a pattern in the input bit sequence than in the output bit sequence.
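
The same kind of quick check works with Python's lzma module; the two-symbol input mirrors the 'X'/'Y' example above, and the exact sizes will depend on the data:

import base64, lzma, random

random.seed(0)
data = bytes(random.choice(b"XY") for _ in range(100_000))   # only two byte values, in random order

print(len(lzma.compress(data)))                       # the single varying bit per byte is easy to model
print(len(lzma.compress(base64.b64encode(data))))     # usually larger once Base64 smears that bit around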


Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" or "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).

Now look at what Base64 encoding does: it replaces a 3-byte (24-bit) word with a 4-byte word, using only 64 out of 256 possible values. This gives you the 1.33x growth.

Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. This causes them to no longer be compressed as a single substring, but as two separate substrings instead.

Since you see a lot of compression (97%), apparently very long input substrings are being compressed as a whole. This means that many substrings will also be expanded by the Base64 encoding past the maximum length the encoder can deal with.
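
To make that growth concrete (a throwaway Python check; 260 bytes is just an arbitrary length below the limit mentioned above):

import base64

block = b"A" * 260                      # fits in a single LZMA match (maximum match length: 273)
print(len(block))                       # 260
print(len(base64.b64encode(block)))     # 348 -> longer than 273 once encoded, so one match no longer covers it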