Base64 vs HEX for sending binary content over the internet in XML doc

I was curious how on EARTH base64 can convert 3 input bytes into 4 output bytes for just 33% space growth (whereas hex converts 1 input byte into 2 output bytes for 100% space growth). Why specifically 3 input bytes?

The answer is:

3 bytes = 3 x 8 bits = 24 bits.

Why that magic "24 bits" number? Well, base 64 represents the numbers 0 to 63. How are those represented in binary? With 000000 (0) to 111111 (63).

Bingo! Each base64 character represents 6 bits of input data using a single output byte (a single character such as "Z", etc).

So 24 bits (3 full 8-bit bytes of input) / 6 bits (base64 alphabet) = 4 bytes of base64. That's it!

Or, described another way, every Base64 character (which is 1 byte (8 bits)) encodes 6 bits of real data. And if we divide 8bits/6bits we see where the 33% growth comes from, as mentioned at the top of this post... So yes, Base64 always increases data size by 33% (plus some potential padding by the = characters that are sometimes added at the end of the base64 output).

You may think "Why not base128 (7 bits of input = 8 bits of output), at just 14% size growth when encoding?". The answer for that is that base64 is the best we can find, since the lower 128 ASCII characters aren't all printable. Many are control characters such as NULL etc.

There are obviously ways to create other systems such as perhaps "base81" etc, since you can do anything you want if you create a custom encoding algorithm. But the beauty of base64 is how it encodes data so cleanly in chunks of 6 bits, and how you simply have to "read 3 bytes and output 4" to encode, and "read 4 bytes and output 3" to decode. So that encoding scheme became popular.

Now you are hopefully wiser after having read this.

Fun Update: Speaking of other encoding styles with more characters... It's come to my attention that Ascii85 aka Base85 exists and is slightly more efficient (25% data size growth when encoding as Base85 instead of 33% for Base64): https://en.wikipedia.org/wiki/Ascii85


You could just write your own method for Base64 as well... but I'd generally recommend using external, well-tested libraries for both. (It's not like there's any shortage of them.)

The difference between Base64 and hex is really just how bytes are represented. Hex is another way of saying "Base16". Hex will take two characters for each byte - Base64 takes 4 characters for every 3 bytes, so it's more efficient than hex. Assuming you're using UTF-8 to encode the XML document, a 100K file will take 200K to encode in hex, or 133K in Base64. Of course it may well be that you don't care about the space efficiency - in many cases it won't matter. If it does matter, then clearly Base64 is better on that front. (There are alternatives which are even more efficient, but they're not as common.)

Tags:

Java