Why does UTF-32 exist when only 21 bits are necessary to encode every character?

32 bits is a multiple of 16 bits. Working with 32-bit quantities is much more common than working with 24-bit quantities and is usually better supported. It also keeps each character 4-byte aligned (assuming the entire string is 4-byte aligned). Going from 1 byte to 2 bytes to 4 bytes is the most "logical" progression.

Apart from that: the Unicode standard is ever-growing. Codepoints outside of that range could in principle be assigned one day, although UTF-16 cannot encode anything above U+10FFFF, and it is unlikely in the near future anyway, given the huge number of unassigned codepoints still available.


Computers are generally much better at dealing with data on 4 byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries.

(I speculate there was also a reluctance to have a limit that was "only what we can currently imagine being useful" when coming up with the original design. After all, that's caused a lot of problems in the past, e.g. with IPv4. While I can't see us ever needing more than 24 bits, if 32 bits is more convenient anyway then it seems reasonable to avoid having a limit which might just be hit one day, via reserved ranges etc.)

I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.


First there were 2 character coding schemes: UCS-4 that coded each character into 32 bits, as an unsigned integer in range 0x00000000 - 0x7FFFFFFF, and UCS-2 that used 16 bits for each codepoint.

Later it turned out that the 65536 codepoints of UCS-2 would not be enough, but many programs (Windows, cough) relied on wide characters being 16 bits wide, so UTF-16 was created. UTF-16 encodes the codepoints in the range U+0000 - U+FFFF just like UCS-2, and those in the range U+10000 - U+10FFFF using surrogate pairs, i.e. a pair of two 16-bit values.
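To make the surrogate-pair mechanism concrete, here is a minimal sketch of the arithmetic. The function name `to_surrogate_pair` is made up for illustration. The result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(hex(high), hex(low))

# Cross-check against Python's own encoder (big-endian, no BOM).
expected = chr(0x1F600).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Two surrogates carry 10 bits each, so 20 bits on top of the 0x10000 offset, which is where the U+10FFFF ceiling comes from.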

As this was a bit complicated, UTF-32 was introduced as a simple one-to-one mapping for characters beyond U+FFFF. Now, since UTF-16 can only encode up to U+10FFFF, it was decided that this will be the maximum value that will ever be assigned, so that there will be no further compatibility problems, so UTF-32 indeed just uses 21 bits. As an added bonus, UTF-8, which was initially planned to be a 1-6-byte encoding, now never needs more than 4 bytes for each code point. Therefore it can be easily proven that it never requires more storage than UTF-32.
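The size claims can be checked with a short sketch using Python's built-in codecs. The names `MAX_CODE_POINT`, `SAMPLES` and `encoded_sizes` are chosen here for illustration. The samples sit on each UTF-8 length boundary.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point reachable by UTF-16

# Surrogates (U+D800..U+DFFF) are skipped: they cannot be encoded on their own.
SAMPLES = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT]


def encoded_sizes(code_point):
    """Return (UTF-8 byte count, UTF-32 byte count) for one code point."""
    ch = chr(code_point)
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))


# Number of bits needed to hold the largest code point.
print(MAX_CODE_POINT.bit_length())

for cp in SAMPLES:
    utf8_len, utf32_len = encoded_sizes(cp)
    assert utf8_len <= utf32_len   # UTF-8 never takes more room than UTF-32
    print(f"U+{cp:04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```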

It is true that a hypothetical UTF-24 format would save memory compared with UTF-32. However, its savings would be dubious anyway, as it would mostly consume more storage than UTF-8, except for bursts of emoji or the like, and not many interesting texts of significant length consist solely of emoji.
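A rough way to compare the storage cost is the sketch below. UTF-24 is not a real encoding, so its size is simply assumed to be 3 bytes per code point, and `storage_bytes` is a name made up for this example.

```python
def storage_bytes(text):
    """Byte counts for text in UTF-8, a hypothetical UTF-24, and UTF-32 (no BOM)."""
    n = len(text)  # a Python str is a sequence of code points
    return {
        "utf8": len(text.encode("utf-8")),
        "utf24": 3 * n,   # assumption: fixed 3 bytes per code point
        "utf32": 4 * n,
    }


for label, text in [
    ("ASCII", "Hello, world"),
    ("emoji only", "\U0001F600" * 3),
]:
    print(label, storage_bytes(text))
```

For the ASCII sample the hypothetical UTF-24 takes three times as much room as UTF-8. Only the emoji-only sample comes out smaller in UTF-24.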

However, UTF-32 is used as an in-memory representation for text in programs that need simple indexed access to codepoints: it is the only encoding where the Nth element of a C array is also the Nth codepoint. UTF-24 would do the same with 25% memory savings but more complicated element access.
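The difference in element access can be sketched as follows. The 3-byte little-endian layout for "UTF-24" is an assumption made purely for illustration, and the names `encode_utf24`, `nth_utf24` and `nth_utf32` are hypothetical.

```python
import struct


def encode_utf24(text):
    """Hypothetical UTF-24: each code point stored as 3 little-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "little") for ch in text)


def nth_utf32(buf, n):
    """Nth code point of a UTF-32-LE buffer: one aligned 32-bit load."""
    return struct.unpack_from("<I", buf, 4 * n)[0]


def nth_utf24(buf, n):
    """Nth code point of the hypothetical buffer: three bytes reassembled by hand."""
    return int.from_bytes(buf[3 * n:3 * n + 3], "little")


text = "a\u00e9\u20ac\U0001F600"          # 1-, 2-, 3- and 4-byte UTF-8 characters
utf32 = text.encode("utf-32-le")
utf24 = encode_utf24(text)

for i, ch in enumerate(text):
    assert nth_utf32(utf32, i) == nth_utf24(utf24, i) == ord(ch)

print(len(utf24), len(utf32))             # hypothetical UTF-24 vs UTF-32 size in bytes
```

In both cases the Nth codepoint is found by multiplying the index by a constant, but the 32-bit version maps onto a native integer type while the 24-bit version needs a byte-wise reassembly step.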


It's true that only 21 bits are required (reference), but modern computers are good at moving 32-bit units of things around and generally interacting with them. I don't think I've ever used a programming language that had a 24-bit integer or character type, nor a platform where that was a multiple of the processor's word size (not since I last used an 8-bit computer; UTF-24 would be reasonable on an 8-bit machine), though naturally there have been some.