Does Unicode have a defined maximum number of code points?

The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (though not every 21-bit integer is a valid code point: the values 0x110000 through 0x1FFFFF are excluded).

This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.
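
You can see this limit enforced in practice; for instance (using Python purely as an illustration), chr() rejects anything past U+10FFFF:

```python
# 0x110000 is one past the maximum code point, and also the total count.
print(0x110000)                # 1114112

print(repr(chr(0x10FFFF)))     # the last valid code point: '\U0010ffff'

try:
    chr(0x110000)              # one past U+10FFFF
except ValueError as err:
    print(err)                 # chr() arg not in range(0x110000)
```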

However, there is also a set of code points that serve as the surrogates for UTF-16. These occupy the range U+D800 .. U+DFFF: 2,048 code points that can never represent characters on their own.

1,114,112 - 2,048 = 1,112,064
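
Those surrogate code points are excluded from every encoding form; in Python, for example, a lone surrogate can exist in a str but refuses to encode (a quick sketch):

```python
import unicodedata

lone = "\ud800"                         # a lone high surrogate
print(unicodedata.category(lone))       # 'Cs' (Surrogate)

try:
    lone.encode("utf-8")                # no UTF may encode a lone surrogate
except UnicodeEncodeError as err:
    print(err)

# The subtraction above, done in hex:
print(0x110000 - (0xE000 - 0xD800))     # 1114112 - 2048 = 1112064
```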

There are also 66 noncharacters, discussed in part in Corrigendum #9: 34 values, namely the last two code points of each of the 17 planes (U+FFFE and U+FFFF, U+1FFFE and U+1FFFF, …, up to U+10FFFE and U+10FFFF), plus 32 values in the contiguous range U+FDD0 .. U+FDEF. Subtracting those too yields 1,111,998 allocatable characters.

There are also three ranges reserved for private use: U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD.

The number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:

The Unicode Standard, Version 7.0, contains 112,956 characters

So only about 10% of the available code points have been allocated.
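
If you want to check these numbers mechanically, the noncharacters can be enumerated and the assigned code points counted with Python's unicodedata module. Bear in mind the assigned count reflects whichever Unicode version your Python build bundles, and what counts as a "character" depends on which categories you exclude, so the figure may not match the Standard's headline number exactly. A sketch:

```python
import unicodedata

# The 66 noncharacters: 32 in U+FDD0..U+FDEF plus the last two
# code points of each of the 17 planes.
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(17):
    noncharacters.update({plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF})
print(len(noncharacters))                                  # 66

surrogate_count = 0xE000 - 0xD800                          # 2048
print(0x110000 - surrogate_count - len(noncharacters))     # 1111998

# Count assigned code points, excluding unassigned (Cn), surrogates (Cs),
# and private use (Co). The result depends on the UCD version below.
print(unicodedata.unidata_version)
assigned = sum(
    unicodedata.category(chr(cp)) not in ("Cn", "Cs", "Co")
    for cp in range(0x110000)
)
print(assigned)
```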

I can't account for why you found 1,112,114 as the number of code points.

Incidentally, the upper limit U+10FFFF was chosen so that every Unicode value can be represented in UTF-16 using one or two 2-byte code units: a single unit for the BMP (Basic Multilingual Plane, the range U+0000 .. U+FFFF), or a high surrogate followed by a low surrogate for values outside it.
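
The surrogate-pair arithmetic is easy to spell out. Here is a sketch (the helper name to_surrogate_pair is just for illustration), cross-checked against Python's own UTF-16 encoder:

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000              # 20 significant bits remain
    # 0x10000 + 0xFFFFF == 0x10FFFF: the largest value 20 bits can reach,
    # which is exactly why the upper limit sits where it does.
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)         # U+1F600, outside the BMP
print(hex(high), hex(low))                     # 0xd83d 0xde00

# Cross-check with Python's UTF-16 (little-endian, no BOM) encoder.
print("\U0001F600".encode("utf-16-le").hex())  # 3dd800de
```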


Yes, all the code points that can't be represented in UTF-16 (even using surrogate pairs) have been declared invalid.

U+10FFFF is the highest code point, but the surrogates, along with noncharacters such as U+FFFE and U+FFFF, aren't usable code points, so the total count is a bit lower.
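
For what it's worth, Python exposes this ceiling directly: sys.maxunicode is the highest code point the interpreter accepts.

```python
import sys

print(hex(sys.maxunicode))    # 0x10ffff on any modern Python 3
print(sys.maxunicode + 1)     # 1114112 total code point values
```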