Java - what are characters, code points and surrogates? What difference is there between them?

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109,384 symbols, which is way more than 2^16 (= 65,536).
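As a tiny illustration of the mapping, the letter 'A' is assigned the number 65 (U+0041) by both ASCII and Unicode, which you can check directly in Java:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // 'A' maps to the number 65 (U+0041) in both ASCII and Unicode
        System.out.println((int) 'A');  // 65
        System.out.println("\u0041");   // A
    }
}
```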

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.

When you try to use an encoding which uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.

Thus, surrogates are 16-bit values that, used in pairs, represent symbols that do not fit into a single two-byte value.

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 code unit.
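One visible consequence: String.length() counts char values (UTF-16 code units), not symbols. A string containing a symbol outside the 16-bit range therefore reports a longer length than its number of code points (the emoji code point below is just a sample value):

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // U+1F600 does not fit into one char, so it is written as two
        String s = "A\uD83D\uDE00"; // "A" followed by U+1F600 (a grinning-face emoji)
        System.out.println(s.length());                      // 3 chars (code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
    }
}
```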

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2


You can find a short explanation in the Javadoc for the class java.lang.Character:

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. [..]

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

In other words:

A code point usually represents a single character. Originally, the values of type char matched exactly the Unicode code points. This encoding was also known as UCS-2.

For that reason, char was defined as a 16-bit type. However, there are currently more than 2^16 characters in Unicode. To support the whole character set, the encoding was changed from the fixed-length encoding UCS-2 to the variable-length encoding UTF-16. Within this encoding, each code point is represented by a single char or by two chars. In the latter case, the two chars are called a surrogate pair.

UTF-16 was defined in such a way that there is no difference between text encoded with UTF-16 and UCS-2 if all code points are below 2^16. That means char can be used to represent some but not all characters. If a character cannot be represented within a single char, the term char is misleading, because it is then just used as a 16-bit word.
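Character.toChars demonstrates both cases: a BMP code point becomes a single char, while a supplementary code point becomes a surrogate pair (U+1F600 is just a sample supplementary value):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // A BMP code point fits into a single char
        System.out.println(Character.toChars(0x0041).length); // 1

        // A supplementary code point needs a surrogate pair
        char[] pair = Character.toChars(0x1F600);
        System.out.println(pair.length);                            // 2
        System.out.printf("%X %X%n", (int) pair[0], (int) pair[1]); // D83D DE00
        System.out.println(Character.isHighSurrogate(pair[0]));     // true
        System.out.println(Character.isLowSurrogate(pair[1]));      // true
    }
}
```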


The term code point typically refers to Unicode codepoints. The Unicode glossary says this:

Codepoint(1): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆.

In Java, a character (char) is an unsigned 16 bit value; i.e. 0 to FFFF.

As you can see, there are more Unicode codepoints than can be represented as Java characters. And yet Java needs to be able to represent text using all valid Unicode codepoints.

The way that Java deals with this is to represent codepoints that are larger than FFFF as a pair of characters (code units); i.e. a surrogate pair. These encode a Unicode codepoint that is larger than FFFF as a pair of 16 bit values. This uses the fact that a subrange of the Unicode code space (i.e. U+D800 to U+DFFF) is reserved for representing surrogate pairs. The technical details are here.
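Going the other way, String.codePointAt reassembles a surrogate pair into the codepoint it represents, while charAt hands you only the raw 16 bit code unit at that index:

```java
public class ReadSurrogates {
    public static void main(String[] args) {
        // One supplementary codepoint, stored as a surrogate pair
        String s = new String(Character.toChars(0x1F600));
        System.out.printf("%X%n", (int) s.charAt(0)); // D83D - just the high surrogate
        System.out.printf("%X%n", s.codePointAt(0));  // 1F600 - the full codepoint
    }
}
```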


The proper term for the encoding that Java is using is the UTF-16 Encoding Form.

Another term that you might see is code unit which is the minimum representational unit used in a particular encoding. In UTF-16 the code unit is 16 bits, which corresponds to a Java char. Other encodings (e.g. UTF-8, ISO 8859-1, etc) have 8 bit code units, and UTF-32 has a 32 bit code unit.
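You can observe the different code unit sizes by encoding the same text with different charsets and counting the bytes (the two sample strings below are arbitrary choices):

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitSizes {
    public static void main(String[] args) {
        String bmp = "A"; // U+0041, one code unit in each encoding shown
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length);    // 1 (one 8-bit unit)
        System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length); // 2 (one 16-bit unit)

        String supp = new String(Character.toChars(0x1F600)); // outside the BMP
        System.out.println(supp.getBytes(StandardCharsets.UTF_8).length);    // 4 (four 8-bit units)
        System.out.println(supp.getBytes(StandardCharsets.UTF_16BE).length); // 4 (two 16-bit units)
    }
}
```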


The term character has many meanings. It means all sorts of things in different contexts. The Unicode glossary gives 4 meanings for Character as follows:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.

Character. (2) Synonym for abstract character. (Abstract Character. A unit of information used for the organization, control, or representation of textual data.)

Character. (3) The basic unit of encoding for the Unicode character encoding.

Character. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

And then there is the Java specific meaning for character; i.e. a 16 bit unsigned number (of type char) that may represent either a complete Unicode codepoint or just part of one in the UTF-16 encoding.
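Because a lone char may be only half of a codepoint, iterating with charAt can split surrogate pairs. The String.codePoints() stream (available since Java 8) iterates whole codepoints instead, as this small sketch shows:

```java
public class IterateCodePoints {
    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        // Iterating char by char would visit 4 code units, splitting the pair:
        System.out.println(s.length()); // 4
        // Iterating codepoints visits 3 complete characters:
        s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp)); // U+61, U+1F600, U+62
    }
}
```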