Is C++20 'char8_t' the same as our old 'char'?

No, char8_t is not the same type as char. Its underlying type is unsigned char, though, per [basic.fundamental]/9:

Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.

emphasis mine


Do note that since the standard calls it a distinct type, code like

std::cout << std::is_same_v<unsigned char, char8_t>;

will print 0 (false), even though the underlying type of char8_t is unsigned char. This is because char8_t is not an alias, but a distinct type.


Another thing to note is that char has the same range and representation as either signed char or unsigned char; which one is implementation-defined. That means it is possible for char to have the same range and representation as char8_t, but they are still separate types. char, signed char, unsigned char, and char8_t are the same size, but they are all distinct types.


Disclaimer: I'm the author of the char8_t P0482 and P1423 proposals.

In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2231 (which is in need of an update and re-proposal to WG14), char8_t would be a typedef of unsigned char similar to the existing typedefs for char16_t and char32_t.

In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8-bit, but may be larger), alignment, and integer conversion rank as unsigned char, but has different aliasing rules.

In particular, char8_t was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. For example:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.

  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).

  3. To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).

  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.

Tags: C++, C++20, C++14