C++ UTF-8 literals in GCC and MSVC

They're both wrong.

As far as I can tell, the C++17 standard says here that:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.
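The counting rule quoted above can be checked at compile time. This is only a sketch: `literal_size` is a hypothetical helper introduced for illustration, and the last assertion is deliberately loose because the number of elements produced by a universal-character-name depends on the execution character set.

```cpp
#include <cstddef>

// Hypothetical helper: yields the array size of a narrow string literal,
// i.e. the same value as sizeof(literal), terminating '\0' included.
template <std::size_t N>
constexpr std::size_t literal_size(const char (&)[N]) {
    return N;
}

// Ordinary characters and escape sequences each count as one element.
static_assert(literal_size("abc") == 4, "three characters + NUL");
static_assert(literal_size("a\n") == 3, "'a' + one simple escape + NUL");
static_assert(literal_size("\x41\x42") == 3, "two hex escapes + NUL");

// A universal-character-name contributes at least one element; how many
// exactly depends on the execution character set, so only a lower bound
// is asserted here.
static_assert(literal_size("\u00A0") >= 2, "at least one element + NUL");
```

If these assertions hold, each `\x` escape occupies exactly one element, which is the reading of the standard argued for above.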

There are MSVC tickets for this behaviour, currently marked as Under Investigation:

  • https://developercommunity.visualstudio.com/content/problem/225847/hex-escape-codes-in-a-utf8-literal-are-treated-in.html
  • https://developercommunity.visualstudio.com/content/problem/260684/escape-sequences-in-unicode-string-literals-are-ov.html

However, the standard also says here about UTF-8 literals that:

If the value is not representable with a single UTF-8 code unit, the program is ill-formed.

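Since 0xA0 is not representable as a single UTF-8 code unit (it has the bit pattern of a continuation byte, and U+00A0 itself encodes as the two bytes 0xC2 0xA0), the program should not compile.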

Note that:

  • UTF-8 literals starting with u8 are defined as being narrow.
  • \xA0 is an escape sequence.
  • \u00A0 is considered a universal-character-name and not an escape sequence.
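To make the difference between the two spellings visible, here is a small sketch that dumps the code units of a u8 literal. `hex_units` is a hypothetical helper, templated on the element type so that it accepts both `char` (C++17 u8 literals) and `char8_t` (C++20). What it returns for the `\xA0` case depends on the compiler, which is the whole point of the comparison.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the code units of a string literal
// (excluding the terminating NUL) as space-separated uppercase hex.
template <typename CharT, std::size_t N>
std::string hex_units(const CharT (&lit)[N]) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i + 1 < N; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(lit[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `hex_units(u8"\u00A0")` should give `C2 A0`, the UTF-8 encoding of U+00A0. Calling `hex_units(u8"\xA0")` should give a single `A0` on a compiler that treats the escape as one code unit, whereas the MSVC tickets above describe the escape being expanded into more than one byte.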

The conflict between these two rules is CWG issue 1656.

The issue has been resolved in the current standard draft through P2029R4: a numeric escape sequence is taken by its value as a single code unit, not as a Unicode code point that is then encoded to UTF-8. This holds even if the result is an invalid UTF-8 sequence.

Therefore, under that resolution, GCC's behaviour is (or will be) correct.
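To illustrate the "even if the result is invalid UTF-8" part, here is a minimal well-formedness check. `is_valid_utf8` is a hypothetical helper, not something from the standard library, and the bytes are passed as an ordinary `std::string` so that the example does not depend on how a particular compiler handles u8 literals.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8. Rejects
// stray continuation bytes, truncated sequences, overlong encodings,
// surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;  // 0x80..0xBF or 0xF8..0xFF cannot start a sequence
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

With this check, `is_valid_utf8(std::string("\xA0"))` is false, because a lone continuation byte is not a character, while `is_valid_utf8(std::string("\xC2\xA0"))` is true. So a u8 literal containing `\xA0` is accepted under the resolution even though its contents are not valid UTF-8 on their own.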