Is there a flaw in how clang implements char8_t or does some dark corner of the standard prohibit optimization?

This is not a "bug" in Clang; merely a missed opportunity for optimization.

You can reproduce Clang's output by writing the same function for an enum class whose underlying type is unsigned char. GCC, by contrast, distinguishes between an enumeration with an underlying type of unsigned char and char8_t: it emits the same code for unsigned char and char8_t, but more complex code for the enum class case.
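A minimal sketch of that experiment, assuming the function in question compares two buffers with std::equal. The original function is not shown here, so same_bytes and byte_like are illustrative names, not the asker's code. The char8_t overload is guarded by the feature-test macro because it needs C++20 (or -fchar8_t).

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the function under test: compares two
// buffers of n elements with std::equal.
template <class T>
bool same_bytes(const T* a, const T* b, std::size_t n) {
    return std::equal(a, a + n, b);
}

// An enumeration whose underlying type is unsigned char (illustrative name).
enum class byte_like : unsigned char {};

bool same_uchar(const unsigned char* a, const unsigned char* b, std::size_t n) {
    return same_bytes(a, b, n);
}

bool same_enum(const byte_like* a, const byte_like* b, std::size_t n) {
    return same_bytes(a, b, n);
}

#ifdef __cpp_char8_t  // only available when the compiler enables char8_t
bool same_char8(const char8_t* a, const char8_t* b, std::size_t n) {
    return same_bytes(a, b, n);
}
#endif
```

Compiling the three non-template functions with each compiler and comparing the assembly side by side should show which of them are treated alike.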

So something about Clang's implementation of char8_t seems to think of it more as a user-defined enumeration than as a fundamental type. It's best to just consider it an early implementation of the standard.

It should be noted that one of the most important differences between unsigned char and char8_t is aliasing requirements. unsigned char pointers may alias with pretty much anything else. By contrast, char8_t pointers cannot. As such, it is reasonable to expect (on a mature implementation, not something that beats the standard it implements to market) different code to be emitted in different cases. The trick is that char8_t code ought to be more efficient if it's different, since the compiler no longer has to emit code that performs additional work to deal with potential aliasing from stores.
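To make the aliasing point concrete, here is a small sketch (the function names are made up for illustration). In the unsigned char version the compiler has to assume the store through out may have modified *counter, so it generally reloads the value. In the char8_t version it is allowed to assume no such overlap and may return the constant directly. Whether a given compiler actually takes advantage of this is a separate question.

```cpp
// unsigned char* may alias any object, including *counter, so the store
// through `out` could have changed the int.
int bump_uchar(int* counter, unsigned char* out) {
    *counter = 1;
    *out = 0;
    return *counter;  // generally has to be reloaded from memory
}

#ifdef __cpp_char8_t  // char8_t needs C++20 (or -fchar8_t)
// char8_t* has no aliasing exemption, so the compiler may assume `out`
// does not point into *counter.
int bump_char8(int* counter, char8_t* out) {
    *counter = 1;
    *out = 0;
    return *counter;  // may be folded to the constant 1
}
#endif
```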


  1. In libstdc++, std::equal calls __builtin_memcmp when it detects that the arguments are "simple"; otherwise it uses a naive for loop. "Simple" here means pointers (or certain iterator wrappers around pointers) to the same integer or pointer type. (relevant source code)

    • Whether a type is an integer type is detected by the internal __is_integer trait, but libstdc++ 8.2.0 (the version used on godbolt.org) does not specialize this trait for char8_t, so the latter is not detected as an integer type. (relevant source code)
  2. Clang (with this particular configuration) generates more verbose assembly in the for loop case than in the __builtin_memcmp case. (But the former is not necessarily less optimized in terms of performance; see loop unrolling.)

So there's a reason for this difference, and it's not a bug in clang IMO.
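The dispatch described in point 1 can be sketched roughly as follows. This is a simplified illustration, not the actual libstdc++ source: is_known_integer and equal_sketch are hypothetical names, and std::memcmp stands in for __builtin_memcmp. The trait deliberately lacks a char8_t specialization, so that type falls through to the loop.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical trait standing in for libstdc++'s internal __is_integer.
template <class T> struct is_known_integer : std::false_type {};
template <> struct is_known_integer<unsigned char> : std::true_type {};
// No specialization for char8_t, mirroring the gap described above.

// "Simple" case: a single memcmp over the raw bytes.
template <class T>
bool equal_impl(const T* first, const T* last, const T* other, std::true_type) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    return n == 0 || std::memcmp(first, other, n * sizeof(T)) == 0;
}

// Fallback: naive element-by-element loop.
template <class T>
bool equal_impl(const T* first, const T* last, const T* other, std::false_type) {
    for (; first != last; ++first, ++other)
        if (!(*first == *other)) return false;
    return true;
}

// Picks one of the two overloads at compile time based on the trait.
template <class T>
bool equal_sketch(const T* first, const T* last, const T* other) {
    return equal_impl(first, last, other, is_known_integer<T>{});
}
```

Both paths return the same answer; they differ only in the code the compiler ends up generating for them.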