Is there any reason to prefer UTF-16 over UTF-8?

I imagine C#'s use of UTF-16 derives from the Windows NT family of operating systems, which use UTF-16 internally.

I imagine there are two main reasons why Windows NT uses UTF-16 internally:

  • For memory usage: UTF-32 wastes a lot of space encoding most text.
  • For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16, characters are either a Basic Multilingual Plane character (2 bytes) or a surrogate pair (4 bytes), whereas UTF-8 characters can be anywhere between 1 and 4 bytes (see the sketch below).
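
To make that decoding difference concrete, here is a minimal Java sketch (the class and method names are mine, purely illustrative): a UTF-16 decoder only has to ask one question per code unit, while a UTF-8 decoder has several lead-byte patterns to distinguish.

    class DecodeWidth {
        // UTF-16: a code unit is either a BMP character or the start of a surrogate pair.
        static int utf16CodeUnitsAt(char unit) {
            return Character.isHighSurrogate(unit) ? 2 : 1;   // i.e. 2 or 4 bytes
        }

        // UTF-8: the lead byte alone determines the sequence length (1 to 4 bytes).
        static int utf8SequenceLength(int leadByte) {
            if ((leadByte & 0x80) == 0x00) return 1;   // 0xxxxxxx
            if ((leadByte & 0xE0) == 0xC0) return 2;   // 110xxxxx
            if ((leadByte & 0xF0) == 0xE0) return 3;   // 1110xxxx
            if ((leadByte & 0xF8) == 0xF0) return 4;   // 11110xxx
            throw new IllegalArgumentException("continuation or invalid lead byte");
        }
    }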

Contrary to what other people have answered, you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over the actual characters in a string, you have to use Unicode-aware iteration functions. For example, in C# you need to use StringInfo.GetTextElementEnumerator().
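
Since the rest of this thread is mostly about Java, here is a rough Java analogue of that kind of iteration, as a sketch using java.text.BreakIterator (how precisely it groups combining sequences depends on the JDK version):

    import java.text.BreakIterator;

    class TextElements {
        // Iterates over user-perceived characters ("text elements"),
        // not over 16-bit char values.
        static void printTextElements(String s) {
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            for (int start = it.first(), end = it.next();
                 end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                System.out.println(s.substring(start, end));
            }
        }

        public static void main(String[] args) {
            // "é" (e + combining accent), then U+1D11E (a surrogate pair in UTF-16)
            printTextElements("e\u0301\uD834\uDD1E");
        }
    }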

For further information, this Wikipedia page is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings


East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East Asian characters) than in UTF-8 (typically 3 bytes per character are required).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
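
To put rough numbers on that, a small Java sketch (the sample strings are just illustrative):

    import java.nio.charset.StandardCharsets;

    class EncodedSizes {
        static void compare(String label, String s) {
            System.out.printf("%-9s UTF-8: %2d bytes, UTF-16: %2d bytes%n",
                    label,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length);  // LE so no BOM is counted
        }

        public static void main(String[] args) {
            compare("English", "hello world");     // UTF-8 wins: 11 vs 22 bytes
            compare("Japanese", "こんにちは世界");   // UTF-16 wins: 21 vs 14 bytes
        }
    }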

Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way as combining characters, so UTF-16 can usually be processed as if it were a fixed-size encoding.


@Oak: this is too long for a comment...

I don't know about C# (and it would really surprise me, since it would mean they just copied Java too closely), but for Java it's simple: Java was conceived before Unicode 3.1 came out.

Hence there were fewer than 65,537 code points at the time, so every Unicode code point still fit in 16 bits, and so the Java char was born.

Of course this led to crazy issues that are still affecting Java programmers (like me) today: you have a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method codePointAt (added in Java 5) whose argument is not the number of code points you want to skip! You have to supply to codePointAt the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.
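
To make that concrete, a short sketch (U+1D11E is just a convenient character outside the BMP):

    class CodePointIndexing {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";  // 'a', U+1D11E MUSICAL SYMBOL G CLEF, 'b'

            System.out.println(s.length());                             // 4: counts Java chars, not characters
            System.out.println(Integer.toHexString(s.charAt(1)));       // d834: a lone high surrogate
            System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1d11e: the real code point...
            System.out.println(Integer.toHexString(s.codePointAt(2)));  // dd1e: ...but index 2 lands inside the pair
            System.out.println(s.codePointCount(0, s.length()));        // 3: the actual number of code points
        }
    }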

So, yup, this is definitely wild, and it confuses most Java programmers (most aren't even aware of these issues), and, yup, it is for historical reasons. At least, that was the excuse that came up when people got mad about this issue: Unicode 3.1 wasn't out yet.

:)


It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range, then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.

Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. Indexing by fixed-size units would break down completely with UTF-8, where code points are encoded with sequences of varying length, including odd numbers of bytes.
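
A small sketch of what indexing by 16-bit code units rather than code points looks like in Java (the string is just illustrative):

    class CodeUnitIndexing {
        public static void main(String[] args) {
            String s = "\uD834\uDD1Exyz";  // U+1D11E (a surrogate pair) followed by "xyz"

            System.out.println(s.length());                       // 5 code units
            System.out.println(s.codePointCount(0, s.length()));  // 4 code points

            // Translating a code point offset into a char index has to walk the string.
            int i = s.offsetByCodePoints(0, 2);  // char index of the third code point
            System.out.println(i);               // 3, not 2
            System.out.println(s.charAt(i));     // 'y'
        }
    }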