wchar_t vs wint_t

wint_t is capable of storing any valid value of wchar_t. A wint_t is also capable of taking on the result of evaluating the WEOF macro (note that a wchar_t might be too narrow to hold the result).
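A quick way to see what your implementation chose (the sizes and WEOF's value are implementation-defined, so the output varies by platform; this sketch merely reports them):

```c
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Both sizes are implementation-defined: glibc/Linux typically has
       32-bit wchar_t and wint_t; Windows uses 16 bits for both. */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    printf("sizeof(wint_t)  = %zu\n", sizeof(wint_t));
    printf("WEOF            = %#lx\n", (unsigned long)WEOF);
    return 0;
}
```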


As @musiphil so nicely put it in his comment, which I'll try to expand on here, there is a conceptual difference between wint_t and wchar_t.

Their different sizes are a technical detail that derives from the fact that each has very distinct semantics:

  • wchar_t is large enough to store characters, or code points if you prefer. As such, its values are character codes, conceptually non-negative (though the standard leaves its signedness implementation-defined). It is analogous to char, which on virtually all platforms was limited to 256 8-bit values. So wide-character string variables are naturally arrays of, or pointers to, this type.

  • Now enter string functions, some of which need to be able to return any wchar_t plus additional statuses. So their return type must be wider than wchar_t, and wint_t is used, which can express any wide character and also WEOF. Being a status, it can also be negative (and usually is), hence wint_t is most likely signed. I say "most likely" because the C standard does not mandate it to be. But regardless of sign, status values need to be outside the range of wchar_t. They are only useful as return values, and never meant to store such characters (see the prototypes quoted after this list).
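The standard prototypes from C99's <wchar.h> make the split visible: storage uses wchar_t, while the one-character readers return wint_t so WEOF fits alongside every valid character:

```c
#include <wchar.h>

/* Storage: wide strings are arrays of wchar_t ... */
wchar_t *fgetws(wchar_t *restrict s, int n, FILE *restrict stream);

/* ... but single-character reads return wint_t, which has room
   for every wchar_t value plus the WEOF status. */
wint_t fgetwc(FILE *stream);
wint_t getwc(FILE *stream);
```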

The analogy with "classic" char and int is a great way to clear up any confusion: strings are not of type int [], they are char var[] (or char *var). Not because char is "half the size of int", but because that is what a string is.

Your code looks correct: c is used to check the result of getwch(), so it is wint_t. And if its value is not WEOF, as your if tests, then it is safe to assign it to a wchar_t character (or store it in a string array, pointer, etc.), as sketched below.
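A minimal sketch of that pattern, using the standard getwchar() (the getwch() in your code is assumed to behave the same way); the setlocale() call is included so that multibyte input gets decoded per the environment's locale:

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");           /* decode input per the user's locale */

    wint_t c;                        /* wide enough for any wchar_t plus WEOF */
    while ((c = getwchar()) != WEOF) {
        wchar_t wc = (wchar_t)c;     /* safe: WEOF has been filtered out */
        putwchar(wc);                /* echo the character back */
    }
    return 0;
}
```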


UTF-8 is one possible encoding for Unicode. It defines 1, 2, 3, or 4 bytes per code point. When you read through getwc(), it will fetch one to four bytes and compose from them a single Unicode code point, which fits within a wchar_t (which can be 16 or even 32 bits wide, depending on the platform).
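To make the bytes-to-code-point step concrete, here is a sketch that performs the same decoding by hand with the standard mbrtowc(); the hard-coded UTF-8 locale name is an assumption and may be spelled differently on your system:

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Locale name is an assumption; "" would use the environment instead. */
    setlocale(LC_ALL, "en_US.UTF-8");

    const char *utf8 = "\xC3\xA9";   /* two UTF-8 bytes encoding U+00E9 'é' */
    wchar_t wc;
    mbstate_t state = {0};

    size_t n = mbrtowc(&wc, utf8, 2, &state);
    if (n != (size_t)-1 && n != (size_t)-2)
        printf("consumed %zu bytes -> code point U+%04X\n", n, (unsigned)wc);
    return 0;
}
```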

But since Unicode code points use every value from 0x0000 to 0xFFFF, a 16-bit wchar_t has no values left over for condition or error codes. (Some have pointed out that Unicode is larger than 16 bits, which is true; in those cases surrogate pairs are used. But the point here is that Unicode uses all of the available values, leaving none for EOF.)

The various status codes include EOF (WEOF), which maps to -1. If you were to put the return value of getwc() into such a wchar_t, there would be no way to distinguish it from a Unicode 0xFFFF character (which, by the way, is reserved anyway, but I digress).
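An illustration of that loss, as a sketch assuming a platform where wchar_t is 16 bits, wint_t is wider, and WEOF is (wint_t)-1 (none of which the standard guarantees):

```c
#include <stdio.h>
#include <wchar.h>

int main(void) {
    wint_t r = getwc(stdin);

    if (r == WEOF) {              /* detectable while still a wint_t */
        fputs("end of input\n", stderr);
        return 0;
    }

    wchar_t w = (wchar_t)r;       /* safe only after the WEOF test: narrowing
                                     WEOF itself would, under the assumptions
                                     above, collapse it to 0xFFFF, which is
                                     indistinguishable from the (reserved)
                                     code point U+FFFF */
    printf("read U+%04X\n", (unsigned)w);
    return 0;
}
```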

So the answer is to use a wider type, a wint_t (or int), which on such platforms is typically 32 bits wide. That leaves the lower 16 bits for the real value, and any value with bits set outside that range signals that something other than a character was returned.

Why don't we always use wchar_t then, instead of wint_t? Most string-related functions use wchar_t because, on platforms where it is narrower than wint_t (often half the size), strings of wchar_t have a smaller memory footprint.
