What would break if the C locale was UTF-8 instead of ASCII?

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands produce output in a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, many applications would misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.

If you want all programs on your system to use UTF-8, set the default locale to a UTF-8 locale. This affects all programs that manipulate a single encoding. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).

You are a bit confused, I think. The "C locale" is a locale like any other, which, as you point out, is conventionally a synonym for 7-bit ASCII.

It's built into the C library, I suppose so that the library has some kind of fallback -- there can't be no locale.

However, this has nothing to do with how programs built from C code deal with input. The locale is used to translate input that is handed to an executable: if the system locale is UTF-8, then UTF-8 is what the program gets, regardless of whether its source was written in C or something else. So:

I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C

Does not really make sense. A minimal piece of standard C source that reads from standard input receives a stream of bytes from the system. If the system uses UTF-8 and it produced the stream from some HID hardware, then that stream may contain UTF-8 encoded characters. If it came from somewhere else (e.g., a network or a file), it might contain anything, which is what makes the assumption of a UTF-8 standard useful.

The fact that the C locale has a much more restricted character set than a UTF-8 locale is unrelated. It's just called "the C locale", but in fact it has no more or less to do with composing C code than any other.

You can, in fact, hardcode UTF-8 characters into C strings in the source. Presuming the system is UTF-8, those strings will look correct when used by the resulting executable.

The "Roger Leigh" link you posted in a comment I believe refers to using an expanded set (UTF-8) as the C locale in a C library destined for an embedded environment, so that no other locale has to be loaded for the system to deal with UTF-8.

So the answer to the question, "What would break if the C locale was UTF-8 instead of ASCII?" is, I would guess, nothing, but outside of an embedded environment and the like there is not much need to do this. Very likely it will become the norm at some point for libraries such as GNU C (it might as well be, I think).


Tags: Unicode, Compatibility, Character Encoding, POSIX, Locale
