PostgreSQL: difference between collations 'C' and 'C.UTF-8'
The PostgreSQL documentation leaves a lot to be desired (just sayin' 😼 ).
To start with, there is only one encoding for a particular database, so C and C.UTF-8 in your UTF-8 database are both using the UTF-8 encoding.
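You can confirm which encoding and default locale settings each database uses by querying the `pg_database` system catalog, along the lines of this sketch:

```sql
-- List each database's encoding and default locale settings.
-- pg_encoding_to_char() converts the internal encoding number into a name.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;
```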
For libc collations, collation names are, by convention, two-part names with the following structure:
A "locale" (i.e. "culture") is the set of language-specific rules for sorting (LC_COLLATE) and capitalization (LC_CTYPE). Even though there is sometimes overlap, this really doesn't have anything to do with how the data is stored.
An "encoding" is how the data is stored (i.e. what byte sequence equates to which character). Even though there is sometimes overlap, this really doesn't have anything to do with the sorting and capitalization rules of any particular language that uses the encoding (some encodings can be used by multiple languages that can have quite different rules in one or both of those areas).
To illustrate, consider storing Korean data:
- ko_KR is the locale.
- Possible encodings that work with this locale are:
  - EUC_KR (Extended UNIX Code-KR)
  - UHC (Unified Hangul Code / Windows949)
  - UTF8 (Unicode's 8-bit encoding)
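For instance, a database for Korean data could pair the ko_KR locale with any of those encodings. A sketch (the database name is just an example, and the exact locale name — here `ko_KR.euckr` — varies by operating system):

```sql
-- Pair the ko_KR locale with the EUC_KR encoding.
-- TEMPLATE template0 is required when the locale/encoding
-- differ from those of template1.
CREATE DATABASE korean_test
    ENCODING   'EUC_KR'
    LC_COLLATE 'ko_KR.euckr'
    LC_CTYPE   'ko_KR.euckr'
    TEMPLATE   template0;
```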
Also consider the following, taken from the "Collation Support: libc collations" documentation (emphasis added):
> For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 with encoding UTF8... It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent...
>
> Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one less thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.
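You can see both the full and the stripped collation names in the catalog yourself. A quick check (assuming the operating system provides a de_DE locale):

```sql
-- Both de_DE.utf8 and the stripped de_DE should appear,
-- backed by the same underlying OS locale settings.
SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collname LIKE 'de_DE%';
```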
Meaning, in a database that uses the UTF-8 encoding, en_US and en_US.UTF8 are equivalent. BUT, between that database and a database that uses the LATIN1 encoding, the en_US collations are not equivalent.
So, does this mean that C and C.UTF-8 are the same?

NO, that would be too easy!!! The C collation is an exception to the above-stated behavior. The C collation is a simple set of rules that is available regardless of the database's encoding, and the behavior should be consistent across encodings (which is made possible by only recognizing the US English alphabet — "a-z" and "A-Z" — as "letters", and sorting by byte value, which should be the same for the encodings available to you).
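Byte-value sorting means, among other things, that every uppercase US English letter sorts before every lowercase one (in ASCII, 'Z' is 0x5A and 'a' is 0x61). A quick illustration:

```sql
-- Under "C", comparison is by byte value, so 'Z' (0x5A) < 'a' (0x61)
-- and this returns true.
SELECT 'Z' < 'a' COLLATE "C" AS z_before_a;

-- A linguistic collation such as en_US (if available on your system)
-- typically sorts 'a' before 'Z' instead, returning false here.
SELECT 'Z' < 'a' COLLATE "en_US" AS z_before_a;
```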
The C.UTF-8 collation is actually a slightly enhanced set of rules, as compared to the base C rules. This difference can actually be seen in pg_collation, since the values for the collcollate and collctype columns are different between the rows for C and C.UTF-8.
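A sketch of that catalog check (on systems where the OS provides C.UTF-8):

```sql
-- Compare the locale settings backing the two collations;
-- the collcollate and collctype values differ between the rows.
SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collname IN ('C', 'C.UTF-8');
```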
I put together a set of test queries to illustrate some of the similarities and differences between these two collations, as well as compared to en_GB (and implicitly en_GB.utf8). I started with the queries provided in Daniel Vérité's answer, enhanced them to hopefully be clearer about what is and is not being shown, and added a few queries. The results show us that:
- C and C.UTF-8 are actually different sets of rules, even if only slightly different, based on their respective values in the collcollate and collctype columns of pg_collation
- C.UTF-8 expands the characters that are considered "letters" beyond just "a-z" and "A-Z"
- C.UTF-8 (unlike C and en_GB) recognizes invalid Unicode code points (i.e. U+0378) and sorts them towards the top
- C.UTF-8 (like C, but unlike en_GB) sorts non-US-English-letter characters by code point
- ucs_basic appears to be equivalent to C (which is stated in the documentation)
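For example, ucs_basic and "C" agree on byte-order comparisons like the following (a quick spot check, not an exhaustive proof of equivalence):

```sql
-- Both columns should return false: 'B' (0x42) sorts before 'a' (0x61)
-- under code-point / byte order.
SELECT 'a' < 'B' COLLATE ucs_basic AS ucs_basic_result,
       'a' < 'B' COLLATE "C"       AS c_result;
```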
You can find, and execute, the queries on: db<>fiddle
> Is it perhaps the case that C.UTF-8 is the same as C with encoding UTF-8?

No. Consider for instance these differences in a UTF-8 database, on Debian 10 Linux:
```
postgres=# select upper('é' collate "C"), upper('é' collate "C.UTF-8");
 upper | upper
-------+-------
 é     | É
(1 row)

postgres=# select ('A' < E'\u0378' collate "C"), ('A' < E'\u0378' collate "C.UTF-8");
 ?column? | ?column?
----------+----------
 t        | f
(1 row)
```
(U+0378 does not correspond to any valid character in Unicode).
Another example with a valid Unicode character (the left side is 'THUMBS UP SIGN' U+1F44D):
```
=> select '👍' < 'A' collate "C";
 ?column?
----------
 f
(1 row)

=> select '👍' < 'A' collate "C.UTF-8";
 ?column?
----------
 t
(1 row)
```
When lc_collate is "C" (or "POSIX"), the comparison is done internally by PostgreSQL. In that case, it compares the byte representations of the strings using memcmp. In the other cases where libc is the provider (collprovider = 'c' in pg_collation), the comparison is done by strcoll_l from the C library, so PostgreSQL itself is not responsible for the result and, as shown by the counter-examples above, there's no reason to believe that it will be identical.
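To see exactly which bytes the memcmp path compares under "C", you can inspect a string's stored byte sequence; for example:

```sql
-- In UTF-8, 'é' is stored as the two bytes 0xC3 0xA9, while 'A' is 0x41,
-- which is why 'A' < 'é' under the byte-wise "C" collation.
SELECT convert_to('é', 'UTF8') AS e_acute_bytes,
       convert_to('A', 'UTF8') AS a_bytes;
```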
That's true at least for libc-backed collations. Starting with Postgres version 10, ICU collations may be used. These collations are consistent across operating systems.
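A sketch of defining and using an ICU collation (this requires a PostgreSQL build with ICU support; the collation name here is just an example):

```sql
-- Create an ICU collation for US English.
CREATE COLLATION english_icu (provider = icu, locale = 'en-US');

-- Use it explicitly in a comparison; ICU produces the same
-- ordering regardless of the underlying operating system.
SELECT 'a' < 'B' COLLATE english_icu;
```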
The gory details can be found in the source code, in src/backend/utils/adt/varlena.c, especially the varstr_cmp() function.