PostgreSQL: difference between collations 'C' and 'C.UTF-8'

The PostgreSQL documentation leaves a lot to be desired (just sayin' 😼 ).

To start with, there is only one encoding for a particular database, so C and C.UTF-8 in your UTF-8 database are both using the UTF-8 encoding.
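You can confirm this for your own database; in a UTF-8 database, for example:

postgres=# show server_encoding;
 server_encoding 
-----------------
 UTF8
(1 row)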

For libc collations, collation names are, by convention, two-part names with the following structure:

{locale_name}.{encoding_name}

A "locale" (i.e. "culture") is the set of language-specific rules for sorting (LC_COLLATE) and capitalization (LC_CTYPE). Even though there is sometimes overlap, this really doesn't have anything to do with how this data is stored.

An "encoding" is how the data is stored (i.e. what byte sequence equates to which character). Even though there is sometimes overlap, this really doesn't have anything to do with the sorting and capitalization rules of any particular language that uses the encoding (some encodings can be used by multiple languages that can have quite different rules in one or both of those areas).

To illustrate, consider storing Korean data:

  • ko_KR is the locale.
  • Possible encodings that work with this locale are:
    • EUC_KR (Extended UNIX Code-KR)
    • JOHAB
    • UHC (Unified Hangul Code / Windows949)
    • UTF8 (Unicode's 8-bit encoding)
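
On a system that has the Korean locales installed, you can see these locale/encoding pairings in the catalog (a sketch; the exact rows returned depend on which locales your operating system provides):

SELECT collname, collcollate, collctype
FROM   pg_collation
WHERE  collname LIKE 'ko_KR%';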

Also consider the following, taken from the "Collation Support: libc collations" documentation (emphasis added):

For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 ... It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent...

...

Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one less thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.

Meaning, in a database that uses the UTF-8 encoding, en_US and en_US.UTF8 are equivalent. BUT, between that database and a database that uses the LATIN1 encoding, the en_US collations are not equivalent.
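To convince yourself of the within-database equivalence, you can compare the two names directly (a sketch assuming a UTF-8 database on Linux, where glibc provides the en_US locale; note that glibc spells the full name en_US.utf8):

-- Both names resolve to the same set of rules, so the two results agree:
SELECT 'a' < 'B' COLLATE "en_US"      AS stripped_name,
       'a' < 'B' COLLATE "en_US.utf8" AS full_name;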

So, does this mean that C and C.UTF-8 are the same?

NO, that would be too easy!!! The C collation is an exception to the above-stated behavior. The C collation is a simple set of rules that is available regardless of the database's encoding, and its behavior is consistent across encodings. That consistency is possible because it recognizes only the US English alphabet ("a-z" and "A-Z") as "letters", and it sorts by byte value, which yields the same ordering in any of the encodings available to you since every server-side encoding is ASCII-compatible.
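
For example, under C an uppercase letter sorts before any lowercase letter simply because of the byte values (0x41-0x5A vs. 0x61-0x7A):

postgres=# select 'Z' < 'a' collate "C";
 ?column? 
----------
 t
(1 row)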

The C.UTF-8 collation is actually a slightly enhanced set of rules, as compared to the base C rules. This difference can actually be seen in pg_collation since the values for the collcollate and collctype columns are different between the rows for C and C.UTF-8.
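You can check this on your own system with something like the following (on some systems the locale is spelled C.utf8 instead of C.UTF-8):

SELECT collname, collcollate, collctype
FROM   pg_collation
WHERE  collname IN ('C', 'C.UTF-8', 'C.utf8');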

I put together a set of test queries to illustrate some of the similarities and differences between these two collations, as well as compared to en_GB (and implicitly en_GB.utf8). I started with the queries provided in Daniel Vérité's answer, enhanced them to hopefully be clearer about what is and is not being shown, and added a few queries. The results show us that:

  1. C and C.UTF-8 are actually different sets of rules, even if only slightly different, based on their respective values in the collcollate and collctype columns in pg_collation (final query)
  2. C.UTF-8 expands the characters that are considered "letters"
  3. C.UTF-8, unlike C (but like en_GB), recognizes unassigned Unicode code points (e.g. U+0378) and sorts them towards the top
  4. C.UTF-8, like C (but unlike en_GB), sorts non-US-English-letter characters by code point
  5. ucs_basic appears to be equivalent to C (which is stated in the documentation)

You can find, and execute, the queries on: db<>fiddle
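
As a quick standalone check of items 4 and 5 above (a sketch assuming a UTF-8 database on Linux with the en_GB locale installed):

-- Item 4: under both C and C.UTF-8, 'é' (U+00E9) sorts after 'z' (U+007A)
-- by code point, while the linguistic en_GB collation places 'é' next to 'e':
SELECT 'é' > 'z' COLLATE "C"       AS c,
       'é' > 'z' COLLATE "C.UTF-8" AS c_utf8,
       'é' > 'z' COLLATE "en_GB"   AS en_gb;

-- Item 5: ucs_basic sorts by Unicode code point, matching C here:
SELECT 'Z' < 'a' COLLATE "ucs_basic" AS ucs_basic,
       'Z' < 'a' COLLATE "C"         AS c;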


Is it perhaps the case that C.UTF-8 is the same as C with encoding UTF-8?

No. Consider for instance these differences in a UTF-8 database, on Debian 10 Linux:

postgres=# select upper('é' collate "C"), upper('é' collate "C.UTF-8");
 upper | upper 
-------+-------
 é     | É
(1 row)

postgres=# select ('A' < E'\u0378' collate "C"),
                  ('A' < E'\u0378' collate "C.UTF-8");
 ?column? | ?column? 
----------+----------
 t        | f
(1 row)

(U+0378 does not correspond to any valid character in Unicode).

Another example with a valid Unicode character (the left side is 'THUMBS UP SIGN' U+1F44D):

=> select '' < 'A' collate "C";
 ?column? 
----------
 f
(1 row)

=> select '' < 'A' collate "C.UTF-8";
 ?column? 
----------
 t
(1 row)

When lc_collate is "C" (or "POSIX"), the comparison is done internally by PostgreSQL. In that case, it compares the byte representations of the strings using memcmp.

In the other cases where libc is the provider (collprovider='c' in pg_collation), the comparison is done by strcoll_l from the C library, so PostgreSQL itself is not responsible for the result and, as shown by the counter-examples above, there's no reason to believe that it will be identical.
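You can see which provider backs a given collation in the catalog; for instance (the en_GB row exists only if your OS provides that locale):

SELECT collname, collprovider
FROM   pg_collation
WHERE  collname IN ('C', 'C.UTF-8', 'en_GB');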

That's true at least for libc-backed collations. Starting with Postgres version 10, ICU collations may be used. These collations are consistent across operating systems.
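For instance, on a build of PostgreSQL with ICU support (version 10 or later), you could create and use an ICU collation like this (a sketch; the name mycoll is made up):

-- ICU applies linguistic rules, so 'é' sorts next to 'e' (i.e. before 'z'):
CREATE COLLATION mycoll (provider = icu, locale = 'en-GB');
SELECT 'é' < 'z' COLLATE "mycoll";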

The gory details can be found in the source code in backend/utils/adt/varlena.c, especially the varstr_cmp function.