PostgreSQL UTF-8 binary collation

The C locale will do. UTF-8 is designed so that byte order is also code point order. This is not obvious, but consider how UTF-8 encodes a code point:

Code point range  Byte 1    Byte 2    Byte 3    Byte 4
0000-007F         0xxxxxxx
0080-07FF         110xxxxx  10xxxxxx
0800-FFFF         1110xxxx  10xxxxxx  10xxxxxx
10000-10FFFF      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

When sorting binary data, which is what the C locale does, the first non-equal byte determines the ordering. What we need to see is that if two numbers encoded into UTF-8 differ, then the first non-equal byte is lower for the lower value. If the numbers are in different ranges, the first byte is indeed lower for the lower number, because the fixed prefix of the leading byte grows with the range. Within the same range, both encodings have the same length and the same fixed prefix bits in every byte, so the order is determined by literally the same bits as without encoding, compared from most significant to least significant.
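The argument above can be checked by brute force. The following is a minimal sketch in Python, not anything PostgreSQL itself runs; the function name utf8_is_order_preserving is made up for this illustration. It walks every encodable code point in ascending order and confirms that each UTF-8 encoding compares strictly greater, byte by byte, than the previous one.

```python
def utf8_is_order_preserving() -> bool:
    """Check that UTF-8 byte order equals code point order.

    Hypothetical helper for illustration only. Comparing neighbours is
    enough: if every encoding is strictly greater than the one before
    it, the whole sequence is strictly increasing.
    """
    previous = None
    for code_point in range(0x110000):
        # Surrogates (U+D800..U+DFFF) are not encodable in UTF-8.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        encoded = chr(code_point).encode("utf-8")
        # bytes objects compare lexicographically by unsigned byte value,
        # which is the same comparison a binary (memcmp-style) sort uses.
        if previous is not None and not previous < encoded:
            return False
        previous = encoded
    return True


if __name__ == "__main__":
    print(utf8_is_order_preserving())
```

Because whole strings are compared at the first differing position in both schemes, the same result carries over from single code points to complete strings.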

The sort order of text depends on lc_collate, not on the system locale. The system locale only serves as the default when the database cluster is created, if you don't provide another locale.

The behaviour you are expecting only works with locale C. Read all about it in the fine manual:

The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

Emphasis mine. PostgreSQL 9.1 adds a couple of new collation features, including per-column and per-expression collations via the COLLATE clause. They might be exactly what you are looking for.
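To get a feel for what "strictly by character code byte values" means in practice, here is a rough emulation in Python. It is only a sketch that assumes a UTF-8 encoded database; the helper name c_collation_sort is invented for this example and is not a PostgreSQL function. It sorts strings by their raw UTF-8 bytes, so uppercase ASCII sorts before lowercase ASCII, and all non-ASCII characters sort after ASCII.

```python
def c_collation_sort(values):
    """Sort strings by their UTF-8 bytes (hypothetical helper).

    This approximates a binary comparison of the encoded strings and
    ignores any language-specific ordering rules.
    """
    return sorted(values, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    # \u00c4 is A with diaeresis, \u00e9 is e with acute accent.
    sample = ["apple", "Zebra", "\u00e9clair", "banana", "\u00c4rger"]
    print(c_collation_sort(sample))
    # prints ['Zebra', 'apple', 'banana', 'Ärger', 'éclair']
```

A linguistic collation such as en_US would typically order these words quite differently, for example by ignoring case at the first comparison level.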
