How to choose a collation for an international database?

The C collation is the right choice.

Everything is a bit faster without locale-aware comparisons. And since no single collation is right for every language anyway, create the database without a locale-specific collation, meaning with C.
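A minimal sketch of creating such a database, assuming UTF-8 encoding. The database name intl_db_c is made up for illustration. TEMPLATE template0 is needed when the locale settings differ from those of the default template database:

```sql
-- Hypothetical database name; C for both collation and character classification.
CREATE DATABASE intl_db_c
  TEMPLATE   template0
  ENCODING   'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE   'C';
```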

It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all, the stored data is just unsorted data, and collation rules are only applied when sorting or comparing.
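To illustrate what an ad-hoc collation looks like, here is a small self-contained sketch. The sample words are invented, and it assumes a collation named "de_DE" is available in your installation:

```sql
-- Sort inline sample values with German rules, regardless of the database default.
-- Assumes a collation named "de_DE" exists in this installation.
SELECT word
FROM  (VALUES ('Zebra'), ('Äpfel'), ('apfel'), ('Zug')) AS t(word)
ORDER  BY word COLLATE "de_DE";
```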

Be aware that Postgres builds on the locale settings provided by the underlying OS, so every locale you intend to use has to be generated (installed) on the OS first. There is more on this in related answers on Stack Overflow.
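One way to check what is available is to look at the pg_collation catalog. The following is a sketch. The collation name german is hypothetical, and the CREATE COLLATION statement only succeeds if the OS actually provides the de_DE.utf8 locale:

```sql
-- Collations currently known to this database.
SELECT collname, collcollate, collctype
FROM   pg_collation
ORDER  BY collname;

-- Register an OS locale under a name of your choosing ("german" is hypothetical).
-- Fails if the OS does not provide the locale 'de_DE.utf8'.
CREATE COLLATION german (locale = 'de_DE.utf8');
```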

However, as @Craig already mentioned, indexes are the bottleneck in this scenario. The collation of the index has to match the collation of the applied operator in many cases that involve character data.

You can use the COLLATE specifier in indexes to produce matching indexes. Partial indexes may be the perfect choice if you are mixing data in the same table.

For example, a table with international strings:

CREATE TABLE string (
   string_id serial
  ,lang_id   int NOT NULL
  ,string    text NOT NULL
);

And you are mostly interested in one language at a time:

SELECT *
FROM   string
WHERE  lang_id = 5  -- 5 being German / Germany here
AND    string > 'foo' COLLATE "de_DE"
ORDER  BY string COLLATE "de_DE";

Then create partial indexes like:

CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;

One for each language you need.
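As a sketch of what "one for each language" could look like, assume lang_id 7 stands for French / France (a made-up mapping) and that a "fr_FR" collation exists. EXPLAIN can then be used to check whether the planner picks the matching partial index:

```sql
-- Hypothetical second language: lang_id 7 standing in for French / France.
CREATE INDEX string_string_fr_idx ON string (string COLLATE "fr_FR")
WHERE  lang_id = 7;

-- Check whether the planner uses the matching partial index.
EXPLAIN
SELECT *
FROM   string
WHERE  lang_id = 7
AND    string > 'foo' COLLATE "fr_FR"
ORDER  BY string COLLATE "fr_FR";
```

On a small table the planner may still prefer a sequential scan, so test with a realistic amount of data.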

Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.
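A rough sketch of the inheritance variant, reading "plain index" as a non-partial index. The child table name string_de and the index name are assumptions for illustration:

```sql
-- Hypothetical child table holding only German strings (lang_id = 5).
CREATE TABLE string_de (
   CHECK (lang_id = 5)
) INHERITS (string);

-- Non-partial index on the child; the COLLATE clause still has to match the queries.
CREATE INDEX string_de_string_idx ON string_de (string COLLATE "de_DE");

-- Query the child table directly for a single language.
SELECT *
FROM   string_de
WHERE  string > 'foo' COLLATE "de_DE"
ORDER  BY string COLLATE "de_DE";
```

Note that rows have to be inserted into the child table directly. Plain inheritance does not route inserts on the parent to a child.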

I suggest you pick a collation that provides the default Unicode ordering. That way, you get sane results even if you don't override the collation in each query. Unfortunately, most (all?) operating systems don't provide a locale that is simply named "default Unicode" or something like that, so you will have to guess and/or research a good choice. For example, on Linux/glibc, the de_DE.utf8 or en_US.utf8 locales simply pass through the default behavior, so both of those are good choices.
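For instance, a database along these lines might be created as follows. This is a sketch that assumes the OS provides en_US.utf8. The database name intl_db is made up:

```sql
-- Hypothetical database name; requires the en_US.utf8 locale on the OS.
CREATE DATABASE intl_db
  TEMPLATE   template0
  ENCODING   'UTF8'
  LC_COLLATE 'en_US.utf8'
  LC_CTYPE   'en_US.utf8';

-- Verify the locale settings that were actually stored.
SELECT datname, datcollate, datctype
FROM   pg_database
WHERE  datname = 'intl_db';
```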

I don't think using the C locale is a good idea, because then the default behavior of your application will be useless. And you might not get proper behavior from case conversion operations.
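A quick way to see the case-conversion difference for yourself is to run the same function under both collations. This sketch assumes a UTF-8 database and an available "de_DE" collation. With C, typically only ASCII letters are converted, but the exact result depends on your setup:

```sql
-- Compare case conversion under two collations.
-- Assumes a UTF-8 database and that the "de_DE" collation exists.
SELECT upper('müller' COLLATE "C")     AS upper_c
     , upper('müller' COLLATE "de_DE") AS upper_de;
```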

(Overriding the collation in a query doesn't have much overhead. It's just a parse-time operation.)
