Trigram search gets much slower as search string gets longer

In PostgreSQL 9.6 there will be a new version of pg_trgm, 1.2, which is much better about this. With a little effort, you can also get this new version to work under PostgreSQL 9.4: you have to apply the patch, then compile and install the extension module yourself.
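Once the patched module is built and installed, switching an existing database over is a one-liner (assuming the patched control file declares version 1.2):

```sql
-- After compiling and installing the patched pg_trgm module:
ALTER EXTENSION pg_trgm UPDATE TO '1.2';

-- Or, in a fresh database:
CREATE EXTENSION pg_trgm VERSION '1.2';
```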

What the old version does is search for each trigram in the query, take the union of the matches, and then apply a filter. What the new version will do is pick the rarest trigram in the query, search for just that one, and then filter on the rest later.
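You can inspect which trigrams a string decomposes into with pg_trgm's show_trgm() function; the rarest of those is what the new code would probe the index with:

```sql
-- show_trgm() returns the set of trigrams pg_trgm extracts from a string
-- (the string is padded with two leading spaces and one trailing space):
SELECT show_trgm('someword');
```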

The machinery to do this does not exist in 9.1. In 9.4 that machinery was added, but pg_trgm wasn't adapted to make use of it at that time.

You would still have a potential DoS issue, as a malicious person can craft a query that has only common trigrams, like '%and%', or even '%a%'.


If you can't upgrade to pg_trgm 1.2, then another way to trick the planner would be:

WHERE (lower(unaccent(label)) like lower(unaccent('%someword%'))) 
AND   (lower(unaccent(label||'')) like 
      lower(unaccent('%someword and some more%')));

By concatenating the empty string to label, you trick the planner into thinking it can't use the index for that part of the WHERE clause. So it uses the index for just the '%someword%' pattern, and applies a filter to just those rows.


Also, if you are always searching for entire words, you could use a function to tokenize the string into an array of words, and use a regular built-in GIN index (not pg_trgm) on that array-returning function.
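A minimal sketch of that whole-word approach (the helper name words_of and the split on non-word characters are my choices, not anything built in; note that unaccent() is not marked IMMUTABLE, so it can't appear directly in an index expression without an immutable wrapper, which is why it is left out here):

```sql
-- Hypothetical helper: split a label into an array of lower-cased words.
-- It must be IMMUTABLE to be usable in an index expression.
CREATE FUNCTION words_of(text) RETURNS text[]
  LANGUAGE sql IMMUTABLE AS
$$ SELECT regexp_split_to_array(lower($1), '\W+') $$;

-- A regular built-in GIN index over the array of words:
CREATE INDEX table1_label_words ON table1 USING gin (words_of(label));

-- Whole-word search: "does the label contain both of these words?"
SELECT id, title, label
FROM   table1
WHERE  words_of(label) @> ARRAY['someword', 'more'];
```

The @> containment operator is supported by the default GIN operator class for arrays, so no pg_trgm is involved, and the search cost does not grow with the length of the words.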


I have found a way to trick the query planner; it is quite a simple hack:

SELECT *
FROM (
   select id, title, label
   from   table1
   where  lower(unaccent(label)) like lower(unaccent('%someword%'))
   ) t1
WHERE lower(lower(unaccent(label))) like lower(unaccent('%someword and some more%'))

EXPLAIN ANALYZE output:

Bitmap Heap Scan on table1  (cost=6749.11..7332.71 rows=1 width=212) (actual time=256.607..256.609 rows=1 loops=1)
  Recheck Cond: (lower(unaccent((label)::text)) ~~ '%someword%'::text)
  Filter: (lower(lower(unaccent((label)::text))) ~~ '%someword and some more%'::text)
  ->  Bitmap Index Scan on table1_label_hun_gin_trgm  (cost=0.00..6749.11 rows=147 width=0) (actual time=256.499..256.499 rows=1 loops=1)
        Index Cond: (lower(unaccent((label)::text)) ~~ '%someword%'::text)
Total runtime: 256.653 ms

So, as there is no index for lower(lower(unaccent(label))), the planner cannot use an index scan for that condition, and it gets turned into a simple filter over the rows the inner index scan returns. What is more, a simple AND will also do the same:

SELECT id, title, label
FROM table1
WHERE lower(unaccent(label)) like lower(unaccent('%someword%'))
AND   lower(lower(unaccent(label))) like lower(unaccent('%someword and some more%'))

Of course, this is a heuristic that may not work well if the part kept for the index scan is very common. But in our database there is not really that much repetition if I use about 10-15 characters.

There are two small questions remaining:

  • Why can't Postgres figure out that something like this would be beneficial?
  • What does Postgres do in the 0..256.499 time range (see the EXPLAIN ANALYZE output)?