Optimizing "WHERE x BETWEEN a AND b GROUP BY y" query

Idea 1

Judging by their names, the columns "denormalizedData" and "hugeText" seem to be comparatively big, probably many times as big as the columns involved in your query. Size matters for big queries like this. Very big values (> 2 kB) for text or jsonb get "toasted" (stored out of line), which mitigates the worst of it. But even the rest - smaller values stored inline - can be several times as big as the columns relevant to your query, which add up to around 100 bytes.
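
If you want to verify that hunch, you can measure average per-column storage with pg_column_size(). The table name bigtable is a placeholder; the column names are the ones from your question, and TABLESAMPLE needs Postgres 9.5+:

    -- Average stored size per column, over a ~1 % block sample.
    SELECT avg(pg_column_size("denormalizedData")) AS avg_denormalized_bytes
         , avg(pg_column_size("hugeText"))         AS avg_hugetext_bytes
         , avg(pg_column_size("textCol"))          AS avg_textcol_bytes
         , avg(pg_column_size("timestampCol"))     AS avg_timestamp_bytes
    FROM   bigtable TABLESAMPLE SYSTEM (1);  -- drop the sample clause for exact (slower) numbers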

Splitting the columns relevant to the query out into a separate 1:1 table might go a long way. (It depends on the complete situation: you add some storage overhead for another row header and another PK, and writing to the tables gets a bit more complicated and expensive.)
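
A minimal sketch of such a split, assuming "id" is the existing PK; the table name bigtable_extra and the column types are made up. Note that dropped columns only free their space once the rows are rewritten (e.g. by VACUUM FULL or pg_repack):

    BEGIN;

    -- Wide, rarely-queried columns move to a 1:1 side table.
    CREATE TABLE bigtable_extra (
        id                 int PRIMARY KEY REFERENCES bigtable (id)
      , "denormalizedData" jsonb
      , "hugeText"         text
    );

    INSERT INTO bigtable_extra (id, "denormalizedData", "hugeText")
    SELECT id, "denormalizedData", "hugeText"
    FROM   bigtable;

    -- The hot table keeps only the ~100 bytes per row your query needs.
    ALTER TABLE bigtable
        DROP COLUMN "denormalizedData"
      , DROP COLUMN "hugeText";

    COMMIT;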

Idea 2

Also, as you confirmed, only 4 columns are relevant to determine the top 50.

You might have an angle there for a much smaller materialized view (MV) containing just those columns plus "timestampCol" and "textCol", and only the "last 2 weeks" or "last month" of data. Run a fast query on the MV to identify the top 50 "textCol" values, and only retrieve those rows from the big table - or, to be precise, just the additional columns not contained in your MV, since you already get sums for the MV columns in the first step.
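
A sketch of both steps; mv_recent, "sumCol1", "sumCol2", "otherCol" and the $1/$2 bounds are placeholders, and the requested range must of course fall inside the window the MV covers:

    -- Small MV: only the ranking-relevant columns, only recent data.
    CREATE MATERIALIZED VIEW mv_recent AS
    SELECT "textCol", "timestampCol", "sumCol1", "sumCol2"
    FROM   bigtable
    WHERE  "timestampCol" >= now() - interval '1 month';

    -- Step 1: top 50 from the small MV.
    -- Step 2: fetch sums for the remaining columns from the big table.
    SELECT b."textCol", t.sum1, t.sum2, sum(b."otherCol") AS other_sum
    FROM  (
       SELECT "textCol", sum("sumCol1") AS sum1, sum("sumCol2") AS sum2
       FROM   mv_recent
       WHERE  "timestampCol" BETWEEN $1 AND $2
       GROUP  BY 1
       ORDER  BY sum1 DESC
       LIMIT  50
       ) t
    JOIN   bigtable b USING ("textCol")
    WHERE  b."timestampCol" BETWEEN $1 AND $2
    GROUP  BY b."textCol", t.sum1, t.sum2;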

You only need an index on ("textCol") for the big table, and another one on ("timestampCol") for the MV - which would only be used for instances of your query with a selective WHERE clause. Otherwise, it will be cheaper to sequentially scan the whole MV.
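
In DDL terms (index names are arbitrary):

    CREATE INDEX bigtable_textcol_idx ON bigtable  ("textCol");
    CREATE INDEX mv_recent_ts_idx     ON mv_recent ("timestampCol");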

If many of your queries cover the same period of time, you might go one step further: only save one row per "textCol" in the MV, with pre-aggregated sums (maybe two or more MVs for a couple of frequently used time periods). You get the idea. That should be much faster still.
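
For example, a pre-aggregated MV for the "last month" case might look like this (names are placeholders again):

    -- One row per "textCol", sums precomputed for a fixed period.
    CREATE MATERIALIZED VIEW mv_last_month_agg AS
    SELECT "textCol"
         , sum("sumCol1") AS sum1
         , sum("sumCol2") AS sum2
    FROM   bigtable
    WHERE  "timestampCol" >= now() - interval '1 month'
    GROUP  BY 1;

    -- The ranking query then collapses to:
    SELECT * FROM mv_last_month_agg ORDER BY sum1 DESC LIMIT 50;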

You might even create the MVs with the whole result set and refresh before the first new query for the day.
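
Refreshing is a single statement you can schedule (e.g. from a nightly cron job); with a unique index on the MV you can even refresh without blocking readers (Postgres 9.4+):

    -- Plain refresh (takes an exclusive lock on the MV while it runs):
    REFRESH MATERIALIZED VIEW mv_last_month_agg;

    -- Or, given a unique index, refresh without locking out readers:
    CREATE UNIQUE INDEX ON mv_last_month_agg ("textCol");
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv_last_month_agg;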

Depending on exact numbers, you might combine both ideas.


You are correct that PostgreSQL cannot currently use one index to provide selectivity and another index to provide order, in the way you want. Adding this feature has been discussed, but I don't think anyone considers it a high priority.

> Creating a multicolumn index will not help in any way. From my testing Postgres will not change its query plan at all whether it's ("textCol", "timestampCol") or ("timestampCol", "textCol").

PostgreSQL is capable of using an index like ("textCol", "timestampCol") such that "textCol" provides order and "timestampCol" provides filter selectivity. That is, it can filter on "timestampCol" directly in the index, without having to go to the table for the rows that fail that filter. It can't provide "jump to" selectivity, where it could skip over entries without even inspecting them. If you haven't seen such plans, it is probably because PostgreSQL always finds them inferior to the other possibilities.
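
For illustration, this is the kind of index and query shape that statement refers to; the index name and the summed column are placeholders, and the planner may still prefer a sequential scan plus HashAggregate if it estimates that to be cheaper:

    CREATE INDEX bigtable_text_ts_idx ON bigtable ("textCol", "timestampCol");

    -- A full index scan can return groups in "textCol" order and check the
    -- "timestampCol" range against index entries directly, but it still walks
    -- past every entry that fails the range check (no skip to the next "textCol").
    SELECT "textCol", sum("sumCol1")
    FROM   bigtable
    WHERE  "timestampCol" BETWEEN $1 AND $2
    GROUP  BY "textCol";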

Why are you fighting the HashAggregate, which is already very good? It is not clear whether what you want would actually be faster, even if it were possible. In PostgreSQL 9.6, you will be able to get parallel execution of the hash aggregates. If that isn't enough, I think your only real hope is partitioning the table, and that would only help with some of the use cases you mentioned.
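
If you do go the partitioning route, here is a sketch using the declarative syntax of PostgreSQL 10+ (on 9.x you would emulate this with table inheritance and constraint exclusion); all table names and the column types are made up:

    -- Range-partition by "timestampCol" so time-bounded queries
    -- only touch the partitions covering the requested range.
    CREATE TABLE bigtable_part (
        "textCol"      text
      , "timestampCol" timestamptz
      , "sumCol1"      numeric
        -- ...remaining columns...
    ) PARTITION BY RANGE ("timestampCol");

    CREATE TABLE bigtable_part_2016_01 PARTITION OF bigtable_part
        FOR VALUES FROM ('2016-01-01') TO ('2016-02-01');
    CREATE TABLE bigtable_part_2016_02 PARTITION OF bigtable_part
        FOR VALUES FROM ('2016-02-01') TO ('2016-03-01');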