Does Postgres optimize this JOIN with subquery?

To get an actual performance boost, LEFT JOIN to the aggregating subquery, but repeat the (selective!) predicate of the outer query.

SELECT po.number
     , SUM(grouped_items.total_quantity) AS total_quantity
FROM   purchase_orders po
LEFT   JOIN (
   SELECT purchase_order_id AS id
        , SUM(quantity) AS total_quantity
   FROM   items
   WHERE  purchase_order_id IN (1, 2, 3)  -- repeat selective condition !
   GROUP  BY 1
   ) grouped_items USING (id)
WHERE  po.id IN (1, 2, 3)
GROUP  BY po.number;

Luckily, this is possible in your case: the predicate is applicable in the subquery. A bit more verbose, but typically delivers best performance regardless of what percentage of rows in items is involved. My rule of thumb is "aggregate first and join later". See:

Postgres CTE optimization with nested json_build_object

Other cases are not so lucky. Then you have to decide which way to go. Postgres is not smart enough to optimize much, as jjanes explained. If all or most rows from items are involved, aggregating in a subquery is typically (much) faster. If only few rows are involved, correlated subqueries or an equivalent LATERAL subquery is typically (much) faster. See:

What is the difference between LATERAL and a subquery in PostgreSQL?

For only 3 rows from the outer query (WHERE po.id IN (1, 2, 3)), a correlated subquery can't go wrong. But I assume that's just simplification for the demo.

I've been looking into this issue a bit myself lately, and my conclusion is that the planner is not smart enough to optimize this particular thing. The correlated subselect will be executed once for each row even if that is a huge number of rows, while the uncorrelated one will be executed to completion even if only a few rows from it are needed.

It does know that one will be faster than the other (assuming the estimated row counts are reasonably correct) but it lacks the capacity to recognize the two formulations are identical and so choose between execution plans based on estimated performance.

Although in your case, the queries would not be identical because they handle missing rows in "items" differently. The correlated subselect would be identical to the left join, not the inner join.

Does Postgres optimize this JOIN with subquery?

Tags:

Postgresql

Join

Subquery

Related

Recent Posts