Working of indexes in PostgreSQL

This answer is about (default) B-tree indexes. See this later, related answer about GiST, GIN etc.:

  • Why is a GiST index used for filtering on non-leading column?

Here are the results of querying a table on the second column of a multicolumn index.
The effects are easy to reproduce for anybody. Try it at home.

I tested with PostgreSQL 9.0.5 on Debian using a medium sized table of a real-life database with 23322 rows. It implements the n:m relationship between the tables adr (address) and att (attribute), but that's not relevant here. Simplified schema:

CREATE TABLE adratt (
  adratt_id serial PRIMARY KEY
, adr_id    integer NOT NULL
, att_id    integer NOT NULL
, log_up    timestamp NOT NULL DEFAULT (now()::timestamp)
, CONSTRAINT adratt_uni UNIQUE (adr_id, att_id)
);

The UNIQUE constraint effectively implements a unique index. I repeated the test with a plain index to be sure and got identical results as expected.

CREATE INDEX adratt_idx ON adratt(adr_id, att_id);

The table is clustered on the adratt_uni index and before the test I ran:

CLUSTER adratt;
ANALYZE adratt;

Sequential scans for queries on (adr_id, att_id) are as fast as they can possibly be. The multicolumn index can still be used for a query condition on the second index column alone.

I ran the queries a couple of times to populate the cache and the picked the best of ten runs to get comparable results.

1. Query using both columns

SELECT *
FROM   adratt
WHERE  att_id = 90
AND    adr_id = 10;
 adratt_id | adr_id | att_id |       log_up
-----------+--------+--------+---------------------
       123 |     10 |     90 | 2008-07-29 09:35:54
(1 row)

Output of EXPLAIN ANALYZE:

Index Scan using adratt_uni on adratt  (cost=0.00..3.48 rows=1 width=20) (actual time=0.022..0.025 rows=1 loops=1)
  Index Cond: ((adr_id = 10) AND (att_id = 90))
Total runtime: 0.067 ms

2. Query using first column

SELECT * FROM adratt WHERE adr_id = 10;
 adratt_id | adr_id | att_id |       log_up
-----------+--------+--------+---------------------
       126 |     10 |     10 | 2008-07-29 09:35:54
       125 |     10 |     13 | 2008-07-29 09:35:54
      4711 |     10 |     21 | 2008-07-29 09:35:54
     29322 |     10 |     22 | 2011-06-06 15:50:38
     29321 |     10 |     30 | 2011-06-06 15:47:17
       124 |     10 |     62 | 2008-07-29 09:35:54
     21913 |     10 |     78 | 2008-07-29 09:35:54
       123 |     10 |     90 | 2008-07-29 09:35:54
     28352 |     10 |    106 | 2010-11-22 12:37:50
(9 rows)

Output of EXPLAIN ANALYZE:

Index Scan using adratt_uni on adratt  (cost=0.00..8.23 rows=9 width=20) (actual time=0.007..0.023 rows=9 loops=1)
  Index Cond: (adr_id = 10)
Total runtime: 0.058 ms

3. Query using second column

SELECT * FROM adratt WHERE att_id = 90;
 adratt_id | adr_id | att_id |       log_up
-----------+--------+--------+---------------------
       123 |     10 |     90 | 2008-07-29 09:35:54
       180 |     39 |     90 | 2008-08-29 15:46:07
...
(83 rows)

Output of EXPLAIN ANALYZE:

Index Scan using adratt_uni on adratt  (cost=0.00..818.51 rows=83 width=20) (actual time=0.014..0.694 rows=83 loops=1)
  Index Cond: (att_id = 90)
Total runtime: 0.849 ms

4. Disable indexscan & bitmapscan

SET enable_indexscan = off;
SELECT * FROM adratt WHERE att_id = 90;

Output of EXPLAIN ANALYZE:

Bitmap Heap Scan on adratt  (cost=779.94..854.74 rows=83 width=20) (actual time=0.558..0.743 rows=83 loops=1)
  Recheck Cond: (att_id = 90)
  ->  Bitmap Index Scan on adratt_uni  (cost=0.00..779.86 rows=83 width=0) (actual time=0.544..0.544 rows=83 loops=1)
        Index Cond: (att_id = 90)
Total runtime: 0.894 ms

SET enable_bitmapscan = off;
SELECT * FROM adratt WHERE att_id = 90;

Output of EXPLAIN ANALYZE:

Seq Scan on adratt  (cost=0.00..1323.10 rows=83 width=20) (actual time=0.009..2.429 rows=83 loops=1)
  Filter: (att_id = 90)
Total runtime: 2.680 ms

Conclusion

As expected, the multi-column index is used for a query on the second column alone.
As expected, it is less efficient, but the query is still 3x faster than without the index.
After disabling index scans, the query planner chooses a bitmap heap scan, which performs almost as fast. Only after disabling that, too, it falls back to a sequential scan.

See other answer for the original quote from the manual.

Updates since Postgres 9.0

Everything still basically true in Postgres 13. Most notable changes:

  • index-only scans in Postgres 9.2
  • True covering indexes with the INCLUDE keyword in Postgres 11
  • Multiple performance and space improvements (especially for multicolumn indexes) in Postgres 12
  • Deduplication in Postgres 13.

All in favor of index performance. (Sequential scans got faster, too, though.)


re 1) Yes and no.

For a query that uses both columns e.g. where (user_id1, user_id2) = (1,2) it doesn't matter which index is created.

For a query that has a condition on only one of the columns e.g. where user_id1 = 1 it does matter because usually only the "leading" columns can be used for a comparison by the optimizer. So where user_id1 = 1 would be able to use the index (user_id1, user_id2) but it would not be able to an index (user_id2, user_id1) for all cases.

After playing around with this (after Erwin so kindly showed us a setup where it works), it seems that this depends highly on the data distribution of the second column although I haven't yet found out which situation enables the optimizer to use trailing columns for a WHERE condition.

Oracle 11 which can also (sometimes) use columns that are not at the beginning of the index definition.

re 2) Yes it will create an index

Quote from the manual

Adding a primary key will automatically create a unique btree index on the column or group of columns used in the primary key.

re 2a) Primary Key (user_id1,user_id2) will create an index on (user_id1,user_id2) (which you can find out by yourself very easily by simply creating such a primary key)

I highly recommend that you read the chapter about indexes in the manual, it basically answers all the questions above.

Additionally, What index to create? by depesz does a good job explaining order on index columns and other index related topics.


Ad 1)
There are limitations in PostgreSQL like @a_horse_with_no_name describes. Up until version 8.0 multicolumn indexes could only be used for queries on the leading column(s). This has been improved in version 8.1. The current manual for Postgres 10 (updated) explains:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a — but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.

Emphasis mine. I can confirm that from experience.
Also see the test case added my later answer here.