Composite indexes: Most selective column first?

From AskTom

(in 9i, there is a new "index skip scan" -- search for that there to read about that. It makes the index (a,b) OR (b,a) useful in both of the above cases sometimes!)

So, the order of columns in your index depends on HOW YOUR QUERIES are written. You want to be able to use the index for as many queries as you can (so as to cut down on the over all number of indexes you have) -- that will drive the order of the columns. Nothing else (selectivity of a or b does not count at all).

One of the arguments for arranging columns in the composite index in order from the least discriminating(less distinct values) to the most discriminating(more distinct values) is for index key compression.

SQL> create table t as select * from all_objects;

Table created.

SQL> create index t_idx_1 on t(owner,object_type,object_name);

Index created.

SQL> create index t_idx_2 on t(object_name,object_type,owner);

Index created.

SQL> select count(distinct owner), count(distinct object_type), count(distinct object_name ), count(*)  from t;

COUNT(DISTINCTOWNER) COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOBJECT_NAME)      COUNT(*)
-------------------- -------------------------- --------------------------      ----------
                 30                         45                       52205      89807

SQL> analyze index t_idx_1 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave from index_stats;

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584     90          2           28

SQL> analyze index t_idx_2 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave  from index_stats; 

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584     90          1           14

According to the index statistics, the first index is more compressible.

Another is how the index is used in your queries. If your queries mostly use col1,

For example, if you have queries like-

  • select * from t where col1 = :a and col2 = :b;
  • select * from t where col1 = :a;

    -then index(col1,col2) would perform better.

    If your queries mostly use col2,

  • select * from t where col1 = :a and col2 = :b;
  • select * from t where col2 = :b;

    -then index(col2,col1) would perform better. If all of your queries always specify both columns then it doesn't matter which column come first in the composite index.

    In conclusion, the key considerations in column ordering of composite index are index key compression and how you are going to use this index in your queries.

    References:

  • Column order in Index
  • It’s Less Efficient To Have Low Cardinality Leading Columns In An Index (Right) ?
  • Index Skip Scan – Does Index Column Order Matter Any More ? (Warning Sign)


  • Most selective first is useful only when this column is in the actual WHERE clause.

    When the SELECT is by a larger group (less selective), and then possibly by other, non-indexed values, an index with less selective columns may still be useful (if there's a reason not to create another one).

    If there's a table ADDRESS, with

    COUNTRY CITY STREET, something else...

    indexing STREET, CITY, COUNTRY will yield the fastest queries with a street name. But querying all streets of a city, the index will be useless, and the query will likely make a full table scan.

    Indexing COUNTRY, CITY, STREET may be a bit slower for individual streets, but the index can be used for other queries, only selecting by country and/or city.


    When choosing index column order, the overriding concern is:

    Are there (equality) predicates against this column in my queries?

    If a column never appears in a where clause, it's not worth indexing(1)

    OK, so you've got a table and queries against each column. Sometimes more than one.

    How do you decide what to index?

    Let's look at an example. Here's a table with three columns. One holds 10 values, another 1,000, the last 10,000:

    create table t(
      few_vals  varchar2(10),
      many_vals varchar2(10),
      lots_vals varchar2(10)
    );
    
    insert into t 
    with rws as (
      select lpad(mod(rownum, 10), 10, '0'), 
             lpad(mod(rownum, 1000), 10, '0'), 
             lpad(rownum, 10, '0') 
      from dual connect by level <= 10000
    )
      select * from rws;
    
    commit;
    
    select count(distinct few_vals),
           count(distinct many_vals) ,
           count(distinct lots_vals) 
    from   t;
    
    COUNT(DISTINCTFEW_VALS)  COUNT(DISTINCTMANY_VALS)  COUNT(DISTINCTLOTS_VALS)  
    10                       1,000                     10,000     
    

    These are numbers left padded with zeros. This will help make the point about compression later.

    So you've got three common queries:

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    

    What do you index?

    An index on just few_vals is only marginally better than a full table scan:

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select * 
    from table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------  
    | Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |  
    |   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   2 |   VIEW               | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    |   3 |    HASH GROUP BY     |          |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    |   4 |     TABLE ACCESS FULL| T        |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    -------------------------------------------------------------------------------------------
    
    select /*+ index (t (few_vals)) */
           count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      58 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      58 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   3 |    HASH GROUP BY                       |          |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   5 |      INDEX RANGE SCAN                  | FEW      |      1 |   1000 |   1000 |00:00:00.01 |       5 |  
    -------------------------------------------------------------------------------------------------------------
    

    So it's unlikely to be worth indexing on its own. Queries on lots_vals return few rows (just 1 in this case). So this is definitely worth indexing.

    But what about the queries against both columns?

    Should you index:

    ( few_vals, lots_vals )
    

    OR

    ( lots_vals, few_vals )
    

    Trick question!

    The answer is neither.

    Sure few_vals is a long string. So you can get good compression out of it. And you (might) get an index skip scan for the queries using (few_vals, lots_vals) that only have predicates on lots_vals. But I don't here, even though it performs markedly better than a full scan:

    create index few_lots on t(few_vals, lots_vals);
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------  
    | Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |  
    |   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   2 |   VIEW               | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   3 |    HASH GROUP BY     |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   4 |     TABLE ACCESS FULL| T        |      1 |      1 |      1 |00:00:00.01 |      61 |  
    -------------------------------------------------------------------------------------------  
    
    select /*+ index_ss (t few_lots) */count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      13 |     11 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   5 |      INDEX SKIP SCAN                   | FEW_LOTS |      1 |     40 |      1 |00:00:00.01 |      12 |     11 |  
    ----------------------------------------------------------------------------------------------------------------------
    

    Do you like gambling? (2)

    So you still need a an index with lots_vals as the leading column. And at least in this case the compound index (few, lots) does the same amount of work as one on just (lots)

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   5 |      INDEX RANGE SCAN                  | FEW_LOTS |      1 |      1 |      1 |00:00:00.01 |       2 |  
    -------------------------------------------------------------------------------------------------------------  
    
    create index lots on t(lots_vals);
    
    select /*+ index (t (lots_vals)) */count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |      1 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   5 |      INDEX RANGE SCAN                  | LOTS     |      1 |      1 |      1 |00:00:00.01 |       2 |      1 |  
    ----------------------------------------------------------------------------------------------------------------------  
    

    There will be cases where the compound index saves you 1-2 IOs. But is it worth having two indexes for this saving?

    And there's another problem with the composite index. Compare the clustering factor for the three indexes including LOTS_VALS:

    create index lots on t(lots_vals);
    create index lots_few on t(lots_vals, few_vals);
    create index few_lots on t(few_vals, lots_vals);
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor
    from   user_indexes
    where  table_name = 'T';
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_LOTS    47           10,000         530                
    LOTS_FEW    47           10,000         53                 
    LOTS        31           10,000         53                 
    FEW         31           10             530    
    

    Notice that the clustering factor for few_lots is 10x higher than for lots and lots_few! And this is in a demo table with perfect clustering to begin with. In real world databases the effect is likely to be worse.

    So what's so bad about that?

    The clustering factor is one of the key drivers determining how "attractive" an index is. The higher it is, the less likely the optimizer is to choose it. Particularly if lots_vals aren't actually unique, but still normally have few rows per value. If you're unlucky this could be enough to make the optimizer think a full scan is cheaper...

    OK, so composite indexes with few_vals and lots_vals only have edge case benefits.

    What about queries filtering few_vals and many_vals?

    Single columns indexes only give small benefits. But combined they return few values. So a composite index is a good idea. But which way round?

    If you place few first, compressing the leading column will make that smaller

    create index few_many on t(many_vals, few_vals);
    create index many_few on t(few_vals, many_vals);
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_MANY    47           1,000          10,000             
    MANY_FEW    47           1,000          10,000   
    
    alter index few_many rebuild compress 1;
    alter index many_few rebuild compress 1;
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    MANY_FEW    31           1,000          10,000             
    FEW_MANY    34           1,000          10,000      
    

    With fewer different values in the leading column compresses better. So there's marginally less work to read this index. But only slightly. And both are already a good chunk smaller than the original (25% size decrease).

    And you can go further and compress the whole index!

    alter index few_many rebuild compress 2;
    alter index many_few rebuild compress 2;
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_MANY    20           1,000          10,000             
    MANY_FEW    20           1,000          10,000   
    

    Now both indexes are back to the same size. Note this takes advantage of the fact there's a relationship between few and many. Again it's unlikely you'll see this kind of benefit in the real world.

    So far we've only talked about equality checks. Often with composite indexes you'll have an inequality against one of the columns. e.g. queries such as "get the orders/shipments/invoices for a customer in the past N days".

    If you have these kinds of queries, you want the equality against the first column of the index:

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals < '0000000002'
    and    many_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   3 |    HASH GROUP BY                       |          |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   5 |      INDEX RANGE SCAN                  | FEW_MANY |      1 |     10 |     10 |00:00:00.01 |       2 |  
    -------------------------------------------------------------------------------------------------------------  
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001'
    and    many_vals < '0000000002';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |      1 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |      1 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   5 |      INDEX RANGE SCAN                  | MANY_FEW |      1 |      1 |     10 |00:00:00.01 |       2 |      1 |  
    ----------------------------------------------------------------------------------------------------------------------  
    

    Notice they're using the opposite index.

    TL;DR

    • Columns with equality conditions should go first in index.
    • If you have multiple columns with equalities in your query, placing the one with the fewest different values first will give the best compression advantage
    • While index skip scans are possible, you need to be confident this will remain a viable option for the foreseeable future
    • Composite indexes including near-unique columns give minimal benefits. Be sure you really need to save the 1-2 IOs

    1: In some cases it may be worth including a column in an index if this means all the columns in your query are in the index. This enables an index only scan, so you don't need to access the table.

    2: If you're licensed for Diagnostics and Tuning, you could force the plan to a skip scan with SQL Plan Management

    ADDEDNDA

    PS - the docs you've quoted there are from 9i. That's reeeeeeally old. I'd stick with something more recent