Indexing & alternatives for low-selectivity columns

Out of all the approaches you have listed only one (use sequential read) is approach that has anything to do with low selectivity (well, clustered can qualify, too).

If you have low selectivity on a column this means that scans will perform better than lookup.

Index can be used to do

  • index lookups - check the index pointer, retrieve record, repeat
  • index scans - scan the index and get values directly from index

otherwise it is not very useful.

If the selectivity is low that means that a large part of the index would be read and, if using lookups, large part of the data would be then read, in some random order. This is inefficient if you cover a significant percentage of the underlying table, so the better method would be to do sequential read (which is also slow).

So if selectivity is low, there's nothing much you can do (clustering can help).

However, I am not convinced that you understand that in your example you do not have low selectivity. As you say most entries will be paid, and very little entries will be allocated. These (allocated) entries will have high selectivity. Especially if there are additional conditions and if there is a composite index containing those additional conditions.

So, you might be banging your head against a non problem.

Now, it is true that you might improve performance further by partitioning data or using supplementary table (if you need to).


I agree with Unreason's However branch. But there are some things to know about this case.

This is called skew and skew kills. This is a perfect use for a partial index where you'd exclude the 95% of paid invoices and only index the more interesting and selective stats. But you don't have that. You can horizontally partition all the rows into separate table/partitions but then you need to account for row migration (moving from one status to another) and that's expensive. The DBMS has to perform an Update, a Delete and an insert to change the status. If you're a high volume system that will hurt.

Forget what you said about whether or not to index based on selectivity because putting an index on a rapidly changing column is also usually a bad idea. Your index will have hot blocks where all the step 1's are being removed and another where all the step 2's are being inserted and oh btw, some step 2's are being removed at the same time into step 3's. This won't scale well.

I would recommend vertically partitioning your status into a separate table(s).

Your invoice table will have a PK and all the columns except status.

Your status you can handle two ways. That table will have the PK value as an FK back to the invoice table, the Status and a timestamp for when you entered that status. The best is a horizontally partitioned table on status. You'll have a partition for each status possible. So finding all or one "Placed" status will partition prune and read only the partition it needs - which is a very small number of blocks. Because the row is so narrow, you might get 400 invoice statuses on a single block. Looking up that status of any one invoice is easy since there's a global index on the PK.

If your RDBMS doesn't support partitioning with row migration, you'll need to manage these partitions as tables and delete from one and insert into another. You'll encapsulate these movements in a transaction in a procedure, so you keep the data clean. Every invoice is in one and only one status table. The harder part is querying by invoice ID, you'll have to check every table to see where it is.

You have another choice You can either write paid statuses or not. If it's a partitioned table, you can just delete the invoice from the invoice status table when it moves to paid. (Of course you'll write a paid record to the history table mentioned in the bonus material). Then you'll do an outer join to the status table and nulls mean paid. If you almost never query for paid status, there's really no reason to make that a fast query.

Bonus Material

in either case you'll want to keep track of these movements in a reporting table. Everytime you update a status, you'll want to write that to a history table. Eventually you'll want to analyze what I call transit times. What's the average time from filled to paid, by month? Is that increasing as a result of the bad economy? what's the transit time from placed to filled, by month. Do the summer months take longer because of missing bodies on vacation? you get the point. By updating that column you're losing those answers, so you'll need to embed that history log into your procedures.