Best way to get distinct values from large table

Using a simple index on the columns required (Year and Month) should greatly improve either a DISTINCT, or GROUP BY Query.

I would not go with a secondary table as this adds extra over head to maintaining the secondary table (inserts/updates deletes will require that you validate the secondary table)

EDIT:

You might even want to consider using Improving Performance with SQL Server 2005 Indexed Views


Make sure to have an Clustered Index on those columns. and partition your table on these date columns an place the datafiles on different disk drives I Believe keeping your index fragmentation low is your best shot.

I also Believe having a physical view with the desired select is not a good idea, because it adds Insert/Update overhead. on average there's 3,5 insert's per minute. or about 17 seconds between each insert (on average please correct me if I'm wrong)

The question is are you selecting more often than every 17 seconds? That's the key thought. Hope it helped.


Use a 'Materialized View', also called an 'Indexed View with Schema Binding', and then index this view. When you do this SQL server will essentially create and maintain the data in a secondary table behind the scenes and choose to use the index on this table when appropriate.

This is similar to what your co-worker suggested, the advantage being you won't need to add logic to your query to take advantage of it, SQL Server will do this when it creates a query plan and SQL Server will also automatically maintain the data in the Indexed View.

Here is how you would accomplish this: create a view that returns the distinct [month] [year] values and then index [year] [month] on the view. Again SQL Server will use the tiny index on the view and avoid the table scan on the big table.

Because SQL server will not let you index a view with the DISTINCT keyword, instead use GROUP BY [year],[month] and use BIG_COUNT(*) in the SELECT. It will look something like this:

CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS

   SELECT
     [year],
     [month],
     COUNT_BIG(*) [MonthCount]
   FROM [dbo].[YourBigTable]
   GROUP BY [year],[month]
GO

CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month 
   ON [dbo].[vwMonthYear](Year,Month)

Now when you SELECT DISTINCT [Year],[Month] on the big table, the query optimizer will scan the tiny index on the view instead of scanning millions of records on the big table.

SELECT DISTINCT
   [year],
   [month]
FROM YourBigTable

This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead on this will be that of maintaining an additional index, so each time the large table is updated the index on the view will also be updated.

If you find this index is substantially slowing down your load times. Drop the index, perform your data load and then recreate it.

Full working example:

    CREATE TABLE YourBigTable(
        YourBigTableID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
        [Year] INT,
        [Month] INT)
    GO  


    CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
    AS

       SELECT
          [year],
          [month],
          COUNT_BIG(*) [MonthCount]
       FROM [dbo].[YourBigTable]
       GROUP BY [year],[month]
    GO

    CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month ON [dbo].[vwMonthYear](Year,Month)


    SELECT DISTINCT
       [year],
       [month]
    FROM YourBigTable

-- Actual execution plan shows SQL server scaning ICU_vwMonthYear_Year_Month