Finding missing gaps of data in a table with ~2.5 Million rows

To start: with only 202 months to check it won't be a huge issue, but a recursive CTE is generally the worst possible way to derive a set, in terms of performance (I prove this here and here).

If you're going to be running this query more than once (and it sounds like you will be, until you solve the separate issue of who/what is deleting this data and creating the gaps in the first place), why not just build a months table that will always be there?

CREATE TABLE dbo.Months([Month] date PRIMARY KEY);

DECLARE @StartDate     date = '20000101', 
        @NumberOfYears int  = 30;

-- One row per month, starting at @StartDate; spt_values is only a convenient row source
INSERT dbo.Months([Month])
  SELECT TOP (12*@NumberOfYears) 
    DATEADD(MONTH, ROW_NUMBER() OVER (ORDER BY number) - 1, @StartDate) 
  FROM master.dbo.spt_values;

30 years of months, which will work through the year 2029, stored in a whopping 72kb. When I first wrote this I sarcastically emphasized whopping, but I should explain why this has 9 pages instead of the expected 2. In current versions of SQL Server (I initially tested this on SQL Server 2016, but the same is true in v.Next), the storage engine reserves an entire, uniform extent for new objects. This is 8 x 8kb pages plus the IAM page, for 72kb. In this case only one of the data pages is actually required, so 7 remain unallocated. This means they won't show up in all catalog views, but they're still easy to find.

You can turn this behavior off for user databases, but personally I wouldn't (they made it the default for a reason). Your first instinct might be about saving memory rather than disk space, but while this puts 72kb on disk, only 16kb will ever be loaded into the buffer pool. So no need to panic about that.
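If you want to check the space figures for yourself, something like the following sketch should do it. It assumes the dbo.Months table created above; which numbers you see depends on your version and database settings, so treat it as a way to look rather than a promise of specific output.

```sql
-- Reserved vs. used space for the table, as reported by the built-in procedure
EXEC sys.sp_spaceused @objname = N'dbo.Months';

-- The same idea at the page level, per partition
SELECT row_count,
       in_row_data_page_count,
       used_page_count,
       reserved_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.Months');
```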

Now your query can be:

DECLARE @startDate date = '20000101', @endDate date = '20161101';

;WITH shortcodes AS
(
  SELECT DISTINCT ShortCode 
  FROM dbo.VWTBL_INDICATOR
  WHERE MonthYear >= @startDate AND MonthYear <= @endDate
)
SELECT m.[Month], s.ShortCode 
FROM dbo.Months AS m
CROSS JOIN shortcodes AS s
LEFT OUTER JOIN dbo.VWTBL_INDICATOR AS vwtbl
ON s.ShortCode = vwtbl.ShortCode
AND m.[Month] = vwtbl.MonthYear
WHERE m.[Month] >= @startDate AND m.[Month] <= @endDate
AND vwtbl.MonthYear IS NULL;

Note that currently this will identify all months in your defined range where a ShortCode doesn't appear, even if it's outside the range that is valid for that ShortCode. If those valid ranges per ShortCode are defined somewhere, please add that information to the question.
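Until those valid ranges are known, one possible stopgap is to assume each ShortCode is only valid between the first and last month it actually appears within the window. That is purely an assumption on my part, and it means leading and trailing gaps will not be reported, only gaps in the middle. A sketch:

```sql
DECLARE @startDate date = '20000101', @endDate date = '20161101';

;WITH ranges AS
(
  -- Assumed "valid range": first and last month seen per ShortCode
  SELECT ShortCode,
         MIN(MonthYear) AS FirstMonth,
         MAX(MonthYear) AS LastMonth
  FROM dbo.VWTBL_INDICATOR
  WHERE MonthYear >= @startDate AND MonthYear <= @endDate
  GROUP BY ShortCode
)
SELECT m.[Month], r.ShortCode
FROM ranges AS r
INNER JOIN dbo.Months AS m
  ON m.[Month] >= r.FirstMonth
 AND m.[Month] <= r.LastMonth
WHERE NOT EXISTS
(
  -- Keep only the months with no matching row for this ShortCode
  SELECT 1
  FROM dbo.VWTBL_INDICATOR AS v
  WHERE v.ShortCode = r.ShortCode
    AND v.MonthYear = m.[Month]
)
ORDER BY r.ShortCode, m.[Month];
```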

What on earth is a "VWTBL"?


I'm going to address the question you didn't ask: why is my data disappearing?

Data can't just disappear from SQL tables on its own (without corruption); there must be something deleting it.

It could be a malicious user or something, but in my experience it is much more likely to be something like a poorly written archive routine that is catching more rows than intended. Are there maintenance routines that run on the database to clean up old records?

You mentioned you contract out some of the database support. Can you raise this as a high-priority issue with them? It could be one of their routines doing it.

Also, these rows might not be deleted how you think: maybe there is a badly written query that updates a bunch of rows with the wrong date, and a different routine that then flags them as invalid or duplicate and deletes them.

Finally, is this table partitioned? If it is partitioned by date, and you do some fancy rolling date windows there could be issues with how exactly that is set up.
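A quick way to answer the partitioning question is to look for a partition scheme behind the table's indexes. This sketch assumes dbo.VWTBL_INDICATOR is a base table (if it is a view, point the query at the underlying table instead). No rows back means the table is not partitioned.

```sql
SELECT i.name  AS index_name,
       ps.name AS partition_scheme,
       pf.name AS partition_function
FROM sys.indexes AS i
INNER JOIN sys.partition_schemes AS ps
  ON ps.data_space_id = i.data_space_id
INNER JOIN sys.partition_functions AS pf
  ON pf.function_id = ps.function_id
WHERE i.object_id = OBJECT_ID(N'dbo.VWTBL_INDICATOR');
```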

But from scratch, here is what I would check:

1. Check the Database for Corruption

If you aren't doing it routinely, do a DBCC CHECKDB on the database during off hours. If it returns an error, you may have a bigger problem.
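A minimal example of the check, where YourDatabase is a placeholder for the real database name:

```sql
-- NO_INFOMSGS suppresses informational output; ALL_ERRORMSGS reports every error found
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

If the command completes with no output, no corruption was found.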

2. Lock down your user security

Identify the types of access that different groups of people need, and give them the bare minimum necessary. You can do this on the database level (via roles), or at the individual table level (via explicit permissions).

Only running reports? Read only.

Doing data imports? INSERT, but not UPDATE or DELETE.
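As a rough sketch of the table-level approach, with made-up role and user names (ReportReaders, DataImporters, and the two users are all hypothetical, and the users must already exist in the database):

```sql
-- Report users: read only
CREATE ROLE ReportReaders;
GRANT SELECT ON dbo.VWTBL_INDICATOR TO ReportReaders;

-- Import processes: may add rows, may not change or remove them
CREATE ROLE DataImporters;
GRANT SELECT, INSERT ON dbo.VWTBL_INDICATOR TO DataImporters;
DENY UPDATE, DELETE ON dbo.VWTBL_INDICATOR TO DataImporters;

-- Hypothetical members
ALTER ROLE ReportReaders ADD MEMBER [ReportUser];
ALTER ROLE DataImporters ADD MEMBER [ImportUser];
```

Keep in mind that members of sysadmin bypass permission checks entirely, so this only helps if day-to-day accounts are not running with that level of access.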

3. Run a trace to watch database activity

You can run a Profiler Trace (or start a server-side trace) to see when the deletes occur. Add a filter for DELETE to reduce the number of rows captured.
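If you would rather not run Profiler, an Extended Events session can capture similar information. To be clear, this is an alternative to the trace described above, not the same mechanism, and the session name and file name here are made up. The text filter is crude: it only catches statements whose text contains DELETE followed by the table name, so dynamic SQL with different wording, cascading deletes, and TRUNCATE TABLE would slip past it.

```sql
CREATE EVENT SESSION TrackIndicatorDeletes ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
  -- Capture who ran the statement and from where
  ACTION (sqlserver.client_app_name,
          sqlserver.client_hostname,
          sqlserver.server_principal_name,
          sqlserver.database_name)
  WHERE sqlserver.like_i_sql_unicode_string(statement, N'%DELETE%VWTBL_INDICATOR%')
),
ADD EVENT sqlserver.sp_statement_completed
(
  -- Same filter for statements inside stored procedures
  ACTION (sqlserver.client_app_name,
          sqlserver.client_hostname,
          sqlserver.server_principal_name,
          sqlserver.database_name)
  WHERE sqlserver.like_i_sql_unicode_string(statement, N'%DELETE%VWTBL_INDICATOR%')
)
ADD TARGET package0.event_file (SET filename = N'TrackIndicatorDeletes.xel')
WITH (STARTUP_STATE = ON);
GO

ALTER EVENT SESSION TrackIndicatorDeletes ON SERVER STATE = START;
```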

4. Track deletes on the table

There are a few ways to track any delete statements that occur, discussed in this question. In your situation, it sounds like a table trigger would be the simplest solution.
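A bare-bones version of such a trigger might look like the following. The audit table, its column types, and the trigger name are all my own inventions, and it assumes dbo.VWTBL_INDICATOR is a base table with ShortCode and MonthYear columns (AFTER triggers can't be created on views).

```sql
-- Hypothetical audit table; adjust column types to match the real table
CREATE TABLE dbo.VWTBL_INDICATOR_DeleteAudit
(
  AuditID   int IDENTITY(1,1) PRIMARY KEY,
  ShortCode varchar(50)   NOT NULL,  -- assumed type
  MonthYear date          NOT NULL,  -- assumed type
  DeletedAt datetime2(0)  NOT NULL DEFAULT SYSDATETIME(),
  LoginName sysname       NOT NULL DEFAULT ORIGINAL_LOGIN(),
  HostName  nvarchar(128) NULL     DEFAULT HOST_NAME(),
  AppName   nvarchar(128) NULL     DEFAULT APP_NAME()
);
GO

CREATE TRIGGER dbo.trg_VWTBL_INDICATOR_Delete
ON dbo.VWTBL_INDICATOR
AFTER DELETE
AS
BEGIN
  SET NOCOUNT ON;

  -- "deleted" holds every row removed by the statement that fired the trigger
  INSERT dbo.VWTBL_INDICATOR_DeleteAudit (ShortCode, MonthYear)
  SELECT d.ShortCode, d.MonthYear
  FROM deleted AS d;
END
GO
```

Note that TRUNCATE TABLE and partition switching do not fire DELETE triggers, so an empty audit table does not by itself rule those out.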


There is no need to generate dates.


The following query will give you a list of SHORTCODES with no rows at all:

select SHORTCODE from shortcodes
except
select SHORTCODE from VWTBL_INDICATOR

The following query will give you the continuous ranges of MonthYear per SHORTCODE.

select      SHORTCODE
            ,min(MonthYear) as from_MonthYear
            ,max(MonthYear) as to_MonthYear
            ,count(*)       as months

from       (SELECT   SHORTCODE
                    ,MonthYear
                    ,row_number() over (partition by SHORTCODE order by MonthYear)  as rn

            From     VWTBL_INDICATOR
            ) t

group by    SHORTCODE
            ,DATEADD(month,-rn,MonthYear)   

order by    SHORTCODE
            ,from_MonthYear
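In case the grouping expression looks like magic: within a run of consecutive months, the month and the row number both go up by one each step, so subtracting one from the other gives the same value for the whole run. A gap bumps the month without bumping the row number, so the value changes. Here is a small self-contained illustration with made-up data (code 'A' has January through March 2016, then May and June, with April missing):

```sql
SELECT t.SHORTCODE,
       t.MonthYear,
       t.rn,
       DATEADD(month, -t.rn, t.MonthYear) AS island_key
FROM
(
  SELECT v.SHORTCODE,
         CAST(v.MonthYear AS date) AS MonthYear,
         CAST(ROW_NUMBER() OVER (PARTITION BY v.SHORTCODE
                                 ORDER BY v.MonthYear) AS int) AS rn
  FROM (VALUES ('A', '20160101'),
               ('A', '20160201'),
               ('A', '20160301'),
               ('A', '20160501'),
               ('A', '20160601')) AS v(SHORTCODE, MonthYear)
) AS t
ORDER BY t.SHORTCODE, t.MonthYear;

-- island_key is 2015-12-01 for the Jan, Feb and Mar rows
-- and 2016-01-01 for the May and Jun rows,
-- so grouping by it yields two ranges: Jan-Mar and May-Jun.
```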

If you wish you can use the following version which has an additional layer of information:

  • missing_from_MonthYear + missing_to_MonthYear: missing range in the middle
  • ranges: Number of ranges per SHORTCODE (ranges>1 means you have gaps in the middle)
  • range_seq: the sequential number of each SHORTCODE range
  • is_first: Indication for the first range per SHORTCODE (check from_MonthYear to see if you are missing preceding dates)
  • is_last: Indication for the last range per SHORTCODE (check to_MonthYear to see if you are missing following dates)

select      SHORTCODE
           ,from_MonthYear                                                                                  as exists_from_MonthYear
           ,to_MonthYear                                                                                    as exists_to_MonthYear
           ,dateadd (month,1,to_MonthYear)                                                                  as missing_from_MonthYear
           ,dateadd (month,-1,lead (from_MonthYear) over (partition by SHORTCODE order by from_MonthYear))  as missing_to_MonthYear
           ,count       (*) over (partition by SHORTCODE)                                                   as ranges
           ,row_number  ()  over (partition by SHORTCODE order by from_MonthYear)                           as range_seq
           ,case from_MonthYear when min(from_MonthYear) over (partition by SHORTCODE) then 1 end           as is_first
           ,case to_MonthYear   when max(to_MonthYear)   over (partition by SHORTCODE) then 1 end           as is_last

from       (select      SHORTCODE
                       ,min(MonthYear)  as from_MonthYear
                       ,max(MonthYear)  as to_MonthYear
                       ,count(*)        as months

            from       (SELECT      SHORTCODE
                                   ,MonthYear
                                   ,row_number() over (partition by SHORTCODE order by MonthYear)   as rn

                        From        VWTBL_INDICATOR
                        ) t

            group by    SHORTCODE
                       ,DATEADD(month,-rn,MonthYear)    
            ) t

order by    SHORTCODE
           ,from_MonthYear