Index scan of table with one record with 2.2 billion executions

Let's start by looking at the top right of the plan. That part calculates the OperatingDate column:

[screenshot: OperatingDate branch of the plan]

Since we get back 1.72 M rows from the outer row set, we can expect around 1.72 M index seeks against ix_days. That is indeed what happens. There are 478k rows for which CAST(o.[CreationDate] AS time) > '16:00:00', so the CASE expression sends 478k seeks down one branch and the rest down the other.
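Based on that description, the expression presumably has roughly this shape (a purely hypothetical reconstruction, since the full query wasn't posted; the cutoff logic and branch filters are assumptions):

-- Hypothetical sketch of the OperatingDate expression: orders created
-- after 16:00 roll over to a later operating day than earlier orders.
CASE WHEN CAST(o.[CreationDate] AS time) > '16:00:00'
     THEN (SELECT TOP 1 d.PKDate
           FROM [dbo].[Days] d
           WHERE d.is_weekend <> 1 AND d.is_holiday <> 1
             AND d.PKDate > CAST(o.[CreationDate] AS date)
           ORDER BY d.PKDate)
     ELSE (SELECT TOP 1 d.PKDate
           FROM [dbo].[Days] d
           WHERE d.is_weekend <> 1 AND d.is_holiday <> 1
             AND d.PKDate >= CAST(o.[CreationDate] AS date)
           ORDER BY d.PKDate)
END AS OperatingDate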

Note that the index you have isn't the most efficient one possible for this query. We can only do a seek predicate against PKDate. The rest of the filters are applied as a residual predicate, which means that the seek might traverse many rows before finding a match. I assume that most days in your calendar table aren't weekends or holidays, so it may not make a practical difference for this query. However, you could define an index on (is_weekend, is_holiday, PKDate). That should let you immediately seek to the first row that you want.

[screenshot: seek predicate versus residual predicate]

To make the point clearer, let's go through a simple example:

-- does a scan
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend <> 1 AND is_holiday <> 1
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

-- does a seek, reads 3 rows to return 1
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

-- create new index
CREATE NONCLUSTERED INDEX [ix_days_2] ON [dbo].[days]
(
    [is_weekend],
    [is_holiday],
    PkDate
)

-- does a seek, reads 1 row to return 1
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

DROP INDEX [days].[ix_days_2];
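If you want to verify those rows-read numbers yourself, the "Number of Rows Read" property in the actual execution plan shows them directly, and SET STATISTICS IO ON reports logical reads as a rough proxy:

-- Compare the work done by the two predicate styles; on a large calendar
-- table the <> 1 version should report noticeably more logical reads.
SET STATISTICS IO ON;

SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend <> 1 AND is_holiday <> 1
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

SET STATISTICS IO OFF;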

Let's get to the more interesting part, which is the branch that calculates the DeliveryDate column. I'll only include half of it:

[screenshot: DeliveryDate branch of the plan]

I suspect that what you hoped the optimizer would do is to calculate this as a scalar:

dateadd(day,isnull(
                  (select top 1 [operatingdays]
                  from [dbo].[CS]
                  where DefaultService = 1)
                 ,2)+1,Cast(o.[CreationDate] as date))

And then to use that value to do an index seek against ix_days. Unfortunately, the optimizer does not do that. It instead applies a row goal against the index and does a scan. For each row returned from the scan it checks whether the value matches the filter against [dbo].[CS], and the scan stops as soon as it finds one row that matches. SQL Server estimated that it would only pull back 3.33 rows on average from the scan until it found a match. If that were true then you'd see around 1.5 M executions against [dbo].[CS]. Instead the optimizer did 2 billion executions against the table, so the estimate was off by a factor of more than 1000.
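If you want to confirm that the row goal is what pushes the optimizer toward this plan shape, SQL Server 2016 SP1 and later let you disable row goals for a single statement with a documented hint (shown here on a stand-in query, since the full query wasn't posted):

-- Appending this hint removes the row goal from the optimization,
-- which may change how the optimizer costs the TOP-driven scan.
SELECT TOP 1 PKDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
ORDER BY PKDate
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));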

As a general rule you should carefully examine any scans on the inner side of a nested loop. Of course, there are some queries for which that is what you want. And just because you have a seek doesn't mean that the query will be efficient. For example, if a seek returns many rows there may not be that much difference from doing a scan. You didn't post the full query here, but I'll go over a few ideas which could help.

This query is a bit odd:

select top 1 [operatingdays]
from [dbo].[CS]
where DefaultService = 1

It is non-deterministic because you have TOP without an ORDER BY. However, the table itself has 1 row, so you always pull back the same value for each row from o. If possible, I would just try saving off the value of this query into a local variable and using that in the query instead. That should save you a total of 8 billion scans against [dbo].[CS], and I would expect to see an index seek instead of an index scan against ix_days. I was able to mock up some data on my machine. Here is part of the query plan:

[screenshot: good query plan 1]

Now we have all seeks and those seeks shouldn't process too many extra rows. However, the real query may be more complicated than that so perhaps you can't use a variable.
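If a variable is an option, a minimal sketch might look like this (the int data type and the shape of the rewritten filter are assumptions, since the full query wasn't posted):

-- Evaluate the one-row [dbo].[CS] lookup once, up front.
DECLARE @OperatingDays int;

SELECT TOP 1 @OperatingDays = [operatingdays]
FROM [dbo].[CS]
WHERE DefaultService = 1;

-- The main query can then reference the variable instead of the subquery:
-- WHERE PKDate > DATEADD(day, ISNULL(@OperatingDays, 2) + 1,
--                        CAST(o.[CreationDate] AS date))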

Let's say I write a different filter condition that doesn't use TOP and instead uses MIN. SQL Server is able to process that subquery in a more efficient way; TOP can prevent certain query transformations. Here is my subquery:

WHERE PKDate > dateadd(day,isnull(
                      (select MIN([operatingdays])
                       from [dbo].[CS]
                       where DefaultService = 1)
                      ,2), Cast(o.[CreationDate] as date))

Here is what the plan might look like:

[screenshot: good plan 2]

Now we'll only do around 1.5 million scans against the CS table. We also get a much more efficient index seek against the ix_days index, which is able to use the result of the subquery:

[screenshot: nice seek]

Of course, I'm not saying that you should rewrite your code to use that. It'll probably return incorrect results. The important point is that you can get the index seeks that you want with a subquery. You just need to write your subquery in the right way.

For one more example, let's assume that you absolutely need to keep the TOP operator in the subquery. It might be possible to add a redundant filter against PkDate to get better performance. I'm going to assume that the results of the subquery are non-negative and small. Under that assumption, this filter will be equivalent:

  PKDate > Cast(o.[CreationDate] as date) AND 
  PKDate > dateadd(day,isnull(
      (select top 1 [operatingdays]
      from [dbo].[CS]
      where DefaultService = 1)
     ,2)+1,Cast(o.[CreationDate] as date))

This changes the plan to use seeks:

[screenshot: seeks again]

It's important to realize that the seeks may return more than just one row. The key point is that SQL Server can start seeking at o.[CreationDate]. If there's a large gap in the dates then the index seek will process many extra rows and the query will not be as efficient.


Now to the question related to index scans and the number of executions: why the 2 billion? Or 6 billion?

You are getting those numbers from the nested loops join.

In its simplest form, a nested loops join compares each row from one table (known as the outer table) to each row from the other table (known as the inner table), looking for rows that satisfy the join predicate. (Note that the terms “inner” and “outer” are overloaded; their meaning must be inferred from context. “Inner table” and “outer table” refer to the inputs to the join. “Inner join” and “outer join” refer to the logical operations.)

We can express the algorithm in pseudo-code as:

for each row R1 in the outer table
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)

It’s the nesting of the for loops in this algorithm that gives nested loops join its name.

The total number of rows compared and, thus, the cost of this algorithm is proportional to the size of the outer table multiplied by the size of the inner table. Since this cost grows quickly as the sizes of the input tables grow, in practice we try to minimize the cost by reducing the number of inner rows that we must consider for each outer row.

In your case, this is one example of how you get about 2 billion records:

[screenshot]

Here is another showing how you get 5+ billion:

[screenshot]
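As back-of-the-envelope arithmetic, the first example is exactly the outer-times-inner multiplication from the pseudo-code above: roughly 1.5 million outer rows, each driving on the order of 1,300 executions of the inner side, comes to about 2 billion executions in total.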

A few links about how you can avoid large nested loop joins:

  1. How to optimize a query that's running slow on Nested Loops (Inner Join)
  2. https://stackoverflow.com/questions/28441468/why-are-nested-loops-chosen-causing-long-execution-time-for-self-join
  3. https://www.littlekendra.com/2016/09/06/estimated-vs-actual-number-of-rows-in-nested-loop-operators/