Query 100x slower in SQL Server 2014, Row Count Spool row estimate the culprit?

Why does this query need a Row Count Spool operator? ... what specific optimization is it trying to provide?

The cust_nbr column in #existingCustomers is nullable. If it actually contains any NULLs, the correct result here is to return zero rows (NOT IN (NULL, ...) always yields an empty result set).
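As a quick standalone illustration of the three-valued logic involved (not part of the original repro, just a demonstration):

-- 1 NOT IN (2, 3, NULL) expands to 1 <> 2 AND 1 <> 3 AND 1 <> NULL,
-- which evaluates to UNKNOWN, so the candidate row is filtered out.
SELECT CASE
           WHEN 1 NOT IN (2, 3, NULL) THEN 'row returned'
           ELSE 'row filtered out'   -- this branch is taken
       END AS three_valued_logic_demo;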

So the query can be thought of as

SELECT p.*
FROM   #potentialNewCustomers p
WHERE  NOT EXISTS (SELECT *
                   FROM   #existingCustomers e1
                   WHERE  p.cust_nbr = e1.cust_nbr)
       AND NOT EXISTS (SELECT *
                       FROM   #existingCustomers e2
                       WHERE  e2.cust_nbr IS NULL) 

The row count spool is there to avoid having to evaluate the

EXISTS (SELECT *
        FROM   #existingCustomers e2
        WHERE  e2.cust_nbr IS NULL) 

more than once.

This just seems to be a case where a small difference in assumptions can make quite a catastrophic difference in performance.

After updating a single row as below...

UPDATE #existingCustomers
SET    cust_nbr = NULL
WHERE  cust_nbr = 1;

... the query completed in less than a second. The row counts in actual and estimated versions of the plan are now nearly spot on.

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT *
FROM   #potentialNewCustomers
WHERE  cust_nbr NOT IN (SELECT cust_nbr
                        FROM   #existingCustomers)

Execution plan

Zero rows are output as described above.

The statistics histograms and auto-update thresholds in SQL Server are not granular enough to detect this kind of single-row change. Arguably, if the column is nullable, it might be reasonable to work on the basis that it contains at least one NULL, even if the statistics histogram doesn't currently indicate that there are any.
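To see what the statistics actually record about NULLs, you can inspect the histogram directly. A sketch (the statistics name ix_cust_nbr is a placeholder for whatever index or auto-created statistics object covers cust_nbr):

-- If the histogram knows about NULLs, its first step typically has
-- RANGE_HI_KEY = NULL with EQ_ROWS equal to the number of NULL rows.
DBCC SHOW_STATISTICS ('tempdb..#existingCustomers', ix_cust_nbr)
    WITH HISTOGRAM;

-- Compare against the live data to see whether the histogram is stale.
SELECT COUNT(*) AS null_rows
FROM   #existingCustomers
WHERE  cust_nbr IS NULL;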


Why does this query need a Row Count Spool operator? I don't think it's necessary for correctness, so what specific optimization is it trying to provide?

See Martin's thorough answer for this question. The key point is that if a single row within the NOT IN is NULL, the boolean logic works out such that "the correct response is to return zero rows". The Row Count Spool operator is optimizing this (necessary) logic.

Why does SQL Server estimate that the join to the Row Count Spool operator removes all rows?

Microsoft provides an excellent white paper on the SQL 2014 Cardinality Estimator. In this document, I found the following information:

The new CE assumes that the queried values do exist in the dataset even if the value falls out of the range of the histogram. The new CE in this example uses an average frequency that is calculated by multiplying the table cardinality by the density.

Often, such a change is a very good one; it greatly alleviates the ascending key problem and typically yields a more conservative query plan (higher row estimate) for values that are out-of-range based on the statistics histogram.
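For reference, that average frequency can be reproduced from the density vector of the relevant statistics object (a sketch, using the same placeholder statistics name as above):

-- "All density" for cust_nbr; the new CE's per-value estimate for an
-- assumed or out-of-range value is roughly table cardinality * density.
DBCC SHOW_STATISTICS ('tempdb..#existingCustomers', ix_cust_nbr)
    WITH DENSITY_VECTOR;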

However, in this specific case, assuming that a NULL value will be found leads to the assumption that joining to the Row Count Spool will filter out all rows from #potentialNewCustomers. In the case where there in fact is a NULL row, this is a correct estimate (as seen in Martin's answer). However, in the case where there happens not to be a NULL row, the effect can be devastating because SQL Server produces a post-join estimate of 1 row regardless of how many input rows appear. This can lead to very poor join choices in the remainder of the query plan.
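One way to confirm that the new Cardinality Estimator is responsible is to re-run the query under the legacy estimator using the trace flags documented in that white paper (a sketch; QUERYTRACEON generally requires sysadmin-level permissions):

-- Trace flag 9481 forces the pre-2014 (legacy) cardinality estimator for this
-- query only; 2312 does the reverse on a lower compatibility level.
SELECT *
FROM   #potentialNewCustomers
WHERE  cust_nbr NOT IN (SELECT cust_nbr
                        FROM   #existingCustomers)
OPTION (QUERYTRACEON 9481);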

Is this a bug in SQL 2014? If so, I'll file in Connect. But I'd like a deeper understanding first.

I think it's in the grey area between a bug and a performance-impacting assumption or limitation of SQL Server's new Cardinality Estimator. However, this quirk can cause substantial regressions in performance relative to SQL 2012 in the specific case of a nullable NOT IN clause that doesn't happen to have any NULL values.

Therefore, I have filed a Connect issue so that the SQL team is aware of the potential implications of this change to the Cardinality Estimator.

Update: We're on CTP3 now for SQL16, and I confirmed that the problem does not occur there.


Martin Smith's answer and your self-answer have addressed all the main points correctly; I just want to emphasise one area for future readers:

So this question is more about understanding this specific query and plan in depth and less about how to phrase the query differently.

The stated purpose of the query is:

-- Prune any existing customers from the set of potential new customers

This requirement is easy to express in SQL, in several ways. Which one is chosen is as much a matter of style as anything else, but the query specification should still be written to return correct results in all cases. This includes accounting for nulls.

Expressing the logical requirement fully:

  • Return potential customers who are not already customers
  • List each potential customer at most once
  • Exclude null potential and existing customers (whatever a null customer means)

We can then write a query matching those requirements using whichever syntax we prefer. For example:

WITH DistinctPotentialNonNullCustomers AS
(
    SELECT DISTINCT 
        PNC.cust_nbr 
    FROM #potentialNewCustomers AS PNC
    WHERE 
        PNC.cust_nbr IS NOT NULL
)
SELECT
    DPNNC.cust_nbr
FROM DistinctPotentialNonNullCustomers AS DPNNC
WHERE
    DPNNC.cust_nbr NOT IN
    (
        SELECT 
            EC.cust_nbr 
        FROM #existingCustomers AS EC 
        WHERE 
            EC.cust_nbr IS NOT NULL
    );

This produces an efficient execution plan, which returns correct results:

Execution plan

We can express the NOT IN as <> ALL or NOT = ANY without affecting the plan or results:

WITH DistinctPotentialNonNullCustomers AS
(
    SELECT DISTINCT 
        PNC.cust_nbr 
    FROM #potentialNewCustomers AS PNC
    WHERE 
        PNC.cust_nbr IS NOT NULL
)
SELECT
    DPNNC.cust_nbr
FROM DistinctPotentialNonNullCustomers AS DPNNC
WHERE
    DPNNC.cust_nbr <> ALL
    (
        SELECT 
            EC.cust_nbr 
        FROM #existingCustomers AS EC 
        WHERE 
            EC.cust_nbr IS NOT NULL
    );

WITH DistinctPotentialNonNullCustomers AS
(
    SELECT DISTINCT 
        PNC.cust_nbr 
    FROM #potentialNewCustomers AS PNC
    WHERE 
        PNC.cust_nbr IS NOT NULL
)
SELECT
    DPNNC.cust_nbr
FROM DistinctPotentialNonNullCustomers AS DPNNC
WHERE
    NOT DPNNC.cust_nbr = ANY
    (
        SELECT 
            EC.cust_nbr 
        FROM #existingCustomers AS EC 
        WHERE 
            EC.cust_nbr IS NOT NULL
    );

Or using NOT EXISTS:

WITH DistinctPotentialNonNullCustomers AS
(
    SELECT DISTINCT 
        PNC.cust_nbr 
    FROM #potentialNewCustomers AS PNC
    WHERE 
        PNC.cust_nbr IS NOT NULL
)
SELECT
    DPNNC.cust_nbr
FROM DistinctPotentialNonNullCustomers AS DPNNC
WHERE 
    NOT EXISTS
    (
        SELECT * 
        FROM #existingCustomers AS EC
        WHERE
            EC.cust_nbr = DPNNC.cust_nbr
            AND EC.cust_nbr IS NOT NULL
    );

There is nothing magic about this, or anything particularly objectionable about using IN, ANY, or ALL - we just need to write the query correctly, so it will always produce the right results.

The most compact form uses EXCEPT:

SELECT 
    PNC.cust_nbr 
FROM #potentialNewCustomers AS PNC
WHERE 
    PNC.cust_nbr IS NOT NULL
EXCEPT
SELECT
    EC.cust_nbr 
FROM #existingCustomers AS EC
WHERE 
    EC.cust_nbr IS NOT NULL;

This produces correct results as well, though the execution plan may be less efficient due to the absence of bitmap filtering:

Non-bitmap execution plan

The original question is interesting because it exposes a performance-affecting problem with the necessary null check implementation. The point of this answer is that writing the query correctly avoids the issue as well.