Why does LEN() function badly underestimate cardinality in SQL Server 2014?

For the legacy CE, the estimate is 3.16228 % of the rows – a "magic number" heuristic used for column = literal predicates when no usable statistics exist (there are other heuristics based on predicate construction, but wrapping LEN around the column leaves the legacy CE with nothing to use, and the result matches this guess framework). For examples, see Joe Sack's post on Selectivity Guesses in the absence of Statistics and Ian Jose's post on Constant-Constant Comparison Estimation.
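As a sanity check on that guess framework (an assumption on my part that the equality guess here is the cardinality^0.75 formula described in those posts), the numbers line up for the 1,000,000-row test table used below:

```sql
-- Legacy CE equality guess with no usable stats (assumed: rows ^ 0.75).
-- For a 1,000,000-row table this reproduces both numbers quoted above:
SELECT POWER(1000000e0,  0.75) AS estimated_rows, -- 31622.8
       POWER(1000000e0, -0.25) AS selectivity;    -- 0.0316228 = 3.16228 %
```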

-- Legacy CE: 31622.8 rows
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  ( QUERYTRACEON 9481); -- Legacy CE
GO

As for the new CE behavior, the LEN(cust_nbr) expression is apparently now visible to the optimizer, which means it can use statistics. I went through the exercise of looking at the calculator output below, and you can look at the associated auto-generated statistics as a pointer:

-- New CE: 1.00007 rows
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  ( QUERYTRACEON 2312 ); -- New CE
GO

-- View the new CE calculator output with TF 2363 (for a supported alternative, use Extended Events)
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  (QUERYTRACEON 2312, QUERYTRACEON 2363, QUERYTRACEON 3604, RECOMPILE); -- New CE
GO

/*
Loaded histogram for column QCOL:
[tempdb].[dbo].[#customers].cust_nbr from stats with id 2
Using ambient cardinality 1e+006 to combine distinct counts:
  999927
 
Combined distinct count: 999927
Selectivity: 1.00007e-006
Stats collection generated:
  CStCollFilter(ID=2, CARD=1.00007)
      CStCollBaseTable(ID=1, CARD=1e+006 TBL: #customers)
 
End selectivity computation
*/
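Reading the trace output above: the new CE takes the combined distinct count and derives an equality selectivity of one-over-distinct-values, which it then applies to the base table cardinality. The arithmetic checks out:

```sql
-- How the new CE arrives at CARD = 1.00007 in the 2363 output above:
-- selectivity = 1 / (combined distinct count),
-- estimate    = selectivity * base table cardinality
SELECT 1e0 / 999927             AS selectivity,     -- 1.00007e-006
       1e0 / 999927 * 1000000e0 AS estimated_rows;  -- 1.00007
```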
 
EXEC tempdb..sp_helpstats '#customers';


-- Check the AVG_RANGE_ROWS values (for example, plenty of ~1)
DBCC SHOW_STATISTICS('tempdb..#customers', '_WA_Sys_00000001_B0368087');
-- That is the auto-created statistics name on my instance; yours will differ

Unfortunately the logic relies on an estimate of the number of distinct values, which is not adjusted for the effect of the LEN function.

Possible workaround

You can get a trie-based estimate under both CE models by rewriting the LEN predicate as a LIKE:

SELECT COUNT_BIG(*)
FROM #customers AS C
WHERE C.cust_nbr LIKE REPLICATE('_', 6);

[LIKE plan screenshot]


Information on Trace Flags used:

  • 2363: shows a lot of information, including statistics being loaded.
  • 3604: prints the output of DBCC commands to the messages tab.

Is there an explanation for the cardinality estimate of 1.00007 for SQL 2014 while SQL 2012 estimates 31,622 rows?

I think @Zane's answer covers this part pretty well.

Is there a good workaround?

You could try creating a Non-Persisted Computed Column for LEN(cust_nbr) and (optionally) create a Non-Clustered Index on that Computed Column. That should get you accurate stats.
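A minimal sketch of that workaround (the column name LenCustNbr is illustrative, not from the original test):

```sql
-- Add a non-persisted computed column for the expression
-- (LenCustNbr is a hypothetical name):
ALTER TABLE #customers
    ADD LenCustNbr AS LEN(cust_nbr);

-- Or, the variation that performed best in my tests:
-- ALTER TABLE #customers ADD LenCustNbr AS LEN(cust_nbr) PERSISTED;

-- The optimizer can now auto-create statistics on the computed column
-- and match the expression in the predicate:
SELECT COUNT(*)
FROM   #customers
WHERE  LEN(cust_nbr) = 6;
```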

I did some testing and here is what I found:

  • Statistics were auto-created on the Non-Persisted Computed Column, when no index was defined on it.
  • Adding the Non-Clustered Index on the Computed Column not only didn't help, it actually hurt performance a little. Slightly higher CPU and elapsed times. Slightly higher estimated cost (whatever that's worth).
  • Making the Computed Column as PERSISTED (no Index) was better than the other two variations. Estimated Rows was more accurate. CPU and elapsed time were better (as expected since it didn't have to calculate anything per-row).
  • I was unable to create a Filtered Index or Filtered Statistics on the Computed Column (due to it being computed), even if it was PERSISTED :-(