Cardinality Estimate for LIKE operator (Local Variables)

The guess for LIKE in your case is based on:

  • G: The standard 9% guess (sqllang!x_Selectivity_Like)
  • M: A factor of 6 (magic number)
  • D: Average data length in bytes (from statistics), rounded down to integer

Specifically, sqllang!CCardUtilSQL7::ProbLikeGuess uses:

Selectivity (S) = G / M * LOG(D)

Notes:

  • LOG is the natural logarithm (base e).
  • The LOG(D) term is omitted if D is between 1 and 2.
  • If D is less than 1 (including for missing or NULL statistics):
    D = FLOOR(0.5 * maximum column byte length)

This sort of quirkiness and complexity is quite typical of the original CE.

In the question example, the average length is 5 (5.6154 from DBCC SHOW_STATISTICS rounded down):

Estimate = 10,000 * (0.09 / 6 * LOG(5)) = 241.416

Other example values:

 D  = Estimate using formula for S
 15 = 406.208
 14 = 395.859
 13 = 384.742
 12 = 372.736
 11 = 359.684
 10 = 345.388
 09 = 329.584
 08 = 311.916
 07 = 291.887
 06 = 268.764
 05 = 241.416
 04 = 207.944
 03 = 164.792
 02 = 150.000 (LOG not used)
 01 = 150.000 (LOG not used)
 00 = 291.887 (LOG 7) /* FLOOR(0.5 * 15) [15 since lastname is varchar(15)] */
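
The calculation is easy to reproduce. Here is a minimal T-SQL sketch of the model above (the variable names are mine, not the optimizer's):

DECLARE
    @Rows float = 10000,  -- table cardinality
    @G float = 0.09,      -- standard LIKE guess
    @M float = 6,         -- magic number
    @MaxLen integer = 15; -- maximum column byte length (varchar(15))

SELECT
    V.D,
    CASE
        WHEN V.D < 1 THEN @Rows * (@G / @M) * LOG(FLOOR(0.5 * @MaxLen)) -- missing/NULL stats
        WHEN V.D <= 2 THEN @Rows * (@G / @M)                            -- LOG term omitted
        ELSE @Rows * (@G / @M) * LOG(V.D)                               -- LOG = natural log
    END AS Estimate
FROM (VALUES (0), (1), (2), (5), (7), (15)) AS V (D);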

Test rig

DECLARE
    @CharLength integer = 5, -- Set length here
    @Counter integer = 1;

CREATE TABLE #T (c1 varchar(15) NULL);

-- Add 10,000 rows
SET NOCOUNT ON;
SET STATISTICS XML OFF;

BEGIN TRANSACTION;
WHILE @Counter <= 10000
BEGIN
    INSERT #T (c1) VALUES (REPLICATE('X', @CharLength));
    SET @Counter = @Counter + 1;
END;
COMMIT TRANSACTION;

SET NOCOUNT OFF;
SET STATISTICS XML ON;

-- Test query: @Like is intentionally left NULL; with a local variable the
-- optimizer cannot sniff a value, so the LIKE guess described above is used
DECLARE @Like varchar(15);
SELECT * FROM #T AS T 
WHERE T.c1 LIKE @Like;

DROP TABLE #T;

I tested on SQL Server 2014 with the legacy CE and did not get 9% as a cardinality estimate either. I couldn't find anything accurate online, so I did some testing. I found a model that fits all of the test cases I tried, but I can't be sure that it's complete.

In the model that I found, the estimate is derived from the number of rows in the table, the average key length of the statistics for the filtered column, and sometimes the datatype length of the filtered column. There are two different formulas used for the estimation.

If FLOOR(average key length) = 0, the estimation formula ignores the column statistics and builds an estimate from the length of the column's data type. I only tested with VARCHAR(N), so it's possible that there's a different formula for NVARCHAR(N). Here is the formula for VARCHAR(N):

(row estimate) = (rows in table) * (-0.004869 + 0.032649 * log10(length of data type))

This has a very nice fit, but it's not perfectly accurate:

[first formula graph]

The x-axis is the length of the data type and the y-axis is the number of estimated rows for a table with 1 million rows.

The query optimizer would use this formula if you did not have statistics on the column or if the column has enough NULL values to drive the average key length to below 1.

For example, suppose that you had a table with 150k rows with filtering on a VARCHAR(50) and no column statistics. The row estimate prediction is:

150000 * (-0.004869 + 0.032649 * log10(50)) = 7590.1 rows
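
The same arithmetic can be checked directly in T-SQL (LOG10 is the base-10 logarithm):

SELECT 150000 * (-0.004869 + 0.032649 * LOG10(50)) AS PredictedRows; -- ≈ 7590.1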

SQL to test it:

CREATE TABLE X_CE_LIKE_TEST_1 (
    STRING VARCHAR(50)
);

-- Statistics created while the table is empty; NORECOMPUTE prevents an
-- automatic update, so the average key length stays at 0
CREATE STATISTICS X_STAT_CE_LIKE_TEST_1 ON X_CE_LIKE_TEST_1 (STRING) WITH NORECOMPUTE;

WITH
    L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
    L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
    L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
    L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
    L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B CROSS JOIN L2 C),
    NUMS AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NUM FROM L4)  
    INSERT INTO X_CE_LIKE_TEST_1 WITH (TABLOCK) (STRING)
    SELECT TOP (150000) 'ZZZZZ'
    FROM NUMS
    ORDER BY NUM;

DECLARE @LastName VARCHAR(15) = 'BA%';
SELECT * FROM X_CE_LIKE_TEST_1
WHERE STRING LIKE @LastName;

SQL Server gives an estimated row count of 7242.47, which is reasonably close to the prediction of 7590.1.

If FLOOR(average key length) >= 1 then a different formula is used, based on the value of FLOOR(average key length). Here is a table of some of the values that I tried:

FLOOR(avg key length)    Selectivity
 1                       1.5%
 2                       1.5%
 3                       1.64792%
 4                       2.07944%
 5                       2.41416%
 6                       2.68764%
 7                       2.91887%
 8                       3.11916%
 9                       3.29584%
10                       3.45388%
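
For what it's worth, these measured percentages line up with the S = G / M * LOG(D) formula from the first section. A quick check:

SELECT V.D, 0.09 / 6 * LOG(V.D) * 100 AS SelectivityPct
FROM (VALUES (3), (5), (10)) AS V (D);
-- 1.64792..., 2.41416..., 3.45388... (matching the table above)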

If FLOOR(average key length) < 6 then use the table above. Otherwise use the following equation:

(row estimate) = (rows in table) * (-0.003381 + 0.034539 * log10(FLOOR(average key length)))

This one has a better fit than the other one, but it's still not perfectly accurate.

[second formula graph]

The x-axis is the average key length and the y-axis is the number of estimated rows for a table with 1 million rows.
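
Plugging a value outside the table into the fitted formula, say FLOOR(average key length) = 20 on a 1 million row table:

SELECT 1000000 * (-0.003381 + 0.034539 * LOG10(20)) AS PredictedRows; -- ≈ 41555

Since the formula is only an approximate fit, the real estimate may differ slightly.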

To give another example, suppose that you had a table with 10k rows and an average key length of 5.5 for the statistics on the filtered column. FLOOR(5.5) = 5, so the table above gives a selectivity of 2.41416%. The row estimate would be:

10000 * 0.0241416 = 241.416 rows.

SQL to test it:

CREATE TABLE X_CE_LIKE_TEST_2 (
    STRING VARCHAR(50)
);

WITH
    L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
    L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
    L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
    L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
    L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B CROSS JOIN L2 C),
    NUMS AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NUM FROM L4)  
    INSERT INTO X_CE_LIKE_TEST_2 WITH (TABLOCK) (STRING)
    SELECT TOP (10000) 
    CASE 
      WHEN NUM % 2 = 1 THEN REPLICATE('Z', 5) 
      ELSE REPLICATE('Z', 6)
    END
    FROM NUMS
    ORDER BY NUM;

-- FULLSCAN over the loaded rows gives an average key length of exactly
-- (5 + 6) / 2 = 5.5
CREATE STATISTICS X_STAT_CE_LIKE_TEST_2 ON X_CE_LIKE_TEST_2 (STRING) 
WITH NORECOMPUTE, FULLSCAN;

DECLARE @LastName VARCHAR(15) = 'BA%';
SELECT * FROM X_CE_LIKE_TEST_2
WHERE STRING LIKE @LastName;
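
To verify the average key length that feeds the model, inspect the statistics header; the "Average key length" column should show 5.5 for this data (half the rows have length 5 and half have length 6):

DBCC SHOW_STATISTICS ('X_CE_LIKE_TEST_2', 'X_STAT_CE_LIKE_TEST_2') WITH STAT_HEADER;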

The row estimate is 241.416, which matches what you have in the question. There would be some error if I had used an average key length whose FLOOR value is not in the table above.

The models here aren't perfect but I think that they illustrate the general behavior pretty well.