Strange behaviour with sample sizes for statistics updates

Background

Data for the statistics object are gathered using a statement of the form:

SELECT 
    StatMan([SC0], [SC1], [SB0000]) 
FROM 
(
    SELECT TOP 100 PERCENT 
        [SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000]
    FROM 
    (
        SELECT 
            [TextValue] AS [SC0], 
            [Id] AS [SC1] 
        FROM [dbo].[Test] 
            TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) 
            WITH (READUNCOMMITTED) 
    ) AS _MS_UPDSTATS_TBL_HELPER 
    ORDER BY 
        [SC0], 
        [SC1], 
        [SB0000] 
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)

You can collect this statement with Extended Events or Profiler (SP:StmtCompleted).

Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages.

The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics.

Random numbers

SQL Server uses a random number generator to decide if a page qualifies or not. The generator used in this instance is the Lehmer random number generator with parameter values as shown below:

Xnext = Xseed * 75 mod (231 - 1)

The value of Xseed is computed as the sum of:

  • The low integer part of the (bigint) base table's partition_id e.g.

    SELECT
        P.[partition_id] & 0xFFFFFFFF
    FROM sys.partitions AS P
    WHERE
        P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND P.index_id = 1;
    
  • The value specified in the REPEATABLE clause

    • For sampled UPDATE STATISTICS, the REPEATABLE value is 1.
    • This value is exposed in the m_randomSeed element of the access method's internal debugging information shown in execution plans when trace flag 8666 is enabled, for example <Field FieldName="m_randomSeed" FieldValue="1" />

For SQL Server 2012, this calculation occurs in sqlmin!UnOrderPageScanner::StartScan:

mov     edx,dword ptr [rcx+30h]
add     edx,dword ptr [rcx+2Ch]

where memory at [rcx+30h] contains the low 32 bits of the partition id and memory at [rcx+2Ch] contains the REPEATABLE value in use.

The random number generator is initialized later in the same method, calling sqlmin!RandomNumGenerator::Init, where the instruction:

imul    r9d,r9d,41A7h

...multiplies the seed by 41A7 hex (16807 decimal = 75) as shown in the equation above.

Later random numbers (for individual pages) are generated using the same basic code inlined into sqlmin!UnOrderPageScanner::SetupSubScanner.

StatMan

For the example StatMan query shown above, the same pages will be collected as for the T-SQL statement:

SELECT 
    COUNT_BIG(*) 
FROM dbo.Test AS T 
    TABLESAMPLE SYSTEM (2.223684e+001 PERCENT)  -- Same sample %
    REPEATABLE (1)                              -- Always 1 for statman
    WITH (INDEX(0));                            -- Scan base object

This will match the output of:

SELECT 
    DDSP.rows_sampled
FROM sys.stats AS S
CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP
WHERE 
    S.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    AND S.[name] = N'IX_Test_TextValue';

Edge case

One consequence of using the MINSTD Lehmer random number generator is that seed values zero and int.max should not be used as this will result in the algorithm producing a sequence of zeroes (selecting every page).

The code detects zero, and uses a value from the system 'clock' as the seed in that case. It does not do the same if the seed is int.max (0x7FFFFFFF = 231 - 1).

We can engineer this scenario since the initial seed is calculated as the sum of the low 32 bits of the partition id and the REPEATABLE value. The REPEATABLE value that will result in the seed being int.max and therefore every page being selected for sample is:

SELECT
    0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
    P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    AND P.index_id = 1;

Working that into a complete example:

DECLARE @SQL nvarchar(4000) = 
    N'
    SELECT
        COUNT_BIG(*) 
    FROM dbo.Test AS T 
        TABLESAMPLE (0 PERCENT) 
        REPEATABLE (' +
        (
            SELECT TOP (1)
                CONVERT(nvarchar(11), 0x7FFFFFFF - P.[partition_id] & 0xFFFFFFFF)
            FROM sys.partitions AS P
            WHERE
                P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
                AND P.index_id = 1
        ) + ')
        WITH (INDEX(0));';

PRINT @SQL;
--EXECUTE (@SQL);

That will select every row on every page whatever the TABLESAMPLE clause says (even zero percent).


This is an excellent question! I'll start with what I know for sure and then move onto speculation. Lots of details about this in my blog post here.

Sampled stats updates use TABLESAMPLE behind the scenes. It's pretty easy to find documentation about that online. However, I believe it's not well known that the rows returned by TABLESAMPLE partially depend on the hobt_id of the object. When you drop and recreate the object you get a new hobt_id so the rows returned by random sampling are different.

If you delete and reinsert the data the hobt_id stays the same. As long as the data is laid out in the same way on disk (an allocation order scan returns the same results in the same order) then the sampled data should not change.

You can also change the number of rows sampled by rebuilding the clustered index on the table. For example:

UPDATE STATISTICS dbo.Test IX_Test_TextValue;

DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER; -- 273862 rows

ALTER INDEX PK_Test on Test REBUILD;

UPDATE STATISTICS dbo.Test IX_Test_TextValue;

DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER; -- 273320 rows

As to why that happens, I believe it's because SQL Server scans the clustered index instead of the nonclustered index when gathering sampled stats on an index. I also think that there's a hidden (to those of us tracing the hidden stats update queries) value for REPEATABLE used with TABLESAMPLE. I haven't proven any of that, but it explains why your histogram and the rows sampled changes with a rebuild of the clustered index.


I forgot how TABLESAMPLE worked in terms of assigning a random probability per page. - Martin Smith

I saw this in Inside Microsoft SQL Server 2008: T-SQL Querying by Itzik Ben-Gan and I cannot add it as a comment so I post it here, I think it's interesting to others too:

enter image description here

See also Sampling Using TABLESAMPLE by Roji. P. Thomas.