SUM of DATALENGTHs not matching table size from sys.allocation_units

Please note that the following info is not intended to be a comprehensive description of how data pages are laid out, such that one can calculate the number of bytes used per any set of rows, as that is very complicated.

Data is not the only thing taking up space on an 8k data page:

  • There is reserved space. You are only allowed to use 8060 of the 8192 bytes (that's 132 bytes that were never yours in the first place):

    • Page header: This is exactly 96 bytes.
    • Slot array: this is 2 bytes per row and indicates the offset of where each row starts on the page. The size of this array is not limited to the remaining 36 bytes (132 - 96 = 36), else you would be effectively limited to only putting 18 rows max on a data page. This means that each row is 2 bytes larger than you think it is. This value is not included in the "record size" as reported by DBCC PAGE, which is why it is kept separate here instead of being included in the per-row info below.
    • Per-Row meta-data (including, but not limited to):
      • The size varies depending on the table definition (i.e. number of columns, variable-length or fixed-length, etc). Info taken from @PaulWhite's and @Aaron's comments (found in the discussion related to this answer) and from additional testing.
      • Row-header: 4 bytes, 2 of them denoting the record type, and the other two being an offset to the NULL Bitmap
      • Number of columns: 2 bytes
      • NULL Bitmap: which columns are currently NULL. 1 byte per each set of 8 columns. And for all columns, even the NOT NULL ones. Hence, minimum 1 byte.
      • Variable-length column offset array: 4 bytes minimum. 2 bytes to hold the number of variable-length columns, and then 2 bytes per each variable-length column to hold the offset to where it starts.
      • Versioning Info: 14 bytes (this will be present if your database is set to either ALLOW_SNAPSHOT_ISOLATION ON or READ_COMMITTED_SNAPSHOT ON).
    • Please see the following Question and Answer for more details on this: Slot Array and Total Page Size
    • Please see the following blog post from Paul Randall which has several interesting details on how the data pages are laid out: Poking about with DBCC PAGE (Part 1 of ?)
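
    • To see this per-row overhead for yourself, you can dump a single page with DBCC PAGE (a sketch only; the file_id and page_id values are placeholders you would get from DBCC IND or sys.dm_db_database_page_allocations, both mentioned below):

      -- Print option 3 includes the per-slot details, so you can see the "Record Size"
      -- and other per-row overhead described above. With TABLERESULTS the output comes
      -- back as a result set, so no trace flag is needed.
      DBCC PAGE(N'db_name', 1, 12345, 3) WITH TABLERESULTS;
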
  • LOB pointers for data that is not stored in row. So that would account for DATALENGTH + pointer_size. But these are not of a standard size. Please see the following blog post for details on this complex topic: What is the Size of the LOB Pointer for (MAX) Types Like Varchar, Varbinary, Etc?. Between that linked post and some additional testing I have done, the (default) rules should be as follows:

    • Legacy / deprecated LOB types that nobody should be using anymore as of SQL Server 2005 (TEXT, NTEXT, and IMAGE):
      • By default, always store their data on LOB pages and always use a 16-byte pointer to LOB storage.
      • IF sp_tableoption was used to set the text in row option, then:
        • if there is room on the page to store the value, and the value is not greater than the max in-row size (configurable range of 24 - 7000 bytes, with a default of 256), then it will be stored in-row,
        • else it will be a 16-byte pointer.
    • For the newer LOB types introduced in SQL Server 2005 (VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX)):
      • By default:
        • If the value is not greater than 8000 bytes, and there is room on the page, then it will be stored in-row.
        • Inline Root — for data between 8001 and 40,000 (really 42,000) bytes, space permitting, there will be 1 to 5 pointers (24 - 72 bytes) IN ROW that point directly to the LOB page(s). 24 bytes for the initial 8k LOB page, and 12 bytes per each additional 8k page for up to four more 8k pages.
        • TEXT_TREE — for data over 42,000 bytes, or if the 1 to 5 pointers can't fit in-row, then there will be just a 24 byte pointer to the starting page of a list of pointers to the LOB pages (i.e. the "text_tree" page).
      • IF sp_tableoption was used to set the large value types out of row option, then always use a 16-byte pointer to LOB storage.
    • I said "default" rules because I did not test in-row values against the impact of certain features such as Data Compression, Column-level Encryption, Transparent Data Encryption, Always Encrypted, etc.
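
    • If you want to check (or change) those two sp_tableoption settings for a given table, something like the following should work (a sketch; the table name is a placeholder):

      -- Current settings (0 / 0 means neither option has been set):
      SELECT  st.[name], st.[text_in_row_limit], st.[large_value_types_out_of_row]
      FROM    sys.tables st
      WHERE   st.[name] = N'table_name';

      -- Change them (the same two options described above):
      EXEC sys.sp_tableoption N'dbo.table_name', 'text in row', '256';
      EXEC sys.sp_tableoption N'dbo.table_name', 'large value types out of row', 'ON';
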
  • LOB overflow pages: If a value is 10k, then that will require 1 full 8k page of overflow, and then part of a 2nd page. If no other data can take up the remaining space (or is even allowed to, I am unsure of that rule), then you have approx 6kb of "wasted" space on that 2nd LOB overflow datapage.

  • Unused space: An 8k data page is just that: 8192 bytes. It does not vary in size. The data and meta-data placed on it, however, does not always fit nicely into all 8192 bytes. And rows cannot be split onto multiple data pages. So if you have 100 bytes remaining but no row (or no row that would fit in that location, depending on several factors) can fit there, the data page is still taking up 8192 bytes, and your 2nd query is only counting the number of data pages. You can find this value in two places (just keep in mind that some portion of this value is some amount of that reserved space):

    • DBCC PAGE( db_name, file_id, page_id ) WITH TABLERESULTS; Look for ParentObject = "PAGE HEADER:" and Field = "m_freeCnt". The Value field is the number of unused bytes.
    • SELECT buff.free_space_in_bytes FROM sys.dm_os_buffer_descriptors buff WHERE buff.[database_id] = DB_ID(N'db_name') AND buff.[page_id] = page_id; This is the same value as reported by "m_freeCnt". This is easier than DBCC since it can get many pages, but also requires that the pages have been read into the buffer pool in the first place.
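
    • For example, to get a rough total of that unused space across all of this database's pages that are currently in the buffer pool (a sketch only; pages not in memory are simply not counted):

      SELECT  COUNT(*) AS [BufferedPages],
              SUM(buff.[free_space_in_bytes]) AS [FreeBytesInBufferedPages]
      FROM    sys.dm_os_buffer_descriptors buff
      WHERE   buff.[database_id] = DB_ID();
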
  • Space reserved by FILLFACTOR < 100. Newly created pages do not respect the FILLFACTOR setting, but doing a REBUILD will reserve that space on each data page. The idea behind the reserved space is that it will be used by non-sequential inserts and/or updates that expand the size of rows on the page already, due to variable length columns being updated with slightly more data (but not enough to cause a page-split). But you could easily reserve space on data pages that would naturally never get new rows and never have the existing rows updated, or at least not updated in a way that would increase the size of the row.
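
    To see what FILLFACTOR is currently set on the Clustered Index (or any other index) of a table, something like this works (the table name is a placeholder; a fill_factor of 0 means the default, which is effectively 100):

    SELECT  si.[name] AS [IndexName],
            si.[index_id],
            si.[fill_factor]
    FROM    sys.indexes si
    WHERE   si.[object_id] = OBJECT_ID(N'dbo.table_name');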

  • Page-Splits (fragmentation): Needing to add a row to a location that has no room for the row will cause a page split. In this case, approx 50% of the existing data is moved to a new page and the new row is added to one of the 2 pages. But you now have a bit more free space that is not accounted for by DATALENGTH calculations.

  • Rows marked for deletion. When you delete rows, they are not always immediately removed from the data page. If they cannot be removed immediately, they are "marked for death" (Steven Seagal reference) and will be physically removed later by the ghost cleanup process (I believe that is the name). However, these might not be relevant to this particular Question.
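
    You can check whether any such ghost records currently exist via sys.dm_db_index_physical_stats in DETAILED mode (a sketch; the table name is a placeholder):

    SELECT  ips.[index_id],
            ips.[index_level],
            ips.[ghost_record_count],
            ips.[version_ghost_record_count]
    FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.table_name'),
                                           NULL, NULL, N'DETAILED') ips
    WHERE   ips.[index_id] < 2; -- 0 = Heap; 1 = Clustered Index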

  • Ghost pages? Not sure if that is the proper term, but sometimes data pages do not get removed until a REBUILD of the Clustered Index is done. That would also account for more pages than DATALENGTH would add up to. This generally should not happen, but I have run into it once, several years ago.

  • SPARSE columns: Sparse columns save space (mostly for fixed-length datatypes) in tables where a large % of the rows are NULL for one or more columns. The SPARSE option makes the NULL value take up 0 bytes (instead of the normal fixed-length amount, such as 4 bytes for an INT), but non-NULL values each take up an additional 4 bytes for fixed-length types and a variable amount for variable-length types. The issue here is that DATALENGTH does not include the extra 4 bytes for non-NULL values in a SPARSE column, so those 4 bytes need to be added back in. You can check to see if there are any SPARSE columns via:

    SELECT OBJECT_SCHEMA_NAME(sc.[object_id]) AS [SchemaName],
           OBJECT_NAME(sc.[object_id]) AS [TableName],
           sc.name AS [ColumnName]
    FROM   sys.columns sc
    WHERE  sc.is_sparse = 1;
    

    And then for each SPARSE column, update the original query to use:

    SUM(DATALENGTH(FieldN) + 4)
    

    Please note that the calculation above to add in a standard 4 bytes is a bit simplistic as it only works for fixed-length types. AND, there is additional meta-data per row (from what I can tell so far) that reduces the space available for data, simply by having at least one SPARSE column. For more details, please see the MSDN page for Use Sparse Columns.

  • Index and other (e.g. IAM, PFS, GAM, SGAM, etc) pages: these are not "data" pages in terms of user data. These will inflate the total size of the table. If using SQL Server 2012 or newer, you can use the sys.dm_db_database_page_allocations Dynamic Management Function (DMF) to see the page types (earlier versions of SQL Server can use DBCC IND(0, N'dbo.table_name', 0);):

    SELECT *
    FROM   sys.dm_db_database_page_allocations(
                   DB_ID(),
                   OBJECT_ID(N'dbo.table_name'),
                   1,
                   NULL,
                   N'DETAILED'
                  )
    WHERE  page_type = 1; -- DATA_PAGE
    

    Neither the DBCC IND nor sys.dm_db_database_page_allocations (with that WHERE clause) will report any Index pages, and only the DBCC IND will report at least one IAM page.

  • DATA_COMPRESSION: If you have ROW or PAGE Compression enabled on the Clustered Index or Heap, then you can forget about most of what has been mentioned so far. The 96 byte Page Header, 2 bytes-per-row Slot Array, and 14 bytes-per-row Versioning Info are still there, but the physical representation of the data becomes highly complex (much more so than what has already been mentioned when Compression is not being used). For example, with Row Compression, SQL Server attempts to use the smallest possible container to fit each column, per each row. So if you have a BIGINT column which would otherwise (assuming SPARSE is also not enabled) always take up 8 bytes, if the value is between -128 and 127 (i.e. signed 8-bit integer) then it will use just 1 byte, and if the value could fit into a SMALLINT, it will only take up 2 bytes. Integer types that are either NULL or 0 take up no space and are simply indicated as being NULL or "empty" (i.e. 0) in an array mapping out the columns. And there are many, many other rules. Have Unicode data (NCHAR, NVARCHAR(1 - 4000), but not NVARCHAR(MAX), even if stored in-row)? Unicode Compression was added in SQL Server 2008 R2, but there is no way to predict the outcome of the "compressed" value in all situations without doing the actual compression given the complexity of the rules.
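
    An easy way to check whether any of this applies to a given table is to look at the compression setting per partition (the table name is a placeholder):

    SELECT  sp.[index_id],
            sp.[partition_number],
            sp.[data_compression_desc] -- NONE, ROW, PAGE, etc.
    FROM    sys.partitions sp
    WHERE   sp.[object_id] = OBJECT_ID(N'dbo.table_name')
    AND     sp.[index_id] < 2; -- 0 = Heap; 1 = Clustered Index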

So really, your second query, while more accurate in terms of total physical space taken up on disk, is only really accurate upon doing a REBUILD of the Clustered Index. And after that, you still need to account for any FILLFACTOR setting below 100. And even then there are always page headers, and often enough some amount of "wasted" space that is simply not fillable due to being too small to fit any row in this table, or at least the row that logically should go in that slot.

Regarding the accuracy of the 2nd query in determining "data usage", it seems most fair to back-out the Page Header bytes since they are not data usage: they are cost-of-business overhead. If there is 1 row on a data page and that row is just a TINYINT, then that 1 byte still required that the data page existed and hence the 96 bytes of the header. Should that 1 department get charged for the entire data page? If that data page is then filled up by Department #2, would they evenly split that "overhead" cost or pay proportionally? Seems easiest to just back it out. In which case, using a value of 8 to multiply against number of pages is too high. How about:

-- 8192 byte data page - 96 byte header = 8096 (approx) usable bytes.
SELECT 8096.0 / 1024 -- 7.906250

Hence, use something like:

(SUM(a.total_pages) * 7.91) / 1024 AS [TotalSpaceMB]

for all calculations against "number_of_pages" columns.

AND, considering that using DATALENGTH per each field cannot account for the per-row meta-data, that meta-data should be added to your per-table query where you get the DATALENGTH per each field, filtering on each "department":

  • Record Type and offset to NULL Bitmap: 4 bytes
  • Column Count: 2 bytes
  • Slot Array: 2 bytes (not included in "record size" but still needs to be accounted for)
  • NULL Bitmap: 1 byte per every 8 columns (for all columns)
  • Row Versioning: 14 bytes (if database has either ALLOW_SNAPSHOT_ISOLATION or READ_COMMITTED_SNAPSHOT set to ON)
  • Variable-length column Offset Array: 0 bytes if all columns are fixed-length. If any columns are variable-length, then 2 bytes, plus 2 bytes per each of only the variable-length columns.
  • LOB pointers: this part is very imprecise since there won't be a pointer if the value is NULL, and if the value fits on the row then it can be much smaller or much larger than the pointer, and if the value is stored off-row, then the size of the pointer might depend on how much data there is. However, since we just want an estimate (i.e. "swag"), it seems like 24 bytes is a good value to use (well, as good as any other ;-). This is per-each MAX field.

Hence, use something like:

  • In general (row header + number of columns + slot array + NULL bitmap):

    ([RowCount] * (( 4 + 2 + 2 + (1 + (({NumColumns} - 1) / 8) ))
    
  • In general (auto-detect if "version info" is present):

    + (SELECT CASE WHEN snapshot_isolation_state = 1 OR is_read_committed_snapshot_on = 1
                     THEN 14 ELSE 0 END FROM sys.databases WHERE [database_id] = DB_ID())
    
  • IF there are any variable-length columns, then add:

    + 2 + (2 * {NumVariableLengthColumns})
    
  • IF there are any MAX / LOB columns, then add:

    + (24 * {NumLobColumns})
    
  • In general:

    )) AS [MetaDataBytes]
    

This isn't exact, and again will not work if you have Row or Page Compression enabled on the Heap or Clustered Index, but should definitely get you closer.
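
To make that concrete, here is one way to put those pieces together (a hypothetical sketch only: dbo.SomeTable, Col1 / Col2, and the assumed counts of 10 total columns, 3 variable-length columns, and 1 MAX column are made-up placeholders, not taken from the Question):

SELECT  SUM(  ISNULL(DATALENGTH(Col1), 0)
            + ISNULL(DATALENGTH(Col2), 0)
            /* + ... one term per column ... */ ) AS [DataBytes],
        COUNT(*) * ( 4 + 2 + 2                     -- row header + column count + slot array
                   + (1 + ((10 - 1) / 8))          -- NULL Bitmap: 10 columns => 2 bytes
                   + (SELECT CASE WHEN snapshot_isolation_state = 1
                                    OR is_read_committed_snapshot_on = 1
                                  THEN 14 ELSE 0 END
                      FROM sys.databases
                      WHERE [database_id] = DB_ID()) -- Versioning Info, if enabled
                   + 2 + (2 * 3)                    -- variable-length column offset array
                   + (24 * 1)                       -- LOB pointer estimate
                   ) AS [MetaDataBytes]
FROM    dbo.SomeTable;
-- (add a WHERE clause to filter on each "department", as in the original per-table query)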


UPDATE Regarding the 15% Difference Mystery

We (myself included) were so focused on thinking about how data pages are laid out and how DATALENGTH might account for things that we didn't spend a lot of time reviewing the 2nd query. I ran that query against a single table and then compared those values to what was being reported by sys.dm_db_database_page_allocations and they were not the same values for the number of pages. On a hunch, I removed the aggregate functions and GROUP BY, and replaced the SELECT list with a.*, '---' AS [---], p.*. And then it became clear: people must be careful where on these murky interwebs they get their info and scripts from ;-). The 2nd query posted in the Question is not exactly correct, especially for this particular Question.
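
For reference, that diagnostic version looked roughly like this (a sketch, not the exact script; the table name is a placeholder, and it keeps the original query's JOIN on partition_id since that is what was being investigated):

SELECT  a.*, '---' AS [---], p.*
FROM    sys.partitions p
INNER JOIN  sys.allocation_units a
        ON  a.[container_id] = p.[partition_id]
WHERE   p.[object_id] = OBJECT_ID(N'dbo.table_name');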

  • Minor problem: outside of it not making much sense to GROUP BY rows (and not have that column in an aggregate function), the JOIN between sys.allocation_units and sys.partitions isn't technically correct. There are 3 types of Allocation Units, and one of them should JOIN to a different field. Quite often partition_id and hobt_id are the same, so there might never be a problem, but sometimes those two fields do have different values.

  • Major problem: the query uses the used_pages field. That field covers all types of pages: Data, Index, IAM, etc. There is another, more appropriate field to use when concerned with only the actual data: data_pages.
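
    The difference is easy to see per allocation unit (a sketch; the table name is a placeholder, and the JOIN matches on either field since partition_id and hobt_id are usually the same value):

    SELECT  sau.[type_desc],
            sau.[total_pages],
            sau.[used_pages],
            sau.[data_pages],
            sau.[used_pages] - sau.[data_pages] AS [NonDataPages]
    FROM    sys.partitions sp
    INNER JOIN  sys.allocation_units sau
            ON  sau.[container_id] = sp.[hobt_id]
            OR  sau.[container_id] = sp.[partition_id]
    WHERE   sp.[object_id] = OBJECT_ID(N'dbo.table_name');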

I adapted the 2nd query in the Question with the above items in mind, and using the data page size that backs out the page header. I also removed two JOINs that were unnecessary: sys.schemas (replaced with call to SCHEMA_NAME()), and sys.indexes (the Clustered Index is always index_id = 1 and we have index_id in sys.partitions).

SELECT  SCHEMA_NAME(st.[schema_id]) AS [SchemaName],
        st.[name] AS [TableName],
        SUM(sp.[rows]) AS [RowCount],
        (SUM(sau.[total_pages]) * 8.0) / 1024 AS [TotalSpaceMB],
        (SUM(CASE sau.[type]
           WHEN 1 THEN sau.[data_pages]
           ELSE (sau.[used_pages] - 1) -- back out the IAM page
         END) * 7.91) / 1024 AS [TotalActualDataMB]
FROM        sys.tables st
INNER JOIN  sys.partitions sp
        ON  sp.[object_id] = st.[object_id]
INNER JOIN  sys.allocation_units sau
        ON  (   sau.[type] = 1
            AND sau.[container_id] = sp.[hobt_id]) -- IN_ROW_DATA
        OR  (   sau.[type] = 2
            AND sau.[container_id] = sp.[partition_id]) -- LOB_DATA
        OR  (   sau.[type] = 3
            AND sau.[container_id] = sp.[hobt_id]) -- ROW_OVERFLOW_DATA
WHERE       st.is_ms_shipped = 0
--AND         sp.[object_id] = OBJECT_ID(N'dbo.table_name')
AND         sp.[index_id] < 2 -- 1 = Clustered Index; 0 = Heap
GROUP BY    SCHEMA_NAME(st.[schema_id]), st.[name]
ORDER BY    [TotalSpaceMB] DESC;

Maybe this is a grunge answer but this is what I would do.

So DATALENGTH only accounts for 86% of the total. It is still a very representative split. The overhead in the excellent answer from srutzky should have a pretty even split.

I would use your second query (pages) for the total. And use the first (datalength) for allocating the split. Many costs are allocated using a normalization.

And you have to consider that a more accurate answer is going to raise the cost, so even the dept that lost out on a split may still pay more.