Why can it take up to 30 seconds to create a simple CCI rowgroup?

In many respects, this is expected behaviour. Any set of compression routines will have widely ranging performance depending on input data distribution. We expect to trade data loading speed for storage size and runtime querying performance.

There is a definite limit to how detailed an answer you're going to get here, since VertiPaq is a proprietary implementation, and the details are a closely-guarded secret. Even so, we do know that VertiPaq contains routines for:

  • Value encoding (scaling and/or translating values to fit in a small number of bits)
  • Dictionary encoding (integer references to unique values)
  • Run Length Encoding (storing runs of repeated values as [value, count] pairs)
  • Bit-packing (storing the stream in as few bits as possible)

Typically, data will be value or dictionary encoded, then RLE or bit-packing will be applied (or a hybrid of RLE and bit-packing used on different subsections of the segment data). The process of deciding which techniques to apply can involve generating a histogram to help determine how maximum bit savings can be achieved.
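As a rough illustration (my numbers, not the engine's actual accounting): bit-packing 1,048,576 values drawn from 16,384 distinct values needs 14 bits per value, or about 1.75 MB for the segment; if those same values happen to arrive as 16,384 runs of 64 identical values each, pure RLE needs only 16,384 [value, count] pairs, a small fraction of that. Working out which case applies, and where within a segment to switch between them, is what the histogram analysis pays for.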

Capturing the slow case with Windows Performance Recorder and analyzing the result with Windows Performance Analyzer, we can see that the vast majority of the execution time is consumed in looking at the clustering of the data, building histograms, and deciding how to partition it for best savings:

WPA Analysis

The most expensive processing occurs for values that appear at least 64 times in the segment. This is a heuristic to determine when pure RLE is likely to be beneficial. The faster cases result in impure storage (e.g. a bit-packed representation) with a larger final storage size. In the hybrid cases, values with 64 or more repetitions are RLE encoded, and the remainder are bit-packed.

The longest duration occurs when the maximum number of distinct values with 64 repetitions appears in the largest possible segment, i.e. 1,048,576 rows made up of 16,384 sets of 64 entries each. Inspection of the code reveals a hard-coded time limit for the expensive processing. This limit can be configured in other VertiPaq implementations, e.g. SSAS, but not in SQL Server as far as I can tell.

Some insight into the final storage arrangement can be acquired using the undocumented DBCC CSINDEX command. This shows the RLE header and array entries, any bookmarks into the RLE data, and a brief summary of the bit-pack data (if any).
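The command is undocumented, so the exact parameter list below is an assumption and may change between builds, but the identifiers it needs can be gathered from documented catalog views first (shown here against the dbo.CCI_BIGINT test table used later in this thread):

-- Find the hobt_id, column_id and segment_id for the rowgroup of interest
SELECT p.hobt_id, s.column_id, s.segment_id, s.encoding_type, s.row_count, s.on_disk_size
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p
    ON s.hobt_id = p.hobt_id
WHERE p.object_id = OBJECT_ID(N'dbo.CCI_BIGINT');

-- Assumed usage (undocumented; parameters: database, hobt_id, column_id,
-- rowgroup_id, object type 1 = segment / 2 = dictionary, print level):
-- DBCC CSINDEX (N'<database>', <hobt_id>, <column_id>, <segment_id>, 1, 2);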

For more information, see:

  • The VertiPaq Engine in DAX by Alberto Ferrari and Marco Russo
  • Microsoft Patent WO2015038442: Processing datasets with a DBMS engine
  • Microsoft Patent WO2010039898: Efficient large-scale filtering and/or sorting for querying of column based data encoded structures

I can't say exactly why this behavior occurs, but I believe I've developed a good model of it through brute-force testing. The following conclusions only apply when loading data into a single column with integers that are very well distributed.
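The general test pattern looked roughly like this (a sketch rather than the exact harness; dbo.STG_1048576 is the staging table referenced further down, and the DMV at the end is one way to read back the rowgroup size):

-- 1,048,576 sequential BIGINTs to draw from
DROP TABLE IF EXISTS dbo.STG_1048576;
CREATE TABLE dbo.STG_1048576 (ID BIGINT NOT NULL);

INSERT INTO dbo.STG_1048576 WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

-- One test iteration: vary the TOP value and the modulus, then record
-- rowgroup size and CPU time
DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT TOP (102400) ID % 16000   -- TOP_VALUE and MOD_NUM varied per test
FROM dbo.STG_1048576
OPTION (MAXDOP 1);

SELECT state_desc, total_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.CCI_BIGINT');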

First I tried varying the number of rows inserted into the CCI using TOP. I used ID % 16000 for all tests. Below is a graph comparing rows inserted to the compressed rowgroup segment size:

graph of top vs size

Below is a graph of rows inserted versus CPU time in ms. Note that the X-axis has a different starting point:

top vs cpu

We can see that the rowgroup segment size grows at a linear rate and uses a small amount of CPU up until around 1 M rows. At that point the rowgroup size dramatically decreases and CPU usage dramatically increases. It would appear that we pay a heavy price in CPU for that compression.

When inserting fewer than 1,024,000 rows I ended up with an open rowgroup in the CCI. However, forcing compression using REORGANIZE or REBUILD had no effect on the size. As an aside, I found it interesting that when I used a variable for TOP I ended up with an open rowgroup, but with OPTION (RECOMPILE) I ended up with a closed rowgroup.
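Roughly what that aside looked like (a simplified sketch of the repro, using the staging table from above):

DECLARE @top_value BIGINT = 1048576;

-- With a variable TOP, the insert ended up in an open delta rowgroup
INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT TOP (@top_value) ID % 16000
FROM dbo.STG_1048576;

TRUNCATE TABLE dbo.CCI_BIGINT;

-- With OPTION (RECOMPILE), the same insert produced a closed, compressed rowgroup
INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT TOP (@top_value) ID % 16000
FROM dbo.STG_1048576
OPTION (RECOMPILE);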

Next I tested by varying the modulus value while keeping the number of rows the same. Here is a sample of the data when inserting 102400 rows:

╔═══════════╦═════════╦═══════════════╦═════════════╗
║ TOP_VALUE ║ MOD_NUM ║ SIZE_IN_BYTES ║ CPU_TIME_MS ║
╠═══════════╬═════════╬═══════════════╬═════════════╣
║    102400 ║    1580 ║         13504 ║         352 ║
║    102400 ║    1590 ║         13584 ║         316 ║
║    102400 ║    1600 ║         13664 ║         317 ║
║    102400 ║    1601 ║         19624 ║         270 ║
║    102400 ║    1602 ║         25568 ║         283 ║
║    102400 ║    1603 ║         31520 ║         286 ║
║    102400 ║    1604 ║         37464 ║         288 ║
║    102400 ║    1605 ║         43408 ║         273 ║
║    102400 ║    1606 ║         49360 ║         269 ║
║    102400 ║    1607 ║         55304 ║         265 ║
║    102400 ║    1608 ║         61256 ║         262 ║
║    102400 ║    1609 ║         67200 ║         255 ║
║    102400 ║    1610 ║         73144 ║         265 ║
║    102400 ║    1620 ║        132616 ║         132 ║
║    102400 ║    1621 ║        138568 ║         100 ║
║    102400 ║    1622 ║        144512 ║          91 ║
║    102400 ║    1623 ║        150464 ║          75 ║
║    102400 ║    1624 ║        156408 ║          60 ║
║    102400 ║    1625 ║        162352 ║          47 ║
║    102400 ║    1626 ║        164712 ║          41 ║
╚═══════════╩═════════╩═══════════════╩═════════════╝

Up until a mod value of 1600 the rowgroup segment size increases linearly by 80 bytes for each additional 10 unique values. It's an interesting coincidence that a BIGINT traditionally takes up 8 bytes and the segment size increases by 8 bytes for each additional unique value. Past a mod value of 1600 the segment size increases rapidly until it stabilizes.
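As a rough sanity check on that observation: 1,600 unique BIGINT values at 8 bytes each would come to 12,800 bytes, which is in the same ballpark as the 13,664 bytes observed at mod 1600 once some fixed overhead is allowed for. Take that as an illustration rather than a statement about the actual storage format.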

It's also helpful to look at the data when leaving the modulus value the same and changing the number of rows inserted:

╔═══════════╦═════════╦═══════════════╦═════════════╗
║ TOP_VALUE ║ MOD_NUM ║ SIZE_IN_BYTES ║ CPU_TIME_MS ║
╠═══════════╬═════════╬═══════════════╬═════════════╣
║    300000 ║    5000 ║        600656 ║         131 ║
║    305000 ║    5000 ║        610664 ║         124 ║
║    310000 ║    5000 ║        620672 ║         127 ║
║    315000 ║    5000 ║        630680 ║         132 ║
║    320000 ║    5000 ║         40688 ║        2344 ║
║    325000 ║    5000 ║         40696 ║        2577 ║
║    330000 ║    5000 ║         40704 ║        2589 ║
║    335000 ║    5000 ║         40712 ║        2673 ║
║    340000 ║    5000 ║         40728 ║        2715 ║
║    345000 ║    5000 ║         40736 ║        2744 ║
║    350000 ║    5000 ║         40744 ║        2157 ║
╚═══════════╩═════════╩═══════════════╩═════════════╝

It looks like when the inserted number of rows is < ~64 * the number of unique values we see relatively poor compression (2 bytes per row for mod <= 65000) and low, linear CPU usage. When the inserted number of rows is > ~64 * the number of unique values we see much better compression and higher, but still linear, CPU usage. For example, with a mod value of 5000 the threshold is 64 * 5000 = 320,000 rows, which matches the jump between 315,000 and 320,000 rows in the table above. There's a transition between the two states which isn't easy for me to model, but it can be seen in the graph. It doesn't appear to be true that we see the maximum CPU usage when inserting exactly 64 rows for each unique value. Rather, we can only insert a maximum of 1,048,576 rows into a rowgroup, and we see much higher CPU usage and compression once there are more than 64 rows per unique value.

Below is a contour plot of how CPU time changes as the number of inserted rows and the number of unique values change. We can see the patterns described above:

contour cpu

Below is a contour plot of space used by the segment. After a certain point we start to see much better compression, as described above:

contour size

It seems like there are at least two different compression algorithms at work here. Given the above, it makes sense that we would see the maximum CPU usage when inserting 1,048,576 rows. It also makes sense that we see the most CPU usage at that point when there are around 16,000 unique values: 1048576 / 64 = 16384.

I uploaded all of my raw data here in case someone wants to analyze it.

It's worth mentioning what happens with parallel plans. I only observed this behavior with evenly distributed values. When doing a parallel insert there's often an element of randomness and threads are usually unbalanced.

Put 2097152 rows in the staging table:

DROP TABLE IF EXISTS dbo.STG_2097152;
CREATE TABLE dbo.STG_2097152 (ID BIGINT NOT NULL);
INSERT INTO dbo.STG_2097152 WITH (TABLOCK)
SELECT TOP (2097152) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

This insert finishes in less than a second and has poor compression:

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16000
FROM dbo.STG_2097152 
OPTION (MAXDOP 2);

We can see the effect of the unbalanced threads:

╔════════════╦════════════╦══════════════╦═══════════════╗
║ state_desc ║ total_rows ║ deleted_rows ║ size_in_bytes ║
╠════════════╬════════════╬══════════════╬═══════════════╣
║ OPEN       ║      13540 ║            0 ║        311296 ║
║ COMPRESSED ║    1048576 ║            0 ║       2095872 ║
║ COMPRESSED ║    1035036 ║            0 ║       2070784 ║
╚════════════╩════════════╩══════════════╩═══════════════╝
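The rowgroup breakdown above can be pulled with a query along these lines:

SELECT state_desc, total_rows, deleted_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.CCI_BIGINT')
ORDER BY row_group_id;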

There are various tricks that we can do to force the threads to be balanced and to have the same distribution of rows. Here is one of them:

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT FLOOR(0.5 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))  % 15999
FROM dbo.STG_2097152
OPTION (MAXDOP 2);

Choosing an odd number for the modulus is important here. SQL Server scans the staging table in serial, calculates the row number, then uses round robin distribution to put the rows on parallel threads. That means that we'll end up with perfectly balanced threads.

balance 1

The insert takes around 40 seconds which is similar to the serial insert. We get nicely compressed rowgroups:

╔════════════╦════════════╦══════════════╦═══════════════╗
║ state_desc ║ total_rows ║ deleted_rows ║ size_in_bytes ║
╠════════════╬════════════╬══════════════╬═══════════════╣
║ COMPRESSED ║    1048576 ║            0 ║        128568 ║
║ COMPRESSED ║    1048576 ║            0 ║        128568 ║
╚════════════╩════════════╩══════════════╩═══════════════╝

We can get the same results by inserting data from the original staging table:

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT t.ID % 16000 ID
FROM  (
    SELECT TOP (2) ID 
    FROM (SELECT 1 ID UNION ALL SELECT 2 ) r
) s
CROSS JOIN dbo.STG_1048576 t
OPTION (MAXDOP 2, NO_PERFORMANCE_SPOOL);

Here, round-robin distribution is used for the derived table s, so one scan of the staging table is done on each parallel thread:

balanced 2

In conclusion, when inserting evenly distributed integers you can see very high compression when each unique integer appears more than 64 times. This may be due to a different compression algorithm being used. There can be a high cost in CPU to achieve this compression. Small changes in the data can lead to dramatic differences in the size of the compressed rowgroup segment. I suspect that seeing the worst case (from a CPU perspective) will be uncommon in the wild, at least for this data set. It's even harder to see when doing parallel inserts.


I believe that this has to do with internal optimisations of the compression for single-column tables, and the magic number of 64 KB occupied by the dictionary.

For example: running with MOD 16600 gives a final Row Group size of 1.683 MB, while MOD 17000 gives a Row Group of 2.001 MB.

Now, take a look at the dictionaries created (you can use the cstore_GetDictionaries function from my CISL library, or alternatively query the sys.column_store_dictionaries DMV):

(MOD 16600) 61 KB


(MOD 17000) 65 KB

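For reference, the DMV route looks roughly like this (the KB conversion is just for readability):

SELECT p.object_id, d.column_id, d.dictionary_id, d.type, d.entry_count,
       d.on_disk_size / 1024. AS size_kb
FROM sys.column_store_dictionaries AS d
INNER JOIN sys.partitions AS p
    ON d.hobt_id = p.hobt_id
WHERE p.object_id = OBJECT_ID(N'dbo.CCI_BIGINT');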

Funny thing: if you add another column to your table, let's call it REALID:

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, REALID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

Reload the data for the MOD 16600:

TRUNCATE TABLE dbo.CCI_BIGINT;

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16600, ID
FROM dbo.STG_1048576
OPTION (MAXDOP 1);

This time the execution will be fast, because the optimiser will decide not to overwork itself and compress the data too far:

SELECT seg.column_id, seg.segment_id,
       CAST(SUM(seg.on_disk_size) / 1024. / 1024 AS DECIMAL(8,3)) AS SizeInMB
FROM sys.column_store_segments seg
INNER JOIN sys.partitions part
    ON seg.hobt_id = part.hobt_id
WHERE part.object_id = OBJECT_ID('dbo.CCI_BIGINT')
GROUP BY seg.column_id, seg.segment_id;

Even though there will be a small difference between the Row Group sizes, it will be negligible (2.000 MB for MOD 16600 vs 2.001 MB for MOD 17000).

For this scenario, the dictionary for MOD 16600 will be bigger than in the first scenario with 1 column (0.63 vs 0.61).