Inaccurate 'Actual' row counts in parallel plan

Rows are passed across exchanges internally from producer to consumer thread in packets (hence CXPACKET - class exchange packet), rather than a row-at-a-time. There is a certain amount of buffering inside the exchange. Also, the call to shut down the pipeline from the consumer side of the Gather Streams has to be passed in a control packet back to the producer threads. Scheduling and other internal considerations mean that parallel plans always have a certain 'stopping distance'.

As a consequence, you will often see this sort of row count difference where less than the whole potential rowset of a sub-tree is actually required. In this case, the TOP brings execution to an 'early end'.

More information:

  • Parallel Execution Plans Suck
  • Understanding and Using Parallelism in SQL Server.

I think I may have a partial explanation for this but please feel free to shoot it down or post any alternatives. @MartinSmith is definitely on to something by highlighting the effect of TOP in the execution plan.

Put simply, ‘Actual Row Count’ is not a count of the rows an operator processes, it is the number of times the GetNext() method of the operator is called.

Taken from BOL:

The physical operators initialize, collect data, and close. Specifically, the physical operator can answer the following three method calls:

  • Init(): The Init() method causes a physical operator to initialize itself and set up any required data structures. The physical operator may receive many Init() calls, though typically a physical operator receives only one.
  • GetNext(): The GetNext() method causes a physical operator to get the first, or subsequent row of data. The physical operator may receive zero or many GetNext() calls.
  • Close(): The Close() method causes a physical operator to perform some clean-up operations and shut itself down. A physical operator only receives one Close() call.

The GetNext() method returns one row of data, and the number of times it is called appears as ActualRows in the Showplan output that is produced by using SET STATISTICS PROFILE ON or SET STATISTICS XML ON.

For the sake of completeness, a little background on the parallel operators is useful. Work is distributed to multiple streams in a parallel plan by the repartition stream or distribute stream operators. These distribute rows or pages between threads using one of four mechanisms:

  • Hash distributes rows based on a hash of the columns in the row
  • Round-robin distributes rows by iterating through the list of threads in a loop
  • Broadcast distributes all pages or rows to all threads
  • Demand partitioning is used only for scans. Threads spin up, request a page of data from the operator, processes it and requests a further page when done.

The first distribute stream operator (right most in the plan) uses demand partitioning on the rows originating from a constant scan. There are three threads which call GetNext() 6, 4 and 0 times for a total of 10 ‘Actual Rows’:

<RunTimeInformation>
       <RunTimeCountersPerThread Thread="2" ActualRows="6" ActualEndOfScans="1" ActualExecutions="1" />
       <RunTimeCountersPerThread Thread="1" ActualRows="4" ActualEndOfScans="1" ActualExecutions="1" />
       <RunTimeCountersPerThread Thread="0" ActualRows="0" ActualEndOfScans="0" ActualExecutions="0" />
 </RunTimeInformation>

At the next distribution operator we have three threads again, this time with 50, 50 and 0 calls to GetNext() for a total 100:

<RunTimeInformation>
    <RunTimeCountersPerThread Thread="2" ActualRows="50" ActualEndOfScans="1" ActualExecutions="1" />
    <RunTimeCountersPerThread Thread="1" ActualRows="50" ActualEndOfScans="1" ActualExecutions="1" />
    <RunTimeCountersPerThread Thread="0" ActualRows="0" ActualEndOfScans="0" ActualExecutions="0" />
</RunTimeInformation>

It’s at the next parallel operator that the cause and explanation possibly appears.

<RunTimeInformation>
    <RunTimeCountersPerThread Thread="2" ActualRows="1" ActualEndOfScans="0" ActualExecutions="1" />
    <RunTimeCountersPerThread Thread="1" ActualRows="10" ActualEndOfScans="0" ActualExecutions="1" />
    <RunTimeCountersPerThread Thread="0" ActualRows="0" ActualEndOfScans="0" ActualExecutions="0" />
</RunTimeInformation>

So we now have 11 calls to GetNext(), where we were expecting to see 10.

Edit: 2011-11-13

Stuck at this point, I went hawking for answers with the chaps in the clustered index and @MikeWalsh kindly directed @SQLKiwi here.


1,004,588 is a figure that crops up a lot in my testing as well.

I also see this for the somewhat simpler plan below.

WITH 
E1(N) AS 
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)                                       -- 1*10^1 or 10 rows
, E2(N) AS (SELECT 1 FROM E1 a, E1 b)   -- 1*10^2 or 100 rows
, E4(N) AS (SELECT 1 FROM E2 a, E2 b)   -- 1*10^4 or 10,000 rows
SELECT * INTO #E4 FROM E4;

WITH E8(N) AS (SELECT 1 FROM #E4 a, #E4 b),
Nums(N) AS (SELECT  TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM E8 )
SELECT COUNT(N) FROM Nums

DROP TABLE #E4

Plan

Other figures of interest in the execution plan are

+----------------------------------+--------------+--------------+-----------------+
|                                  | Table Scan A | Table Scan B | Row Count Spool |
+----------------------------------+--------------+--------------+-----------------+
| Number Of Executions             | 2            |            2 |             101 |
| Actual Number Of Rows - Total    | 101          |        20000 |         1004588 |
| Actual Number Of Rows - Thread 0 | -            |              |                 |
| Actual Number Of Rows - Thread 1 | 95           |        10000 |          945253 |
| Actual Number Of Rows - Thread 2 | 6            |        10000 |           59335 |
| Actual Rebinds                   | 0            |            0 |               2 |
| Actual Rewinds                   | 0            |            0 |              99 |
+----------------------------------+--------------+--------------+-----------------+

My guess is just that because the tasks are being processed in parallel one task is in mid flight processing rows when the other one delivers the millionth row to the gather streams operator so additional rows are being handled. Additionally from this article the rows are buffered and delivered in batches to this iterator so it seems quite likely that the number of rows being processed would exceed rather than exactly hit the TOP specification in any event.

Edit

Just looking at this in a bit more detail. I noticed I was getting more variety than just the 1,004,588 row count quoted above so ran the query above in a loop for 1,000 iterations and captured the actual execution plans. Discarding the 81 results for which the degree of parallelism was zero gave the following figures.

count       Table Scan A: Total Actual Row Spool - Total Actual Rows
----------- ------------------------------ ------------------------------
352         101                            1004588
323         102                            1004588
72          101                            1003565
37          101                            1002542
35          102                            1003565
29          101                            1001519
18          101                            1000496
13          102                            1002542
5           9964                           99634323
5           102                            1001519
4           9963                           99628185
3           10000                          100000000
3           9965                           99642507
2           9964                           99633300
2           9966                           99658875
2           9965                           99641484
1           9984                           99837989
1           102                            1000496
1           9964                           99637392
1           9968                           99671151
1           9966                           99656829
1           9972                           99714117
1           9963                           99629208
1           9985                           99847196
1           9967                           99665013
1           9965                           99644553
1           9963                           99623626
1           9965                           99647622
1           9966                           99654783
1           9963                           99625116

It can be seen that 1,004,588 was by far the most common result but that on 3 occasions the worst possible case occurred and 100,000,000 rows were processed. The best case observed was 1,000,496 row count, which occurred 19 times.

The Full script to reproduce is at the bottom of revision 2 of this answer (it will need tweaking if run on a system with more than 2 processors).

Tags:

Sql Server