Delete millions of rows from a SQL table

The execution plan shows that it is reading rows from a nonclustered index in some order, then performing seeks for each outer row read to evaluate the NOT EXISTS.


You are deleting 7.2% of the table: 16,000,000 rows in 3,556 batches of 4,500.

Assuming that the rows that qualify are evenly distributed throughout the index, this means it will delete approximately 1 row in every 13.8 rows read.

So iteration 1 will read 62,156 rows and perform that many index seeks before it finds 4,500 to delete.

Iteration 2 will read 57,656 (62,156 - 4,500) rows that definitely won't qualify, ignoring any concurrent updates (as they have already been processed), and then another 62,156 rows to get 4,500 to delete.

Iteration 3 will read (2 * 57,656) + 62,156 rows, and so on, until finally iteration 3,556 will read (3,555 * 57,656) + 62,156 rows and perform that many seeks.

So the number of index seeks performed across all batches is SUM(1, 2, ..., 3554, 3555) * 57,656 + (3,556 * 62,156),

which is ((3,555 * 3,556 / 2) * 57,656) + (3,556 * 62,156) = 364,652,494,976.
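As a quick sanity check on that arithmetic (a throwaway query using the same figures as above):

SELECT ( (3555 * 3556) / 2 ) * CAST(57656 AS BIGINT)
       + 3556 * CAST(62156 AS BIGINT) AS TotalSeeks; -- returns 364,652,494,976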

I would suggest that you materialise the rows to delete into a temp table first:

CREATE TABLE #MyTempTable
(
    PK          BIGINT NOT NULL PRIMARY KEY, -- use the same data type as MySourceTable.PK
    BatchNumber INT NOT NULL
);

INSERT INTO #MyTempTable (PK, BatchNumber)
SELECT MySourceTable.PK,
       1 + ( ROW_NUMBER() OVER (ORDER BY MySourceTable.PK) / 4500 ) AS BatchNumber
FROM   MySourceTable
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.vendor AS v
                   WHERE  MySourceTable.VendorId = v.Id);

Then change the DELETE so that it uses WHERE PK IN (SELECT PK FROM #MyTempTable WHERE BatchNumber = @BatchNumber). You may still need to include a NOT EXISTS in the DELETE query itself to cater for updates made since the temp table was populated, but this should be much more efficient, as it will only need to perform 4,500 seeks per batch.
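A rough sketch of that batched DELETE, assuming PK is the clustered primary key of MySourceTable and that #MyTempTable was populated as above (the variable names and loop shape here are illustrative, not tested against your schema):

DECLARE @BatchNumber INT = 1,
        @MaxBatchNumber INT;

SELECT @MaxBatchNumber = MAX(BatchNumber)
FROM   #MyTempTable;

WHILE @BatchNumber <= @MaxBatchNumber
BEGIN
    DELETE FROM MySourceTable
    WHERE  PK IN (SELECT PK
                  FROM   #MyTempTable
                  WHERE  BatchNumber = @BatchNumber)
           -- re-check in case rows changed after the temp table was built
           AND NOT EXISTS (SELECT *
                           FROM   dbo.vendor AS v
                           WHERE  MySourceTable.VendorId = v.Id);

    SET @BatchNumber = @BatchNumber + 1;
END;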


The execution plan suggests that each successive loop will do more work than the previous loop. Assuming that the rows to delete are evenly distributed throughout the table, the first loop will need to scan about 4500 * 221000000 / 16000000 = 62156 rows to find 4500 rows to delete. It will also do the same number of clustered index seeks against the vendor table. However, the second loop will need to read past the same 62156 - 4500 = 57656 rows that you didn't delete the first time. We might expect the second loop to scan 120000 rows from MySourceTable and to do 120000 seeks against the vendor table.

The amount of work needed per loop increases at a linear rate. As an approximation, we can say that the average loop will need to read 102516868 rows from MySourceTable and to do 102516868 seeks against the vendor table. To delete 16 million rows with a batch size of 4500, your code needs to do 16000000/4500 = 3556 loops, so the total amount of work for your code to complete is around 364.5 billion rows read from MySourceTable and 364.5 billion index seeks.

A smaller issue is that you use a local variable @BATCHSIZE in a TOP expression without a RECOMPILE or some other hint. The query optimizer will not know the value of that local variable when creating a plan. It will assume that it equals 100. In reality you're deleting 4500 rows instead of 100, and you could possibly end up with a less efficient plan due to that discrepancy. The low cardinality estimate when inserting into a table can cause a performance hit as well. SQL Server might pick a different internal API to do inserts if it thinks that it needs to insert 100 rows as opposed to 4500 rows.
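As an illustration of the fix (a sketch only, not your exact statement), OPTION (RECOMPILE) lets the optimizer see the runtime value of the variable:

DELETE TOP (@BATCHSIZE)
FROM   MySourceTable
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.vendor AS v
                   WHERE  MySourceTable.VendorId = v.Id)
OPTION (RECOMPILE); -- the plan is built for @BATCHSIZE = 4500 rather than the default guess of 100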

One alternative is to simply insert the primary keys/clustered keys of the rows that you want to delete into a temporary table. Depending on the size of your key columns this could easily fit into tempdb. You can get minimal logging in that case which means that the transaction log won't blow up. You can also get minimal logging against any database with a recovery model of SIMPLE. See the link for more information on the requirements.
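A minimal sketch of that approach, assuming PK is the key column of MySourceTable (the temp table name is illustrative):

-- SELECT ... INTO a #temp table is minimally logged; tempdb always uses the SIMPLE recovery model
SELECT MySourceTable.PK
INTO   #RowsToDelete
FROM   MySourceTable
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.vendor AS v
                   WHERE  MySourceTable.VendorId = v.Id);

CREATE CLUSTERED INDEX CX_RowsToDelete ON #RowsToDelete (PK); -- supports efficient batched deletes afterwards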

If that's not an option then you should change your code so that you can take advantage of the clustered index on MySourceTable. The key thing is to write your code so that you do approximately the same amount of work per loop. You can do that by taking advantage of the index instead of just scanning the table from the beginning each time. I wrote a blog post that goes over some different methods of looping. The examples in that post do inserts into a table instead of deletes but you should be able to adapt the code.

In the example code below I assume that ID is the primary key and clustered key of your MySourceTable. I wrote this code pretty quickly and am not able to test it:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

DECLARE @BATCHSIZE INT,
        @ITERATION INT,
        @TOTALROWS INT,
        @MSG VARCHAR(500),
        @STARTID BIGINT,
        @NEXTID BIGINT;
SET DEADLOCK_PRIORITY LOW;
SET @BATCHSIZE = 4500;
SET @ITERATION = 0;
SET @TOTALROWS = 0;

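-- Seed @STARTID with the lowest ID in the table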
SELECT @STARTID = ID
FROM MySourceTable
ORDER BY ID
OFFSET 0 ROWS
FETCH FIRST 1 ROW ONLY;

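-- Seed @NEXTID with the ID 60,000 rows further on; it becomes the start of the next batch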
SELECT @NEXTID = ID
FROM MySourceTable
WHERE ID >= @STARTID
ORDER BY ID
OFFSET (60000) ROWS
FETCH FIRST 1 ROW ONLY;

BEGIN TRY
    WHILE @STARTID IS NOT NULL
        BEGIN
            -- each batch runs in its own transaction so the COMMIT inside the loop always has a matching BEGIN
            BEGIN TRANSACTION;

            WITH MySourceTable_DELCTE AS (
                SELECT TOP (60000) *
                FROM MySourceTable
                WHERE ID >= @STARTID
                ORDER BY ID
            )           
            DELETE FROM MySourceTable_DELCTE
            OUTPUT DELETED.*
            INTO MyBackupTable
            WHERE NOT EXISTS (
                                 SELECT NULL AS Empty
                                 FROM   dbo.vendor AS v
                                 WHERE  VendorId = v.Id
                             );

            SET @BATCHSIZE = @@ROWCOUNT;
            SET @ITERATION = @ITERATION + 1;
            SET @TOTALROWS = @TOTALROWS + @BATCHSIZE;
            SET @MSG = CAST(GETDATE() AS VARCHAR) + ' Iteration: ' + CAST(@ITERATION AS VARCHAR) + ' Total deletes:' + CAST(@TOTALROWS AS VARCHAR) + ' Batch size:' + CAST(@BATCHSIZE AS VARCHAR);
            PRINT @MSG;
            COMMIT TRANSACTION;

            CHECKPOINT; -- helps log space reuse between batches if the database uses the SIMPLE recovery model

            SET @STARTID = @NEXTID;
            SET @NEXTID = NULL;

            SELECT @NEXTID = ID
            FROM MySourceTable
            WHERE ID >= @STARTID
            ORDER BY ID
            OFFSET (60000) ROWS
            FETCH FIRST 1 ROW ONLY;

        END;
END TRY
BEGIN CATCH
    IF @@ERROR <> 0
       AND @@TRANCOUNT > 0
        BEGIN
            PRINT 'An error occurred. The database update failed.';
            ROLLBACK TRANSACTION;
        END;
END CATCH;
GO

The key part is here:

WITH MySourceTable_DELCTE AS (
    SELECT TOP (60000) *
    FROM MySourceTable
    WHERE ID >= @STARTID
    ORDER BY ID
)   

Each loop will only read 60000 rows from MySourceTable. That should result in an average delete size of 4500 rows per transaction and a maximum delete size of 60000 rows per transaction. If you want to be more conservative with a smaller batch size that's fine too. The @STARTID variable advances after each loop so you can avoid reading the same row more than once from the source table.


Two thoughts spring to mind:

The delay is probably due to index maintenance at that volume of data. Try dropping the indexes, deleting, and rebuilding the indexes.

Or..

It may be faster to copy the rows you want to keep into a temporary table, drop the table with the 16 million rows, and rename the temporary table (or copy them into a new instance of the source table).
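A rough sketch of the second idea, with illustrative object names; any indexes, constraints, and permissions would need to be recreated on the new table before the swap:

-- copy only the rows you want to keep (those that still match a vendor)
SELECT *
INTO   MySourceTable_Keep
FROM   MySourceTable
WHERE  EXISTS (SELECT *
               FROM   dbo.vendor AS v
               WHERE  MySourceTable.VendorId = v.Id);

-- recreate indexes and constraints on MySourceTable_Keep here, then swap the names
BEGIN TRANSACTION;
EXEC sp_rename 'MySourceTable', 'MySourceTable_Old';
EXEC sp_rename 'MySourceTable_Keep', 'MySourceTable';
COMMIT TRANSACTION;

DROP TABLE MySourceTable_Old;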