Can I optimize this MERGE statement?

Personally, I don't like MERGE because there are many unresolved bugs:

  • This blog post (scroll down to "Other MERGE issues")
  • This answer by AlexKuznetsov
  • These posts on Paul White's blog
  • I published a cautionary tip about MERGE here, including references to several bugs that haven't been addressed.
  • And this list may not even be exhaustive - given the number of bugs discovered by this small set of users, I have to wonder how many haven't been discovered yet.

MERGE also gives a false sense of security in terms of optimistic concurrency and race conditions. See Dan Guzman's blog post for more details.
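
Dan Guzman's point, in short, is that the WHEN NOT MATCHED branch of a MERGE is not protected against a concurrent insert unless you add a serializable hint yourself. Here is a minimal sketch of the kind of single-row upsert he discusses (the variables, the INSERT branch, and the assumption of a unique key on VisitorID are illustrative only, not taken from the question):

DECLARE @VisitorID int = 123, @UserIpAddress varchar(45) = '203.0.113.10';

-- Without the HOLDLOCK (serializable) hint, two concurrent sessions can both
-- take the WHEN NOT MATCHED path, and the second insert then fails on the key.
MERGE dbo.ShoppingCart WITH (HOLDLOCK) AS c
USING (SELECT @VisitorID, @UserIpAddress) AS s (VisitorID, UserIpAddress)
ON c.VisitorID = s.VisitorID
WHEN MATCHED THEN
  UPDATE SET UserIpAddress = s.UserIpAddress
WHEN NOT MATCHED THEN
  INSERT (VisitorID, UserIpAddress) VALUES (s.VisitorID, s.UserIpAddress);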

I'm not trying to be a fear-monger here. But I also find the syntax unintuitive and daunting, so I would only use MERGE in cases where it's actually needed and where I can prove I'm not affected by any of the above issues. And I don't see what I could possibly gain from using it for an operation that could only ever end in an UPDATE anyway.

So here is how I would do it instead, using syntax that is much more familiar to me:

DECLARE @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()); -- same 6-month cutoff used in the chunked version below

;WITH s AS 
(
  SELECT VisitorID, UserIpAddress FROM 
  (
    SELECT 
      VisitorID,
      UserIpAddress,
      rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC) -- most recent session per visitor = 1
    FROM dbo.VisitorSession
    WHERE UserIpAddress IS NOT NULL
    AND CreateDate > @dt
  ) AS x
  WHERE rn = 1
)
UPDATE c
  SET c.UserIpAddress = s.UserIpAddress
  FROM dbo.ShoppingCart AS c
  INNER JOIN s
  ON c.VisitorID = s.VisitorID;

You can also break this operation into chunks to reduce the impact on the transaction log, which may, in turn, reduce the overall duration. I blogged about this here.

Here is how I would handle that approach:

DECLARE 
  @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()), 
  @rc INT = 1;

WHILE @rc > 0  -- keep looping until an iteration updates no rows
BEGIN

  BEGIN TRANSACTION;

  ;WITH s AS 
  (
    SELECT TOP (100000) VisitorID, UserIpAddress FROM  -- 100,000 = chunk size; adjust to taste
    (
      SELECT 
        VisitorID,
        UserIpAddress,
        rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
      FROM dbo.VisitorSession AS s
      WHERE UserIpAddress IS NOT NULL
      AND CreateDate > @dt
      AND EXISTS
      ( 
        SELECT 1 FROM dbo.ShoppingCart AS c
          WHERE c.VisitorID = s.VisitorID
          AND (c.UserIpAddress <> s.UserIpAddress
          OR c.UserIpAddress IS NULL)
      )
    ) AS x
    WHERE rn = 1
  )
  UPDATE c
    SET c.UserIpAddress = s.UserIpAddress
    FROM dbo.ShoppingCart AS c
    INNER JOIN s
    ON c.VisitorID = s.VisitorID;

  SET @rc = @@ROWCOUNT; -- rows updated in this pass; 0 ends the loop

  COMMIT TRANSACTION;
END

Of course, as the blog post illustrates, you can gain just about as much time by making sure your log is big enough to handle the entire transaction without having to grow - most of the delay likely comes from the many, many autogrow operations needed to accommodate your large transaction. Sadly, until you've run the operation once, it can be very difficult to guess how much transaction log you'll need...
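
For what it's worth, here is a rough sketch of that log management (the database name, log file name, and 8 GB figure are placeholders, and sys.dm_db_log_space_usage requires SQL Server 2012 or later):

-- Pre-size the log so the large update doesn't pay for repeated autogrow events;
-- size it to whatever the operation actually needs.
ALTER DATABASE YourDatabase
  MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 8192MB);

-- After running the operation once, check how much log space it actually used,
-- so you can size the log properly the next time:
SELECT used_log_space_in_bytes  / 1048576.0 AS used_log_mb,
       total_log_size_in_bytes  / 1048576.0 AS total_log_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;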


I ran a test against Aaron Bertrand's Common Table Expression (CTE) approach. The performance was nearly identical (07:56 for the CTE versus 07:55 for the MERGE, only testing once each).

Next, I attempted Aaron's chunked statement (to prevent blocking). However, I ran into an issue with re-grouping the records on every pass; the performance dropped to unacceptable levels because of this constraint on such a large table. On the upside, his formulation gave me an idea for how to reduce the running time and avoid using MERGE.

Switching to a CTE was an easy choice. By reducing the result set from the smaller joining table first (the one with 9,000,000 records), the total number of records to join on, and update, was reduced significantly.

/* results from shared development server:
    run #1:  00:52 with 193,379 rows affected
    run #2:  00:56 with 193,379 rows affected
    run #3:  02:19 with 193,379 rows affected
    run #4:  00:15 with 193,379 rows affected
    run #5:  01:59 with 193,379 rows affected
    run #6:  01:14 with 193,379 rows affected
*/
declare @dt datetime = cast(dateadd(month, -6, sysdatetime()) as date);

;with cte as (
    select distinct  -- 00:26, with 179,160 rows
        visitorid
    from ShoppingCart (nolock) sc
    where CreateDate > @dt
)
update ShoppingCart
set UserIpAddress = desired.UserIPAddress
from ShoppingCart
join (
    select *  -- 02:36 with 179,016 rows
    from (
        select
            vs.visitorid
            ,vs.useripaddress
            ,row_number() over
            (partition by vs.visitorid order by vs.createdate desc) as rownumber
        from visitorsession (nolock) vs
        join cte on vs.VisitorID = cte.VisitorID
        where vs.CreateDate > @dt
        and vs.UserIPAddress is not null
    ) as subTbl
    where rownumber = 1
) as desired on ShoppingCart.VisitorID = desired.VisitorID
where ShoppingCart.CreateDate > @dt;

Feeding a much smaller result set from the CTE into the sub-select for the UPDATE radically improved the performance of the UPDATE. Preserving the 6-month timeframe constraint at each step also kept the result set small enough for SQL Server to optimize well. The final query typically runs in about 1 minute in the development environment. In production, it will most likely run in much less time, and I don't expect any requests to time out with our current configuration.
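
As a quick sanity check after the update (reusing @dt from the batch above; carts with no matching session are expected to keep a NULL IP), something like this shows how many recent carts still have no address:

select count(*) as carts_still_missing_ip
from ShoppingCart sc
where sc.CreateDate > @dt
and sc.UserIpAddress is null;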