Creating an Index that is not used by a (SELECT) query reduces performance of that query

Current State

In your "good" case the plan looks as below. The upper branch is sorted by SalesOrderID (and therefore also by SalesOrderID,ProductId as the grouping ensures there is only one row per SalesOrderID). The lower branch reads the index in SalesOrderID,ProductId order and they are merge joined together.

enter image description here

Your "bad" case uses a typical execution plan for a row-mode windowed aggregate (with a common subexpression spool). It isn't as bad as the linked video makes out. Reads are reported per row read not per page read for these work tables (so multiplying reads by 8 KB to calculate the data in the spool is certainly invalid) but nonetheless the "bad" case is costed more according to SQL Server's cost model so why does it choose it?

Merge join refresher

A merge join on columns X,Y requires both inputs to the merge join be ordered in a compatible manner. At a minimum both inputs must be ordered at least by the same initial column (i.e. both ordered by X or both ordered by Y). For greatest efficiency (i.e. ideally to avoid overhead of "many to many" merge join) they should generally both be ordered in the same manner for all keys involved in the equi join predicate - i.e. either ORDER BY X,Y or ORDER BY Y,X (the ASC, DESC direction of each key is not important but must be the same in both inputs)

How does it choose between `ORDER BY X,Y` and `ORDER BY Y,X`?

This seems to be a general limitation of merge join and composite join predicates. In the absence of any external reason to choose a specified ordering (e.g. an explicit ORDER BY) it will just decide a somewhat arbitrary column order.

For table sources apparently it will look for the first index it comes across that is suitable for providing the required columns in either order and adopt the key column order from that as the order required for the merge join.

The order used for index matching appears to be in reverse order of indexid. This will generally correlate with index creation order but not always (as clustered indexes are reserved an indexid of 1 or a gap in ids may be filled in following a DROP INDEX)

It will not do any analysis as to whether subsequent indexes might be better suited (e.g. as narrower or more compatible with index hints already used in the query).

I'm not 100% certain that this is the code responsible but the call stack below indicates that the implementation rule for sort merge join is looking in table metadata to find out some "Natural Sort"

enter image description here

Another example of adding an index making the plan worse - concentrating on merge join

CREATE TABLE #Test(X INT, Y INT, Filler CHAR(8000),INDEX IX1 UNIQUE /*Index id = 2*/ (X,Y));

INSERT INTO #Test (X,Y) SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY 1/0), ROW_NUMBER() OVER (ORDER BY 1/0) FROM sys.all_objects o1, sys.all_objects o2;

SET STATISTICS IO ON;

--#1 logical reads 60 uses IX1 - merge join chosen organically
SELECT t1.X, t1.Y
FROM #Test t1 
INNER JOIN #Test t2 ON t1.X = t2.X and t1.Y = t2.Y

SET STATISTICS IO OFF;
CREATE UNIQUE CLUSTERED INDEX ix2 ON #Test(Y,X) 
SET STATISTICS IO ON;

--#2 logical reads 50, Still using IX1 and merge join. The clustered index just created has an id of 1 so lower than 2. 
--IX1 no longer contains a RID so reads a bit lower than previously
SELECT t1.X, t1.Y
FROM #Test t1 
INNER JOIN #Test t2 ON t1.X = t2.X and t1.Y = t2.Y

SET STATISTICS IO OFF;
CREATE UNIQUE INDEX ix3 ON #Test(Y,X) INCLUDE (Filler);
SET STATISTICS IO ON;

--#3 logical reads 20,068 - No longer chooses MERGE join of its own accord and if forced uses more expensive index
SELECT t1.X, t1.Y
FROM #Test t1 
INNER MERGE JOIN #Test t2 ON t1.X = t2.X and t1.Y = t2.Y;

--#4 Back to 50 reads. The merge join is happy to use the order required by the ORDER BY
SELECT t1.X, t1.Y
FROM #Test t1 
INNER MERGE JOIN #Test t2 ON t1.X = t2.X and t1.Y = t2.Y
ORDER BY t1.X, t1.Y;

--#5 50 reads but now uses hash join 
SELECT t1.X, t1.Y
FROM #Test t1 
INNER JOIN #Test t2 ON t1.X = t2.X and t1.Y+0 = t2.Y+0;

--#6 50 reads, Forcing the merge join and looking at properties shows it is now seen as "many to many"
SELECT t1.X, t1.Y
FROM #Test t1 
INNER MERGE JOIN #Test t2 ON t1.X = t2.X and t1.Y+0 = t2.Y+0;

SET STATISTICS IO OFF;

--What if there is no useful index?
DROP INDEX ix3 ON #Test
DROP INDEX ix1 ON #Test
DROP INDEX ix2 ON #Test
CREATE CLUSTERED INDEX ix4 ON #Test(X)

--#7 Sorts are by X, Y 
SELECT t1.X, t1.Y
FROM #Test t1 
INNER MERGE JOIN #Test t2 ON t1.X = t2.X and t1.Y = t2.Y;

--#8 Sorts are by Y, X 
SELECT t1.X, t1.Y
FROM #Test t1 
INNER MERGE JOIN #Test t2 ON  t1.Y = t2.Y AND t1.X = t2.X;


DROP TABLE #Test

#1 There is a covering non clustered index on the table with columns (X,Y) - Merge join is chosen naturally (without hints) and uses index ordered scan.
#2 Add a clustered index with keys ordered (Y,X). Merge join still chosen naturally and using originally created index. The clustered index will have an id of 1 - which is lower than the pre-existing index.
#3 Add a covering nonclustered index with keys ordered (Y,X). This requires a lot more reads to scan than IX1 as includes a very wide column (Filler) - Nonetheless SQL Server now only considers (Y,X) ordered merge join and will not select merge join without hints due to the extra reads...
#4 ... Except if it has an additional reason to use the (X,Y) order. Adding an ORDER BY t1.X, t1.Y reverts to the original plan.
#5 Changing the predicate to t1.Y+0 = t2.Y+0 defeats the attempt to match (Y,X) but without hints it chooses a hash join
#6 Hinting the MERGE join shows it is now seen as "many to many" - Which is why SQL Server does not automatically select it. But when hinted it is able to use the narrower (X,Y) index.
#7 After dropping all useful indexes and creating a clustered index with single key column X it appears the merge join now wants X,Y order (rather than using the index on X to do a many to many merge join with a residual predicate)
#8 Inverting the order of the predicates from case #7 now also results in a different sort order for the merge join (of Y,X)

Why don't index hints work?

In the final example in your question where you use index hints SQL Server does consider merge join but as the merge join has already decided it is going to require column order of ProductId,SalesOrderID then a plan that used both the hinted indexes and a merge join would require a scan on the hinted index followed by a sort to get it into the needed order for the join. So this idea is dismissed on cost grounds (as the memo paths with <EnforceSort>PhyOp_Sort are more expensive than the eventual plan selected).

Related answer: Why does changing the declared join column order introduce a sort?

Addendum

Paul White added this insightful comment to the answer

another way to get the merge join without sorts in the question query is to change to ...WHERE ProductID + 0 = (SELECT AVG(ProductID)... Finding optimal ordering is NP-hard so database engines rely on heuristics.

In the case of the +0 workaround the merge join predicate is now on (SalesOrderID, Expr1004) = (SalesOrderID, Expr1002) so this is sufficient to prevent it trying to match an index with leading column ProductId. This does also rely on the semantics of GROUP BY guaranteeing that there is only one row in the upper input per SalesOrderID. Otherwise the extra opacity might cause SQL Server to conclude the merge join will be many to many so try other join types (as in the example now added to my example code above)

Trying to get the better performing execution plan structure by adding hints

ImplRestrRemap

The query rule ImplRestrRemap is used on the secondary query when the index is added as a result of it being estimated to be cheaper in query bucks than the merge join. This is more of a consequence of the problem. Why it has a lower subtree cost than using a merge join in your specific case is further down below.

You can find the use of this rule by checking the output tree by adding traceflags:

OPTION(
RECOMPILE, 
QUERYTRACEON 3604,
QUERYTRACEON 8607)

You can disable the rule by adding OPTION(QUERYRULEOFF ImplRestrRemap) to the query.

SELECT SalesOrderID, ProductId,SalesOrderDetailID, OrderQty
FROM   Sales.SalesOrderDetail sod
WHERE  ProductID = (SELECT AVG(ProductID)
                    FROM   Sales.SalesOrderDetail sod1
                    WHERE  sod.SalesOrderID = sod1.SalesOrderID
                    GROUP  BY sod1.SalesOrderID)
OPTION 
(
    QUERYRULEOFF ImplRestrRemap
);

This does give a hash join instead of a merge join and a higher estimated subtree cost than the spool plan. Another difference is the use of the two created nonclustered indexes instead of accessing one twice.

enter image description here

Merge Join

When trying to add the merge join back to the plan as well as disabling the rule you will see a sort added to the query plan:

OPTION 
(
    QUERYRULEOFF ImplRestrRemap, 
    MERGE JOIN
);

enter image description here

The estimated subtree cost is higher than the hash join, this hash join's estimated subtree cost is in turn higher than the spool plan. Explaining the choice of plan without the hints.

Once again it is still using the two created NC indexes.

Index hints

If we want to force the use of the same index for both table accesses to try and get the 'sample' plan, you could add index hints.

SELECT SalesOrderID, ProductId,SalesOrderDetailID, OrderQty
FROM   Sales.SalesOrderDetail sod WITH(INDEX(IX_SalesOrderID_ProductId))
WHERE  ProductID = (SELECT AVG(ProductID)
                    FROM   Sales.SalesOrderDetail  sod1 WITH(INDEX(IX_SalesOrderID_ProductId))
                    WHERE  sod.SalesOrderID = sod1.SalesOrderID
                    GROUP  BY sod1.SalesOrderID)
OPTION 
(
    QUERYRULEOFF ImplRestrRemap, 
    MERGE JOIN
);

This adds another sort operator:

enter image description here

Differences

Join direction

Because of the inner & outer side join columns of the merge join being swapped

No sort merge join execution plan

<InnerSideJoinColumns>
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod]" Column="SalesOrderID" />
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod]" Column="ProductID" />
</InnerSideJoinColumns>
<OuterSideJoinColumns>
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod1]" Column="SalesOrderID" />
  <ColumnReference Column="Expr1002" />
</OuterSideJoinColumns>

Double sort merge join execution plan

<InnerSideJoinColumns>
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod]" Column="ProductID" />
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod]" Column="SalesOrderID" />
</InnerSideJoinColumns>
<OuterSideJoinColumns>
  <ColumnReference Column="Expr1002" />
  <ColumnReference Database="[AdventureWorks2017]" Schema="[Sales]" Table="[SalesOrderDetail]" Alias="[sod1]" Column="SalesOrderID" />
</OuterSideJoinColumns>

In the sense that the better performing query plan can pass through the index key path of

--> SalesOrderId - SalesOrderId --> ProductId - ProductId (Agg)

Whereas the sorting query plan can only follow

--> ProductId (Agg) - ProductId --> SalesOrderId - SalesOrderId

And as expected due to the added sorts, the index scan is not ordered when using the double sort merge join plan

 <IndexScan Ordered="false" ForcedIndex="true" ForceSeek="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">

Trying to force the better performing query plan

When trying to force the correct plan, for example running the query after creating the indexes in the order that does not provide an issue and then capturing the plan xml:

drop index if exists IX_ProductId_SalesOrderID on Sales.SalesOrderDetail;
drop index if exists IX_SalesOrderID_ProductId on Sales.SalesOrderDetail;

CREATE NONCLUSTERED INDEX IX_ProductId_SalesOrderID ON Sales.SalesOrderDetail (ProductId,SalesOrderID) INCLUDE (SalesOrderDetailID, OrderQty);
CREATE NONCLUSTERED INDEX IX_SalesOrderID_ProductId ON Sales.SalesOrderDetail (SalesOrderID, ProductId) INCLUDE (SalesOrderDetailID, OrderQty);

And then using that plan XML together with recreating the indexes in a different order:

drop index if exists IX_ProductId_SalesOrderID on Sales.SalesOrderDetail;
drop index if exists IX_SalesOrderID_ProductId on Sales.SalesOrderDetail;
CREATE NONCLUSTERED INDEX IX_SalesOrderID_ProductId ON Sales.SalesOrderDetail (SalesOrderID, ProductId) INCLUDE (SalesOrderDetailID, OrderQty);
CREATE NONCLUSTERED INDEX IX_ProductId_SalesOrderID ON Sales.SalesOrderDetail (ProductId,SalesOrderID) INCLUDE (SalesOrderDetailID, OrderQty);

The USE PLAN HINT reports an error for the same query and same indexes:

Msg 8698, Level 16, State 0, Line 1 Query processor could not produce query plan because USE PLAN hint contains plan that could not be verified to be legal for query. Remove or replace USE PLAN hint. For best likelihood of successful plan forcing, verify that the plan provided in the USE PLAN hint is one generated automatically by SQL Server for the same query.

Showing that using the same sort order in the merge join is not even possible and it has to revert back to use the sort operators as a best option for the merge join.

The reverse of this, applying the spool plan with a USE PLAN hint to the query when it is able to use the better performing Merge join plan is in fact possible.

The points brought up in this Q/A give more information as to why this is happening.

Solution

Order by

Adding an order by as specified in the Q/A previously mentioned results in the correct execution plan being chosen, even when the index create order is different.

SELECT SalesOrderID, ProductId
FROM   Sales.SalesOrderDetail2 sod 
WHERE  ProductID = (SELECT AVG(ProductID)
                    FROM   Sales.SalesOrderDetail2  sod1  
                    WHERE  sod.SalesOrderID = sod1.SalesOrderID
                    GROUP  BY sod1.SalesOrderID
                    )
ORDER BY SalesOrderID;

enter image description here

Creating an Index that is not used by a (SELECT) query reduces performance of that query

Current State

Merge join refresher

How does it choose between `ORDER BY X,Y` and `ORDER BY Y,X`?

Another example of adding an index making the plan worse - concentrating on merge join

Why don't index hints work?

Addendum

Trying to get the better performing execution plan structure by adding hints

Differences

Solution

Tags:

Performance

Sql Server

Index

Related

Recent Posts

Creating an Index that is not used by a (SELECT) query reduces performance of that query

Current State

Merge join refresher

How does it choose between ORDER BY X,Y and ORDER BY Y,X?

Another example of adding an index making the plan worse - concentrating on merge join

Why don't index hints work?

Addendum

Trying to get the better performing execution plan structure by adding hints

Differences

Solution

Tags:

Performance

Sql Server

Index

Related

How does it choose between `ORDER BY X,Y` and `ORDER BY Y,X`?