Long Running Query on Read-Only Replica that takes moments on the Primary

This answer is in addition to Joe's answer, as I can't be 100% certain it's the version store; however, there is enough evidence so far to suggest it's part of the issue.

When a secondary replica is marked as readable, a good steady state for versioning information first needs to be attained so that there is a known, good starting point for all read operations on the secondary. While this transition is pending and there are still open transactions on the primary, it will manifest as HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING; it's also a good indicator that the primary goes through quite a bit of data churn (or at least that someone has a really long open transaction, which also isn't good). The longer transactions stay open and the more data changes there are, the more versioning will occur.
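A rough way to check for this, assuming you can query both replicas, is something like the following sketch (standard DMVs, but treat it as a starting point rather than a full diagnostic):

    -- On the secondary: has anything waited on the versioning transition?
    SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING';

    -- On the primary: user transactions ordered by how long they've been open.
    SELECT stx.session_id,
           tx.transaction_id,
           tx.name,
           tx.transaction_begin_time
    FROM sys.dm_tran_active_transactions AS tx
    JOIN sys.dm_tran_session_transactions AS stx
        ON stx.transaction_id = tx.transaction_id
    WHERE stx.is_user_transaction = 1
    ORDER BY tx.transaction_begin_time ASC;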

Secondary replicas achieve readable status by using snapshot isolation under the covers for the session, even though if you check your session information you'll see it show up as the default read committed. Since snapshot isolation is optimistic and uses the version store, all changes need to be versioned. This is exacerbated when there are many running (and potentially long-running) queries on the secondary while data churn is high on the primary. Generally this manifests in only a few tables for an OLTP system, but it's completely application and workload dependent.
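A quick way to see what the session itself reports on the readable secondary (just a sketch; the level shown is the session's default, not the row-versioning behavior actually used under the covers):

    -- On the readable secondary: the isolation level reported for your session.
    -- 2 = read committed, 5 = snapshot.
    SELECT session_id, transaction_isolation_level
    FROM sys.dm_exec_sessions
    WHERE session_id = @@SPID;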

The version store itself is measured in generations, and when a query runs that requires the version store, the version record pointer is followed to that row's chain in TempDB. I say chain because it's a list of versions for that row, and the entire chain must be walked sequentially to find the proper version based on the starting timestamp of the transaction, so that the results are consistent with the data as of that point in time.

If the version store has many generations for these rows due to long-running transactions on the primary and secondary replicas, queries will take longer than average to run, generally in the form of higher CPU while all other items seemingly stay exactly the same, such as the execution plan, statistics, rows returned, etc. Walking the chain is an almost purely CPU-bound operation, so when the chains become very long and the number of rows returned is high, you get a (not quite linear, but it can be close) increase in query time.

The only thing that can be done is to limit the length of transactions on both the primary and the secondary, to make sure the version store isn't becoming too large in TempDB while holding many generations. Version store cleanup is attempted roughly once a minute; however, cleanup requires that every version in a generation no longer be needed before that generation can be removed, and newer generations can't be cleaned until the oldest version is no longer needed. Thus, a single long-running query can prevent many otherwise unused generations from being cleaned up.
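To spot the snapshot transactions that are holding cleanup back on the secondary, a sketch like this can help (what counts as "too long" is workload dependent; the chain-traversal columns also give a feel for how long the version chains being walked actually are):

    -- On the secondary: snapshot transactions ordered by how long they've run;
    -- the oldest ones are what hold version store cleanup back.
    SELECT session_id,
           transaction_id,
           transaction_sequence_num,
           elapsed_time_seconds,
           max_version_chain_traversed,
           average_version_chain_traversed
    FROM sys.dm_tran_active_snapshot_database_transactions
    ORDER BY elapsed_time_seconds DESC;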

Switching the replica out of readable mode and back will also clear out the version store, as it's no longer needed while the replica isn't readable.
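If you do go that route, toggling readability is just an ALTER on the availability group. The AG and replica names below are placeholders, and flipping ALLOW_CONNECTIONS to NO will of course kill read workloads on that secondary:

    -- Take the secondary out of readable mode (placeholder AG/replica names).
    ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SecondaryServer'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));

    -- And put it back.
    ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SecondaryServer'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));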

There are other items that could also be at play, but this seems the most plausible given the current data and the way the replica was reacting.

For further digging, the TempDB versioning DMVs (not to be confused with ADR versioning) are the place to look.
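For example (a sketch; availability of these DMVs depends on your SQL Server version):

    -- Overall version store footprint in TempDB, per database.
    SELECT database_id, reserved_page_count, reserved_space_kb
    FROM sys.dm_tran_version_store_space_usage;

    -- Which rowsets are generating the most version data.
    SELECT database_id, rowset_id, aggregated_record_length_in_bytes
    FROM sys.dm_tran_top_version_generators
    ORDER BY aggregated_record_length_in_bytes DESC;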


Disclaimer: I don't know anything about availability groups, but I do know a bit about troubleshooting queries that appear to use more CPU than they ought to.

You have a CPU problem in that you're using too much of it. One important thing to understand about waits is that almost all of them accrue while the worker is not busy on CPU. When a worker enters a waiting state it has yielded and is no longer running on the scheduler in SQLOS. So if you have a MAXDOP 1 query with the following run statistics:

CPU time = 55719 ms, elapsed time = 56335 ms.

You hit almost 99% CPU utilization for that query. Why should there be meaningful wait statistics for it? You might see some if you hit one of the CPU-busy waits, such as external or preemptive waits, but that's not guaranteed either. The bottom line is that wait statistics may not be that helpful here.
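For completeness, this is the kind of measurement I mean; the query and table name below are placeholders for your actual 45-second query:

    -- Capture CPU vs. elapsed time for the suspect query.
    SET STATISTICS TIME ON;

    SELECT COUNT(*)              -- placeholder for the real query
    FROM dbo.SomeLargeTable;     -- hypothetical table name

    SET STATISTICS TIME OFF;
    -- Messages tab: CPU time = 55719 ms, elapsed time = 56335 ms.
    -- 55719 / 56335 is roughly 0.989, i.e. the worker was on CPU for ~99% of the
    -- elapsed time, leaving little room for waits to explain the slowness.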

There are some things to check in rough order (the order depends on what you know about the environment):

  • Does the secondary server have any expensive monitoring going on, such as extended events, traces, or profiling? (A few quick checks for this and the wait/spinlock items are sketched after this list.)
  • Does the hardware of the secondary server roughly match the primary?
  • Are there any configuration or software issues with the secondary server?
  • Any significant waits or latches? Might not be applicable to your query but may still provide clues.
  • Any significant spinlocks?
  • Are there other DMVs or things that you can check within SQL Server that might give clues? You mentioned that Availability Groups are likely a key part of the problem.
  • What does ETW tracing tell you?
  • What kind of support agreements do you have?
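Some of the items above can be checked with quick queries like the sketch below; what counts as "expensive" or "significant" is environment dependent:

    -- Running Extended Events sessions and server-side traces (monitoring overhead).
    SELECT name, create_time FROM sys.dm_xe_sessions;
    SELECT id, path, is_default, status FROM sys.traces;

    -- Top waits since the last restart/clear (in practice you'd filter out the
    -- benign wait types; omitted here for brevity).
    SELECT TOP (20) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;

    -- Spinlock statistics.
    SELECT TOP (20) name, collisions, spins, backoffs
    FROM sys.dm_os_spinlock_stats
    ORDER BY spins DESC;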

Most of the above is well covered in various blog posts and documentation, but I'll expand on ETW tracing. If you want to know why SQL Server is using so much CPU for a particular query, and you have access to the host, you can always do ETW tracing to view call stacks and to see how much CPU various call stacks are consuming. In other words, the host OS is happy to tell you what the CPU is being used for if you know how to ask. Common tools for ETW tracing include Windows Performance Recorder and PerfView.

Making sense of those results requires deep internals knowledge, and it's easy to arrive at the wrong conclusion. In many cases it's best to collect the raw data and ask experts to look at it. When performing a trace, you want as little other activity as possible going on in SQL Server. Below are a few answers posted here that use ETW tracing to draw conclusions about SQL Server:

  • How to diagnose try_parse being very slow?
  • Why can it take up to 30 seconds to create a simple CCI rowgroup?
  • Does logical reads in RAM appears in wait stats or where?
  • Why does NOLOCK make a scan with variable assignment slower?
  • What is the algorithm behind the EXCEPT operator?
  • ALTER TABLE command with no default value taking forever?
  • Data implicitly ordered as if it were a clustered index; Can I take advantage of that?

I suspect that in your case if you're able to collect callstacks while the 45 second query runs you'll get some very helpful clues about the nature of the problem.


As the issue self-resolved, I'm left to speculate about its cause (rhyming not intentional). Based on Sean's post and the fact that an open Query Store transaction looks to have been the root cause of my increased version store size (i.e. the cause of the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING waits), I can only assume the Query Store had a part in the behavior that presented itself. This database is large (~6 TB), quite active, and the vast majority of the queries hitting it are generated at the client and not parameterized (i.e. ad hoc queries), so I don't believe the Query Store provides much value in this scenario. Because of that, we will be disabling the Query Store in this environment during a future maintenance window, after which I suspect we won't see this behavior again.
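For reference, checking the Query Store's state and turning it off is straightforward; the database name here is a placeholder:

    -- Check the Query Store's current state for the database.
    SELECT actual_state_desc, desired_state_desc, readonly_reason
    FROM [MyBigDatabase].sys.database_query_store_options;

    -- Disable the Query Store during a maintenance window.
    ALTER DATABASE [MyBigDatabase] SET QUERY_STORE = OFF;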

We did open a ticket with Microsoft, but timing wasn't in our favor, as the issue was resolved before we could do any detailed analysis via a PSSDIAG trace or the like. I'm hoping they'll be able to do some localized testing and replicate the issue in case this is a bug we ran up against. If any further updates on a more permanent resolution come to light, I'll be sure to update this answer as well.