SQL Server Distributed Availability Group databases not syncing after a server reboot

Please note, this is not a definitive answer but it's the best answer after chatting with Taryn.

However, the primary was showing a very different story. It was reporting that the separate AG was syncing without any issues but the DAGs were in a Not Synchronizing / Not Healthy state.

If the individual databases and the AGs underlying the distributed AG say they are healthy and synchronizing, there is a good chance this is just a hiccup in the DMVs and/or SSMS dashboards, since there was nothing in the errorlog to suggest the replica didn't connect or was in a disconnected state.

Unfortunately since the issue has resolved, it's hard to say exactly what it was... but in the future if this occurs for someone:

Check sys.dm_hadr_database_replica_states on all clusters, looking for anything that isn't healthy. If everything shows healthy, it's possible the DMV hasn't updated yet.
If it's unhealthy, check the errorlog/DMVs for connectivity issues (such as not being able to connect to the forwarder/global primary).
Dan's answer mentions issues that could arise from database startup - though in this case the instance can't be read so that most likely wasn't an issue but could be in your case
If the database is readable, smoke test with a dummy table/insert or ...
Extended event session using the DEBUG channel items sqlserver.hadr_dump_log_block or sqlserver.hadr_apply_log_block to see if the secondary is actually receiving/applying the log blocks or ...
Perfmon object SQLServer:Database Replica\Log Bytes Received/sec
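As a rough sketch of the first check in the list above, the query below joins sys.dm_hadr_database_replica_states to the AG catalog views so the state of each database can be read per AG and per replica. The column choice and ordering are my own, not something from the original troubleshooting, and for a distributed AG the replica_server_name column holds the name of the underlying AG rather than a server name. Run it on every instance involved (global primary, forwarder, and their secondaries).

```sql
-- Run on each instance; is_distributed = 1 marks rows that belong to the DAG itself.
SELECT
    ag.name                      AS ag_name,
    ag.is_distributed,
    ar.replica_server_name,      -- for a DAG this is the underlying AG's name
    DB_NAME(drs.database_id)     AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.last_received_time,
    drs.last_hardened_lsn
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
ORDER BY drs.synchronization_health_desc, ag.name, ar.replica_server_name;
```

If last_received_time and last_hardened_lsn keep advancing between runs on the secondary side while the health columns still report a problem, that points towards stale reporting rather than stalled data movement.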

If you're receiving data on that secondary but the distributed AG still shows Not Synchronizing or Not Healthy, then I'd let it go for a bit to see if the DMV values change, since it's obviously receiving and processing log blocks.
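If Perfmon isn't handy, the same counter can be read from T-SQL. This is only a sketch: it assumes the counter is exposed through sys.dm_os_performance_counters under the Database Replica object, and the 10-second sample window is an arbitrary choice. Per-second counters in that DMV are stored as cumulative values, so two samples are taken and the difference is divided by the interval.

```sql
-- Run on the secondary (forwarder) that should be receiving log blocks.
DECLARE @first TABLE (instance_name nvarchar(128), cntr_value bigint);

-- First sample of the cumulative counter, one row per database.
INSERT INTO @first (instance_name, cntr_value)
SELECT RTRIM(instance_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Database Replica%'
  AND counter_name LIKE 'Log Bytes Received/sec%';

WAITFOR DELAY '00:00:10';   -- sample window (arbitrary)

-- Second sample; the delta over the window gives the actual rate.
SELECT
    RTRIM(pc.instance_name)             AS database_name,
    (pc.cntr_value - f.cntr_value) / 10 AS log_bytes_received_per_sec
FROM sys.dm_os_performance_counters AS pc
JOIN @first AS f
    ON f.instance_name = RTRIM(pc.instance_name)
WHERE pc.object_name LIKE '%:Database Replica%'
  AND pc.counter_name LIKE 'Log Bytes Received/sec%';
```

A non-zero rate while there is write activity on the global primary suggests the secondary is receiving log; a rate of zero on an idle system proves nothing, so pair it with the dummy insert mentioned above where possible.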

If, however, the secondary isn't receiving data, then further investigation is needed, which is out of scope for this answer.

I'll preface this all with the caveat that I do not have any DAGs in production. Fundamentally, though, this advice should apply to both AGs and DAGs.

Did the synchronization resume following the service restart? If so, my best guess as to the cause would be blocking on the redo SPID. If it's still not synchronizing even after the restart, here's what I'd be checking first:

Blocking of AG redo SPID

This is generally only going to occur on a readable secondary. To check, run the following:

select session_id, blocking_session_id, db_name(database_id), wait_type
from sys.dm_exec_requests
where command = 'DB STARTUP'

If any blocking SPIDs appear, you'll need to kill them before the secondary can resume (the DB STARTUP SPID is what handles the redo operations). I'd suggest reviewing the blocking SPID beforehand to try and determine the cause (usually a long running report).
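To review the blocker before killing it, something along these lines may help. It is a sketch rather than a definitive diagnostic: it assumes sys.dm_exec_input_buffer is available (it is on SQL Server 2016 and later), and the selected columns are just the ones I would look at first.

```sql
-- Show who is blocking the redo (DB STARTUP) session and what they last submitted.
SELECT
    r.session_id           AS redo_session_id,
    DB_NAME(r.database_id) AS database_name,
    r.wait_type,
    r.wait_time            AS wait_time_ms,
    s.session_id           AS blocking_session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    s.status,
    ib.event_info          AS blocking_statement
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.blocking_session_id
OUTER APPLY sys.dm_exec_input_buffer(s.session_id, NULL) AS ib
WHERE r.command = 'DB STARTUP'
  AND r.blocking_session_id <> 0;
```

The login, host and program name usually make it clear whether the blocker is a reporting tool or an ad hoc query, which helps decide whether a KILL is acceptable.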

If you want further information on this, there's a great article (including monitoring for this type of behaviour using XEs) here.

Check DMVs

If data movement is suspended, you can refer to DMVs to get more information on the suspend reason. Run the following:

select db_name(database_id), synchronization_state_desc, database_state_desc, suspend_reason_desc
from sys.dm_hadr_database_replica_states

The BOL article describes the suspend_reason a little further.

Tags:

Sql Server

Upgrade

Sql Server 2017

Availability Groups

Distributed Availability Groups