What to do when your Always On cluster loses quorum?

Availability groups (AGs) are built on Windows Server Failover Clustering (WSFC), so the WSFC procedures for quorum loss apply:

  • WSFC Disaster Recovery through Forced Quorum
  • Force a WSFC Cluster to Start Without a Quorum

Once the WSFC is running, you can then force failover of the AG, if needed. See Perform a Forced Manual Failover of an Availability Group:

After forcing quorum on the WSFC cluster (forced quorum), you need to force failover of each availability group (with possible data loss). Forcing failover is required because the real state of the WSFC cluster values might have been lost. However, you can avoid data loss if you are able to force failover on the server instance that was hosting the primary replica before you forced quorum, or to a secondary replica that was synchronized before you forced quorum. For more information, see Potential Ways to Avoid Data Loss After Quorum is Forced.
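Because the forced failover has to be repeated for each availability group, it can help to generate the statements up front. The following is only a minimal sketch: the function name and the AG names are hypothetical, and it assumes you run the resulting T-SQL yourself on the target replica. The statement it emits is the `ALTER AVAILABILITY GROUP ... FORCE_FAILOVER_ALLOW_DATA_LOSS` form, which can lose data, so review each one before running it.

```python
def force_failover_statements(ag_names):
    """Build one forced-failover T-SQL statement per availability group.

    Hypothetical helper: it only builds strings and does not connect
    to SQL Server or execute anything.
    """
    statements = []
    for name in ag_names:
        # Bracket-quote the name; a literal ']' is escaped by doubling it.
        quoted = "[" + name.replace("]", "]]") + "]"
        statements.append(
            f"ALTER AVAILABILITY GROUP {quoted} FORCE_FAILOVER_ALLOW_DATA_LOSS;"
        )
    return statements


if __name__ == "__main__":
    # "SalesAG" and "ReportingAG" are made-up names for illustration.
    for stmt in force_failover_statements(["SalesAG", "ReportingAG"]):
        print(stmt)
```

Run each statement on the instance hosting the replica you want to become primary, ideally the former primary or a secondary that was synchronized, as the quoted passage recommends.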


What to do when your AlwaysOn cluster loses quorum?

I have been in this situation, especially with multi-subnet clustering spanning different countries (NY-LD-HK).

How to avoid quorum loss in a multi-subnet cluster?

  • Relax the cluster's default monitoring, especially the cluster heartbeat settings, by raising the CrossSubnetDelay or CrossSubnetThreshold property, as described in this hotfix.
  • AGs use WSFC, which in turn uses a quorum-based approach to determine cluster health. Make sure you choose and configure the quorum properly. This blog post dives deeper into quorum vote configuration for Always On.
  • Things change in Windows Server 2016 with the introduction of site-aware clusters and cloud witness.

    Nodes in stretched clusters can now be grouped based on their physical location (site). Cluster site-awareness enhances key operations during the cluster lifecycle such as failover behavior, placement policies, heartbeating between the nodes and quorum behavior.

    Cloud Witness is a new type of Failover Cluster quorum witness that leverages Microsoft Azure as the arbitration point. It uses Microsoft Azure Blob Storage to read/write a blob file which is then used as an arbitration point in case of split-brain resolution.
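To make the quorum-vote point concrete, here is a rough sketch of the majority arithmetic. It assumes a plain "more than half of all votes" rule and ignores dynamic quorum and witness tie-breaking behavior; the per-site vote counts are hypothetical and only loosely modeled on the NY-LD-HK layout mentioned earlier.

```python
def has_quorum(online_votes, total_votes):
    """Return True when strictly more than half of all votes are online."""
    return 2 * online_votes > total_votes


def partition_has_quorum(votes_by_site, reachable_sites):
    """Check whether the sites that can still see each other hold a majority.

    votes_by_site:   mapping of site name -> number of votes at that site
    reachable_sites: the sites in the surviving partition
    """
    total = sum(votes_by_site.values())
    online = sum(votes_by_site[site] for site in reachable_sites)
    return has_quorum(online, total)


if __name__ == "__main__":
    # Hypothetical vote assignment: 5 votes in total.
    votes = {"NY": 2, "LD": 2, "HK": 1}
    print(partition_has_quorum(votes, ["LD", "HK"]))  # NY lost: 3 of 5 votes
    print(partition_has_quorum(votes, ["LD"]))        # LD isolated: 2 of 5 votes
```

With an odd total number of votes, any clean two-way split leaves exactly one side with a majority, which is why a witness is commonly added when the node count is even.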

What to do when quorum is lost?

  • If the cluster goes down due to an unplanned outage or disaster, manual intervention is required. Either a Windows admin or a cluster admin has to manually force the quorum (linking back to @Remus's answer, as that covers this point) and bring the surviving nodes online.

As always, for a Root Cause Analysis (RCA), gather your Windows cluster logs; for an Always On RCA, use the SQL Server Failover Cluster Diagnostic Logs. These files, found in the SQL Server Log directory, are named in the following format: <HOSTNAME>_<INSTANCENAME>_SQLDIAG_X_XXXXXXXXX.xel.
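When collecting these files from several nodes, it can be handy to split the file name back into its parts. The sketch below is an illustration only: it assumes the host name contains no underscore and that the two trailing `X` fields are numeric, neither of which is stated above, and the function name and sample file names are made up.

```python
import re

# Pattern for <HOSTNAME>_<INSTANCENAME>_SQLDIAG_X_XXXXXXXXX.xel
# Assumptions: no underscore in the host name; both X fields are digits.
SQLDIAG_NAME = re.compile(
    r"(?P<host>[^_]+)_(?P<instance>.+)_SQLDIAG_(?P<sequence>\d+)_(?P<stamp>\d+)\.xel"
)


def parse_sqldiag_name(filename):
    """Return the name parts as a dict of strings, or None if no match."""
    match = SQLDIAG_NAME.fullmatch(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Sample names are invented for illustration.
    for name in [
        "SQLNODE1_MSSQLSERVER_SQLDIAG_0_130012345678900000.xel",
        "ERRORLOG",
    ]:
        print(name, "->", parse_sqldiag_name(name))
```

Grouping the parsed results by host and instance makes it easier to line up the diagnostic logs from each node against the cluster log timeline.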