Clustering vs. transactional replication vs. availability groups

The way I always like to visualize high availability solutions is the following:

SQL Server Failover Cluster Instance (FCI)

What is highly available? The entire instance. That includes all server objects (logins, SQL Server Agent jobs, etc.) as well as the databases and the objects they contain. It's a great solution when the whole SQL Server instance needs to be highly available, because the instance is the unit of failover with this solution.

What about reporting? None, NULL, nonexistent. A failover cluster instance has one active node delivering the cluster group containing the instance, the virtual network name (VNN), etc., and all other nodes are passive, sitting idle (as far as that cluster group is concerned) and waiting for a failover.

What happens when there is failover? The downtime for an FCI is determined by how long the passive node takes to grab the cluster resources and bring the SQL Server instance back to a running state. This is typically brief.
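If you want to verify which node is currently serving the instance (say, before and after a failover test), SERVERPROPERTY exposes the physical node name. A quick check:

```sql
-- Returns the physical node currently hosting the clustered instance;
-- run it again after a failover to confirm the new active node.
SELECT
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node,
    SERVERPROPERTY('IsClustered')                 AS is_clustered;
```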

Any client abstraction? Yes, this is built in through the virtual network name of the failover cluster instance, which always points to the active node currently delivering the SQL Server cluster resource.

AlwaysOn Availability Groups

What is highly available? The availability group is the logical unit of high availability here: it consists of a number of databases and a virtual network name (the listener, an optional cluster resource). It is worth noting that server objects such as logins and SQL Server Agent jobs will not be part of the HA solution, and special consideration needs to be taken to ensure that these are properly synchronized across replicas. Not an overly burdensome requirement, but it needs to be cared for; for example, logins must exist on every replica with the same SID, or database users will be orphaned after a failover (a minimal sketch of that follows below).
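A minimal sketch of keeping a login in sync, assuming a hypothetical login named app_login; the password and SID literal are placeholders:

```sql
-- On the current primary: capture the login's SID.
SELECT name, sid
FROM sys.sql_logins
WHERE name = N'app_login';   -- hypothetical login name

-- On each secondary replica: recreate the login with the same SID
-- so the database user maps correctly after a failover.
CREATE LOGIN [app_login]
    WITH PASSWORD = N'<same strong password>',
         SID      = 0x9A9DE4C8B48F4C439E8E1C58DE95BD71;  -- value captured above (placeholder)
```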

What about reporting? This is a great solution for reporting, although I probably wouldn't use a synchronous replica as my reporting instance. There are two commit relationships: synchronous and asynchronous. In my opinion, and from what I've seen in practice, your synchronous secondary replica is there waiting for a disaster; think of it as the replica that's ready to take a no-data-loss failover in the event of an issue. Asynchronous replicas, on the other hand, can handle the reporting workload: you aren't reserving them for failover, so reporting traffic can be pointed at them (either directly, or indirectly through read-only routing via the listener). A quick way to see which replica is which is shown below.
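The standard system views show at a glance which replicas are synchronous vs. asynchronous and whether they're healthy:

```sql
-- Commit mode, current role, and synchronization health per replica.
SELECT
    ar.replica_server_name,
    ar.availability_mode_desc,        -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
    ar.failover_mode_desc,            -- AUTOMATIC / MANUAL
    ars.role_desc,                    -- PRIMARY / SECONDARY
    ars.synchronization_health_desc   -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id;
```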

What happens when there is failover? For a synchronous-commit secondary replica paired with automatic failover, this will be the replica role state change from SECONDARY_NORMAL to PRIMARY_NORMAL. For automatic failover to happen, the synchronous secondary replica must currently be synchronized, and the Flexible Failover Policy determines when that failover should in fact occur. That policy is configurable.
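Tuning that policy is one statement per setting; a sketch, assuming an availability group named [AG1] (run on the primary):

```sql
-- Failure condition level: 1 (least sensitive) through 5 (most sensitive).
ALTER AVAILABILITY GROUP [AG1] SET (FAILURE_CONDITION_LEVEL = 3);

-- Allow 60 seconds for sp_server_diagnostics to respond before the
-- instance is declared unhealthy (value is in milliseconds).
ALTER AVAILABILITY GROUP [AG1] SET (HEALTH_CHECK_TIMEOUT = 60000);
```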

Any client abstraction? Yes, you can optionally configure an AlwaysOn Availability Group listener. This is basically just a virtual network name (visible in WSFC as a cluster resource in the AG's cluster group) that always points to the current primary replica. It's a key part of shifting your reporting workload around: you set up a read-only routing list on the primary for the servers that should receive read-only traffic, and clients declare their intent through the connection string (with the .NET Framework Data Provider for SQL Server, this is the ApplicationIntent parameter set to ReadOnly). You also need to set a read-only routing URL on each replica that should receive the reporting workload while in the secondary replica role.
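A minimal sketch of that configuration, assuming an AG named [AG1] with replicas SQLA (primary) and SQLB (reporting secondary); all names and URLs are placeholders:

```sql
-- Allow read-intent connections on the secondary and give it a routing URL.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLB' WITH (
    SECONDARY_ROLE (
        ALLOW_CONNECTIONS = READ_ONLY,
        READ_ONLY_ROUTING_URL = N'tcp://SQLB.contoso.com:1433'
    )
);

-- When SQLA holds the primary role, route read-only requests to SQLB.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLA' WITH (
    PRIMARY_ROLE (
        READ_ONLY_ROUTING_LIST = (N'SQLB')
    )
);

-- Client side, the connection string carries the intent, e.g.:
--   Server=tcp:AG1-Listener,1433;Database=Sales;ApplicationIntent=ReadOnly;
```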

Transactional Replication

What is highly available? This is arguable, but I'm going to say nothing. I don't see replication as a high availability solution whatsoever. Yes, data modifications are pushed to subscribers, but only at the publication/article level. That is a subset of the data (it could include all of the data, but nothing enforces that; e.g., if you create a new table in the publisher database, it will not automatically be pushed to the subscribers). As far as HA goes, this is bottom-of-the-barrel and I will not group it in there with a rock-solid HA solution.

What about reporting? A great solution for reporting on a subset of data, no question about that. If you have a 1 TB database that is highly transactional and you want to keep the reporting workload off the OLTP database, transactional replication is a great way to push a subset of data to a subscriber (or subscribers). If, out of that 1 TB, your reporting workload only touches about 50 GB, this is a smart solution, and it's relatively configurable to meet your business needs.
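Publishing only that 50 GB subset is exactly what the article model is for. A rough sketch using the replication stored procedures; the database, publication, and table names are hypothetical, and this assumes the distributor and agents are already configured:

```sql
USE SalesDB;

-- Enable the database for transactional publishing.
EXEC sp_replicationdboption
    @dbname  = N'SalesDB',
    @optname = N'publish',
    @value   = N'true';

-- Create the publication.
EXEC sp_addpublication
    @publication = N'SalesReporting',
    @status      = N'active';

-- Add only the tables the reporting workload needs; note that a new table
-- in the publisher database is NOT included until you explicitly add it here.
EXEC sp_addarticle
    @publication   = N'SalesReporting',
    @article       = N'Orders',
    @source_object = N'Orders',
    @source_owner  = N'dbo';
```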

Summary

What it boils down to is a handful of questions that need to be answered (partly by the business):

  1. What needs to be highly available?
  2. What does the SLA dictate for HA/DR?
  3. What kind of reporting will be taking place and what latencies are acceptable?
  4. What do we need to handle for geographically dispersed HA? (Storage replication is expensive, but a must with an FCI. AGs are built from standalone instances and don't require shared storage, and you can use a file share witness for quorum, potentially eliminating the need for shared storage entirely.)

two (or more) servers in a Windows Failover cluster, SQL Server as a clustered instance

  1. What kind of workload? "It depends" - but seriously, this is useful for an online application where you need local, in-data-center high availability. You are protected against the failure of one machine or one operating system. Logins, jobs, new databases, maintenance, etc. are all automatically kept in sync, because the two nodes are exactly the same and share the same storage, so they have the same system databases. Failover is very fast, but there is still a hiccup that looks like a SQL Server restart when it occurs.

  2. Cons/Concerns - The single point of failure is your storage and all of its components. SAN vendors always say "SANs don't fail," but there are a lot of moving parts in a storage area network and, as I blogged about here, they can. Also - you are paying for a secondary server that can't do anything but hang around and wait. That said, you can do an Active/Active/Multi-Node setup with two active instances that can fail over in either direction, putting the second node to use.

  3. Automatic Failover? The "most" automatic. No witness needed; it's a cluster, and making failover as seamless as possible is the cluster's job. With any of these options you will "feel" a failover, because SQL Server has to start up or connections have to repoint. Here it basically feels like a restart of SQL Server: databases come back online and run recovery, etc.

If I have a client say "I want to be fully up with all databases, all logins, etc." in a high availability environment in the local data center, because they have an incredibly low tolerance for downtime, I would consider a Failover Cluster Instance (though the last option you mention is a strong contender, save for some management overhead). I would probably do a local FCI plus an asynchronous AG secondary to protect against site failure or SAN failure.

two (or more) SQL Server instances that are kept up to date with transactional replication

  1. What kind of workload? I honestly wouldn't go here as a first choice for high availability or disaster recovery in most cases - not in SQL Server 2012, for sure. But basically this is good if you have to reach a data center that isn't close and you can't use an AG (maybe a domain issue prevents you from building the Windows cluster an AG requires), or you want to stay on SQL Server Standard Edition, which can do replication but not AGs, while still having the ability to read on the secondary side and stay asynchronous.
  2. Cons/Concerns - It's replication. It has overhead, it can get out of sync, and you can develop performance problems on the source side, etc.
  3. Automatic Failover - No. You have to manage it yourself, for instance through CNAMEs that point to one side or the other; you could theoretically write your own process to do this, but out of the box? Not here.

two (or more) SQL Servers in a SQL Server Availability Group, configured in a synchronous commit mode

This is what I've been helping people implement more and more lately, though sometimes I still go to clustering.

  1. What kind of workload? This is great when I have a manageable set of databases to keep in sync, plus the resources and time to make sure that jobs, logins, new databases, etc. stay in sync (though the team at SQLskills has built a great add-in to automate some of this for you, making it an even stronger option). I like this when I want to keep things completely separate. I am protected against hardware issues, OS issues, SQL Server installation issues, patching issues, and SAN/storage issues. I also get the benefit of an active secondary (if I pay for the Enterprise licensing) that I can read from, take backups on, etc. Plus, in the future I can add a third, asynchronous secondary at a remote site and have failover/DR.
  2. Cons/Concerns - Licensing, the maximum number of replicas, licensing costs to take advantage of one of the biggest benefits (the active secondary), the Enterprise Edition requirement, and needing twice as much storage as clustering.
  3. Automatic Failover - Yes. This can occur given a synchronized synchronous replica and healthy cluster quorum, and your app developers can connect to the listener instead of a node, so when failover happens the listener repoints and you should be good there. So yes, you can do that here - and should - but of course you should test it well (a quick test sketch follows below).
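A sketch of such a test, again assuming an AG named [AG1]: run a planned manual failover from the target secondary, then confirm the listener and roles followed.

```sql
-- Run ON THE SYNCHRONIZED SECONDARY you want to become primary.
-- This is a planned, no-data-loss failover; the listener repoints automatically.
ALTER AVAILABILITY GROUP [AG1] FAILOVER;

-- Afterwards, confirm the role change from either replica.
SELECT
    ar.replica_server_name,
    ars.role_desc
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id;
```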

Summary

HA and DR are different, and these technologies each provide pieces of one or both. High availability means (to me) that you can quickly recover if something bad happens to one machine: you stay up with a short Recovery Point Objective and Recovery Time Objective. That is clustering and a synchronous AG.

Disaster recovery is "you can get back up when you have a failure even in your HA solution." To me that can be AGs stretched to another data center, mirroring, or even replication.


It is also important to consider what is shared.

Failover Clustering uses two or more server nodes sharing one disk array. If the disk array goes down then you lose service, regardless of how many server nodes there are. If the server room where that disk array is located catches fire or floods then you lose service.

AlwaysOn Availability Groups and Database Mirroring are "shared nothing" clustering technologies. The database is present on multiple disk arrays in multiple servers. If you have good network links, the multiple servers can be in multiple server rooms, protecting you against fires and floods.