SQL Server Agent Jobs and Availability Groups

Within your SQL Server Agent job, add conditional logic that tests whether the current instance holds the role you are looking for in your availability group:

if (select
        ars.role_desc
    from sys.dm_hadr_availability_replica_states ars
    inner join sys.availability_groups ag
    on ars.group_id = ag.group_id
    where ag.name = 'YourAvailabilityGroupName'
    and ars.is_local = 1) = 'PRIMARY'
begin
    -- this server is the primary replica, do something here
end
else
begin
    -- this server is not the primary replica, (optional) do something here
end

All this does is pull the current role of the local replica. If it's in the PRIMARY role, the job does whatever it needs to do on the primary replica. The ELSE block is optional; it's there to handle any logic for the case where the local replica isn't primary.

Of course, change 'YourAvailabilityGroupName' in the above query to your actual availability group name.
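If you would rather test a single database than look up the group by name, the same check can be sketched with the built-in sys.fn_hadr_is_primary_replica function. This assumes SQL Server 2014 or later (the function does not exist in 2012), and 'YourDatabaseName' is a placeholder for a database in the availability group:

```sql
-- Assumption: SQL Server 2014+; 'YourDatabaseName' is a placeholder.
IF sys.fn_hadr_is_primary_replica(N'YourDatabaseName') = 1
BEGIN
    -- this server is the primary replica for that database, do something here
    PRINT 'Primary replica: running job logic.';
END
ELSE
BEGIN
    -- not the primary replica (or the database is not in an availability group)
    PRINT 'Not the primary replica: skipping.';
END
```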

Don't confuse availability groups with failover cluster instances. Whether the instance is the primary or secondary replica for a given availability group doesn't affect server-level objects, like SQL Server Agent jobs and so on.


Rather than doing this on a per-job basis (checking the state of the server in every job before deciding to continue), I've created a job that runs on both servers and checks what state the server is in.

  • If it's primary, enable any job that has a step targeting a database in the AG.
  • If the server is secondary, disable any job targeting a database in the AG.

This approach has a number of benefits:

  • It works on servers where there are no databases in an AG (or a mix of databases in and out of AGs).
  • Anyone can create a new job without having to worry about whether the database is in an AG (although they do have to remember to add the job to the other server).
  • Each job can have a failure email that remains useful (all your jobs have failure emails, right?).
  • When viewing the history of a job, you see whether the job actually ran and did something (on the primary), rather than a long list of successes that didn't run anything (on the secondary).

The script checks the database that each job step targets (the step's database field). If that database is in an availability group, the script takes action on the job.

This proc is executed every 15 minutes on each server. As an added bonus, it appends a comment to the job description to tell people why the job was disabled.

/*
    This proc goes through all SQL Server Agent jobs and finds any that refer to a database taking part in the availability group.
    It will then enable/disable the job depending on whether the server is the primary replica or not:
        Primary replica   = enable job
        Secondary replica = disable job
    It will also add a comment to the job indicating the job was updated by this proc.
*/
CREATE PROCEDURE dbo.sp_HADRAgentJobFailover (@AGname sysname = 'AG01')
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @SQL NVARCHAR(MAX);

    WITH DBinAG AS (  -- Finds all databases in the AG and determines whether jobs targeting them should be turned on (the answer is the same for every db in the AG)
        SELECT DISTINCT
             runJobs = CASE WHEN hars.role_desc = 'PRIMARY' THEN 1 ELSE 0 END   -- If this is the primary, then yes, we want to run the jobs
            ,dbname = db.name
            ,JobDescription = CASE WHEN hars.role_desc = 'PRIMARY'  -- Add the reason for changing the state to the job's description
                    THEN '~~~ [Enabled] using automated process (DBA_tools.dbo.sp_HADRAgentJobFailover) looking for jobs running against Primary Replica AG ~~~ '
                    ELSE '~~~ [Disabled] using automated process (DBA_tools.dbo.sp_HADRAgentJobFailover) because the job cannot run on READ-ONLY Replica AG ~~~ ' END
        FROM sys.dm_hadr_availability_replica_states hars
        INNER JOIN sys.availability_groups ag ON ag.group_id = hars.group_id
        INNER JOIN sys.databases db ON db.replica_id = hars.replica_id
        WHERE hars.is_local = 1
        AND ag.name = @AGname
    )
    SELECT @SQL = (
        -- Single quotes in job names and descriptions are doubled so the generated statements stay valid
        SELECT DISTINCT N'exec msdb..sp_update_job @job_name = N''' + REPLACE(j.name, '''', '''''') + N''', @enabled = ' + CAST(d.runJobs AS NVARCHAR(1))
                + N',@description = N'''
                + REPLACE(
                    CASE WHEN j.description = 'No description available.' THEN d.JobDescription -- If there is no description, just add our JobDescription
                         WHEN PATINDEX('%~~~%~~~', j.description) = 0 THEN j.description + '    ' + d.JobDescription  -- If our JobDescription is NOT there, add it
                         WHEN PATINDEX('%~~~%~~~', j.description) > 0 THEN SUBSTRING(j.description, 1, CHARINDEX('~~~', j.description) - 1) + d.JobDescription  -- Replace our part of the job description with what we are doing
                         ELSE d.JobDescription  -- Should never reach here...
                    END
                  , '''', '''''')
                + N''';'
        FROM msdb.dbo.sysjobs j
        INNER JOIN msdb.dbo.sysjobsteps s ON s.job_id = j.job_id
        INNER JOIN DBinAG d ON d.dbname = s.database_name
        WHERE j.enabled <> d.runJobs   -- Only update the job if its state needs to change
        FOR XML PATH (''), TYPE        -- TYPE + .value() keeps characters such as & and < from being XML-escaped
    ).value('.', 'NVARCHAR(MAX)');

    IF @SQL IS NOT NULL   -- Nothing to do when no job needs to change
    BEGIN
        PRINT REPLACE(@SQL, ';', CHAR(10));
        EXEC sys.sp_executesql @SQL;
    END
END

It's not foolproof, but for overnight loads and hourly jobs it gets the job done.

Even better than running this procedure on a schedule is to run it in response to an alert on error 1480 (the AG role change message).
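As a rough sketch of that alert-driven setup, a SQL Server Agent alert on message 1480 can start a job that calls the procedure. The alert name and job name below are made up for illustration, and the job is assumed to exist already with a single step that executes DBA_tools.dbo.sp_HADRAgentJobFailover:

```sql
-- Hypothetical names: adjust the alert name and job name to your environment.
-- Assumes the job already exists and that message 1480 is written to the
-- error log on your build (alerts only fire for logged messages).
EXEC msdb.dbo.sp_add_alert
    @name = N'AG role change (1480)',
    @message_id = 1480,
    @severity = 0,                      -- must be 0 when @message_id is specified
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,  -- 1 = include the event text in e-mail notifications
    @job_name = N'DBA - HADR Agent Job Failover';
```

Keeping the 15-minute schedule as well is a reasonable safety net in case an alert is missed.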


I'm aware of two concepts to accomplish this.

Prerequisite: Based on Thomas Stringer's answer, I created two functions in the master db of our two servers:

CREATE FUNCTION [dbo].[svf_AgReplicaState](@availability_group_name sysname)
RETURNS bit
AS
BEGIN

if EXISTS(
    SELECT        ag.name
    FROM            sys.dm_hadr_availability_replica_states AS ars INNER JOIN
                             sys.availability_groups AS ag ON ars.group_id = ag.group_id
    WHERE        (ars.is_local = 1) AND (ars.role_desc = 'PRIMARY') AND (ag.name = @availability_group_name))

    RETURN 1

RETURN 0

END
GO

CREATE FUNCTION [dbo].[svf_DbReplicaState](@database_name sysname)
RETURNS bit
AS
BEGIN

IF EXISTS(
    SELECT        adc.database_name
    FROM            sys.dm_hadr_availability_replica_states AS ars INNER JOIN
                             sys.availability_databases_cluster AS adc ON ars.group_id = adc.group_id
    WHERE        (ars.is_local = 1) AND (ars.role_desc = 'PRIMARY') AND (adc.database_name = @database_name))

    RETURN 1
RETURN 0

END

GO


  1. Make a job terminate if it's not executed on the primary replica

    For this case, every job on both servers needs either of the following two code snippets as Step 1:

    Check by group name:

    IF master.dbo.svf_AgReplicaState('my_group_name')=0
      raiserror ('This is not the primary replica.',16,1)  -- severity 11 or higher is needed for the job step to fail
    

    Check by database name:

    IF master.dbo.svf_DbReplicaState('my_db_name')=0
      raiserror ('This is not the primary replica.',16,1)  -- severity 11 or higher is needed for the job step to fail
    

    If you use this second one, beware of the system databases though - by definition they cannot be part of any availability group, so the check will always fail for those.

    Both of these work out of the box for admin users. For non-admin users, you have to grant extra permissions, for example:

    GRANT VIEW SERVER STATE TO [user];
    GRANT VIEW ANY DEFINITION TO [user];
    

    If you set the failure action of this first step to Quit job reporting success, you won't get a job log full of ugly red cross signs. For the main job they'll turn into yellow warning signs instead.

    In our experience, this is not ideal. We adopted this approach at first, but quickly found it hard to spot the jobs that actually had a problem, because all the secondary replica jobs cluttered the job log with warning messages.
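    The failure action can also be set from T-SQL rather than the job step dialog. A minimal sketch, assuming a job called 'my_job_name' (a placeholder) whose first step is the replica check:

    ```sql
    -- 'my_job_name' is a placeholder; step 1 is assumed to be the replica check.
    EXEC msdb.dbo.sp_update_jobstep
        @job_name = N'my_job_name',
        @step_id = 1,
        @on_fail_action = 1;   -- 1 = quit the job reporting success
    ```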

    What we then went for is:

  2. Proxy jobs

    If you adopt this concept, you'll actually need to create two jobs per task you want to perform. The first one is the "proxy job" that checks if it's being executed on the primary replica. If so, it starts the "worker job", if not, it just gracefully ends without cluttering the log with warning or error messages.

    While I personally don't like the idea of having two jobs per task on every server, I think it's definitely more maintainable, and you don't have to set the failure action of the step to Quit job reporting success, which is a bit awkward.

    For the jobs, we adopted a naming scheme. The proxy job is just called {put jobname here}. The worker job is called {put jobname here} worker. This makes it possible to automate starting the worker job from the proxy. To do so, I added the following procedure to both of the master dbs:

    CREATE procedure [dbo].[procStartWorkerJob](@jobId uniqueidentifier, @availabilityGroup sysname, @postfix sysname = ' worker') as
    declare @name sysname
    
    if dbo.svf_AgReplicaState(@availabilityGroup)=0
        print 'This is not the primary replica.'
    else begin
        SELECT @name = name FROM msdb.dbo.sysjobs where job_id = @jobId
    
        set @name = @name + @postfix
        if exists(select name from msdb.dbo.sysjobs where name = @name)
            exec msdb.dbo.sp_start_job @name
        else begin
            set @name = 'Job '''+@name+''' not found.'
            raiserror (@name ,2,1)
        end
    end
    GO
    

    This uses the svf_AgReplicaState function shown above. You could easily change it to check by database name instead by calling the other function.

    From within the only step of the proxy job, you call it like this:

    exec master.dbo.procStartWorkerJob $(ESCAPE_NONE(JOBID)), '{my_group_name}'
    

    This uses SQL Server Agent tokens to get the current job's id. The procedure then gets the current job name from msdb, appends " worker" to it and starts the worker job using sp_start_job.

    While this is still not ideal, it keeps the job logs tidier and more maintainable than the previous option. Also, you can always have the proxy job run as a sysadmin user, so no extra permissions are necessary.
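    For illustration, a proxy job could be scripted roughly like this. The job name 'Nightly load' and the step name are invented for the example, a job named 'Nightly load worker' is assumed to exist already, and the schedule is left out:

    ```sql
    -- Hypothetical names throughout; 'Nightly load worker' must already exist.
    EXEC msdb.dbo.sp_add_job
        @job_name = N'Nightly load';

    EXEC msdb.dbo.sp_add_jobstep
        @job_name = N'Nightly load',
        @step_name = N'Start worker if primary',
        @subsystem = N'TSQL',
        @database_name = N'master',
        -- The JOBID token is replaced by SQL Server Agent when the step runs
        @command = N'exec master.dbo.procStartWorkerJob $(ESCAPE_NONE(JOBID)), ''my_group_name''';

    EXEC msdb.dbo.sp_add_jobserver
        @job_name = N'Nightly load';   -- targets the local server by default
    ```

    The same script would be run on every replica, so the proxy and worker jobs exist under identical names on each server.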