Unexplained log shipping slowdown

This is bigger than a comment, but it's something to test first and then implement:

You are log shipping 800+ databases. That's a large number of databases to log ship every 15 minutes.

You should offload some databases to another server. IMHO 800 databases on a single server is a lot!

We had a similar problem with log shipping when we shipped 200+ databases from the NY region to the LD region.

Here is what we did:

  • There was blocking on writes to msdb.dbo.sysjobhistory, which are taken with (TABLOCKX). The TABLOCKX hint means access to sysjobhistory is always serialized, and since you have a lot of jobs running every 15 minutes, there will be contention (blocking). (See the sketch after the index scripts below for a quick way to confirm this.)

  • We enabled trace flag 1236, which introduces DATABASE lock partitioning. Partitioning the DATABASE lock keeps the depth of the lock list manageable in each local partition, which significantly optimizes the access path used to obtain a DATABASE lock.

  • We created indexes on the sysjobhistory and log_shipping_monitor_history_detail tables, as below:

    use msdb
    go

    create nonclustered index [nc_DBA_sysjobhistory] on dbo.sysjobhistory (
            [job_id]
            ,[step_id]
            ,[run_date]
            )
    include (
            [instance_id]
            ,[step_name]
            ,[sql_message_id]
            ,[sql_severity]
            ,[message]
            ,[run_status]
            ,[run_time]
            ,[run_duration]
            ,[retries_attempted]
            ,[server]
            )
    go

    use [msdb]
    go
    -- this index helps sys.sp_MSprocesslogshippingretentioncleanup, the proc that deletes
    -- from msdb.dbo.log_shipping_monitor_history_detail as part of log shipping cleanup
    create nonclustered index [nc_DBA_LogShipping_monitor_history_detail] on [dbo].[log_shipping_monitor_history_detail] (
            [agent_id] asc
            ,[agent_type] asc
            ,[log_time_utc] asc
            )
    go
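
To check whether both of these apply to your server, something like this works (the blocking check is just the standard DMVs; the trace flag is the one mentioned above):

    -- is anything currently blocked in msdb (e.g. on the sysjobhistory writes)?
    select r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           db_name(r.database_id) as database_name
    from sys.dm_exec_requests as r
    where r.blocking_session_id <> 0
      and r.database_id = db_id('msdb');

    -- is trace flag 1236 (DATABASE lock partitioning) already on?
    dbcc tracestatus (1236);

    -- if not, it can be enabled globally (or added as a -T1236 startup parameter):
    -- dbcc traceon (1236, -1);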
    

I loaded all the databases being shipped into this query:

DECLARE @path NVARCHAR(260);

-- locate the default trace file for this instance
SELECT 
   @path = REVERSE(SUBSTRING(REVERSE([path]), 
   CHARINDEX(CHAR(92), REVERSE([path])), 260)) + N'log.trc'
FROM    sys.traces
WHERE   is_default = 1;

-- pull the default-trace events for the log-shipped databases,
-- numbering each database's events in time order
SELECT *, rn = ROW_NUMBER() OVER 
  (PARTITION BY DatabaseName ORDER BY StartTime)
INTO #blat
FROM sys.fn_trace_gettable(@path, DEFAULT) 
WHERE DatabaseName IN (
  N'db1', N'db2' -- , ...
)
ORDER BY StartTime DESC; 

-- for each restore event (EventClass 115, EventSubClass 2), approximate its duration
-- as the gap to the next event recorded for the same database
SELECT b.DatabaseName, b.TextData, 
  ApproximateRestoreTime = DATEDIFF(MILLISECOND, b.StartTime, b2.StartTime)
FROM #blat AS b 
LEFT OUTER JOIN #blat AS b2
ON b.DatabaseName = b2.DatabaseName
AND b2.rn = b.rn + 1
WHERE b.EventClass = 115 AND b.EventSubClass = 2
ORDER BY b.StartTime DESC;

GO
DROP TABLE #blat;
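
Just as a sanity check, I edited the final SELECT to also return the start time, along these lines:

SELECT b.DatabaseName, 
  ApproximateRestoreTime = DATEDIFF(MILLISECOND, b.StartTime, b2.StartTime),
  b.StartTime
FROM #blat AS b 
LEFT OUTER JOIN #blat AS b2
ON b.DatabaseName = b2.DatabaseName
AND b2.rn = b.rn + 1
WHERE b.EventClass = 115 AND b.EventSubClass = 2
ORDER BY b.StartTime DESC;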

This gave me a slew of results like (ApproximateRestoreTime is in milliseconds):

DatabaseName    ApproximateRestoreTime  StartTime

DB1             2228166                 5/26/16 12:07 PM
DB1             370                     5/26/16 12:07 PM
DB1             373                     5/26/16 12:07 PM
DB1             366                     5/26/16 12:07 PM
DB1             383                     5/26/16 12:07 PM
DB1             350                     5/26/16 12:07 PM
DB1             350                     5/26/16 12:07 PM
DB1             1730                    5/26/16 12:07 PM
DB1             1726                    5/26/16 12:07 PM
DB1             426                     5/26/16 12:07 PM
DB1             1946                    5/26/16 12:07 PM

DB2             2237880                 5/26/16 12:07 PM
DB2             2420                    5/26/16 12:07 PM
DB2             2933                    5/26/16 12:07 PM
DB2             1723                    5/26/16 12:07 PM
DB2             360                     5/26/16 12:07 PM
DB2             353                     5/26/16 12:07 PM
DB2             1370                    5/26/16 12:07 PM
DB2             5433                    5/26/16 12:07 PM
DB2             400                     5/26/16 12:07 PM
DB2             436                     5/26/16 12:07 PM
DB2             856                     5/26/16 12:07 PM

DB3             2255540                 5/26/16 12:07 PM
DB3             2513                    5/26/16 12:07 PM
DB3             390                     5/26/16 12:07 PM
DB3             360                     5/26/16 12:07 PM
DB3             470                     5/26/16 12:07 PM
DB3             4830                    5/26/16 12:07 PM
DB3             1046                    5/26/16 12:07 PM
DB3             2753                    5/26/16 12:06 PM
DB3             373                     5/26/16 12:06 PM
DB3             933                     5/26/16 12:06 PM
DB3             813                     5/26/16 12:06 PM

DB4             2272020                 5/26/16 12:06 PM
DB4             2290                    5/26/16 12:06 PM
DB4             1936                    5/26/16 12:06 PM
DB4             353                     5/26/16 12:06 PM
DB4             353                     5/26/16 12:06 PM
DB4             360                     5/26/16 12:06 PM
DB4             393                     5/26/16 12:06 PM
DB4             4000                    5/26/16 12:06 PM
DB4             853                     5/26/16 12:06 PM
DB4             2133                    5/26/16 12:06 PM
DB4             346                     5/26/16 12:06 PM

That looks to me like the databases themselves are restoring just fine, and the slowdown is in writing this information to msdb. That, combined with the fact that sp_WhoIsActive showed a lot of slowish-looking statements like:

<?query --
(@param0 nvarchar(20), @param1 nvarchar(46), @param2 nvarchar(48), @param3 nvarchar(48), @param4 nvarchar(48), @param5 nvarchar(48), @param6 nvarchar(48), @param7 nvarchar(48), @param8 nvarchar(48), @param9 nvarchar(42), @param10 nvarchar(77), @param11 nvarchar(24), @param12 nvarchar(9), @param13 nvarchar(16), @param14 nvarchar(10), @param15 nvarchar(28), @param16 nvarchar(7), @param17 nvarchar(7), @param18 nvarchar(9), @param19 nvarchar(3), @param20 nvarchar(24), @param21 nvarchar(13), @param22 nvarchar(3), @param23 nvarchar(27), @param24 nvarchar(9), @param25 nvarchar(24), @param26 nvarchar(3), @param27 nvarchar(27))
declare @backup_set_id int
declare @restore_history_id int
select @backup_set_id = backup_set_id from msdb.dbo.backupset where backup_set_uuid = N'{4A3BE7C3-AF68-40D8-9261-35C781B6642C}'
if @backup_set_id is null begin
declare @media_set_id int
declare @media_count int
select @media_set_id = media_set_id from msdb.dbo.backupmediaset where media_uuid = N'{D603F5E0-6150-476A-A5B1-E7D2840AB05C}'
(etc., etc.)

made me think that yeah, msdb is my problem.
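
It also helped to narrow sp_WhoIsActive down to just msdb, with lock details included, to confirm the waits were on the history tables; something like:

-- show only activity against msdb, with lock information
EXEC sp_WhoIsActive
    @filter = 'msdb',
    @filter_type = 'database',
    @get_locks = 1;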

msdb only had a week's worth of history in it as of this morning, but with over 800 databases being restored every 15 minutes, the MDF was still 5 GB. I disabled the jobs, reduced the history retention to 3 days, and added some indexes. The job is now taking 4 or 5 minutes.
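
To see how much history msdb was actually holding, something along these lines works (same tables the cleanup and the indexes above touch; the 3-day cutoff matches what I set the retention to):

USE msdb;
GO

-- how far back does the log shipping monitor history go, and how many rows is that?
SELECT MIN(log_time_utc) AS oldest_entry, COUNT(*) AS total_rows
FROM dbo.log_shipping_monitor_history_detail;

-- same question for Agent job history (run_date is an int in YYYYMMDD form)
SELECT MIN(run_date) AS oldest_run_date, COUNT(*) AS total_rows
FROM dbo.sysjobhistory;

-- one way to trim the Agent job history down to three days
DECLARE @cutoff datetime = DATEADD(DAY, -3, GETDATE());
EXEC dbo.sp_purge_jobhistory @oldest_date = @cutoff;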

I'm still not sure why a system that had been running happily for over two years with four weeks of history now only runs properly with three days, but I suspect that my storage team is wrong and my storage is less than optimal. I've asked them to evaluate that.

In the meantime, if one of the servers this instance is protecting bluescreens, I won't be two and a half hours out of date!

The hosting provider now says that C: is fragmented. I've moved msdb to faster storage anyway, since it's the hardest working database on the system.
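
Moving msdb was just the standard system-database move: repoint the files and restart the instance. Roughly (the logical names MSDBData/MSDBLog are the defaults and the target paths below are placeholders; check sys.master_files for your own, and move the physical files while the instance is stopped):

-- check msdb's logical file names and current locations first
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID('msdb');

-- repoint the files at the faster volume (paths are placeholders)
ALTER DATABASE msdb MODIFY FILE (NAME = MSDBData, FILENAME = N'F:\FastStorage\MSDBData.mdf');
ALTER DATABASE msdb MODIFY FILE (NAME = MSDBLog,  FILENAME = N'F:\FastStorage\MSDBLog.ldf');

-- then stop the instance, move the mdf/ldf to the new paths, and start it back up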