SQL Server 2016 Enterprise poor performance

You gave us a long (and very detailed) question. Now you have to deal with a long answer. ;)

There are several things I would suggest changing on your server. But let's start with the most pressing issue.

One-time emergency measures:

The fact that performance was satisfactory right after deploying the indexes and has been slowly degrading since is a very strong hint that you need to start maintaining your statistics and (to a lesser degree) take care of index fragmentation.

As an emergency measure I would suggest a one-time manual stats update on all of your databases. You can get the necessary T-SQL by executing this script:

-- Builds (and prints) one sp_updatestats call per database, skipping
-- model and tempdb. Review the printed output, then run it yourself.
DECLARE @SQL VARCHAR(1000)
DECLARE @DB sysname

DECLARE curDB CURSOR FORWARD_ONLY STATIC FOR
   SELECT [name]
   FROM master..sysdatabases
   WHERE [name] NOT IN ('model', 'tempdb')
   ORDER BY [name]

OPEN curDB
FETCH NEXT FROM curDB INTO @DB
WHILE @@FETCH_STATUS = 0
   BEGIN
       -- PRINT rather than EXEC, so you can inspect the T-SQL first
       SELECT @SQL = 'USE [' + @DB + ']' + CHAR(13) + 'EXEC sp_updatestats' + CHAR(13)
       PRINT @SQL
       FETCH NEXT FROM curDB INTO @DB
   END

CLOSE curDB
DEALLOCATE curDB

It comes from Tim Ford's blog post on MSSQLTips.com, where he also explains why updating statistics matters.

Please note that this is a CPU- and IO-intensive task that should not be run during business hours.

If this solves your problem, please do not stop there!

Regular Maintenance:

Have a look at Ola Hallengren's Maintenance Solution and then set up at least these two jobs:

  • A statistics update job (if possible every night). You can use this CMD code in your Agent job step. This job has to be created from scratch; a sketch of creating it follows this list.

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d MSSYS -Q "EXECUTE dbo.IndexOptimize @Databases = 'USER_DATABASES', @FragmentationLow = NULL, @FragmentationMedium = NULL, @FragmentationHigh = NULL, @UpdateStatistics = 'ALL', @OnlyModifiedStatistics = 'Y', @MaxDOP = 0, @LogToTable = 'Y'" -b

  • An index maintenance job. I would suggest starting with a scheduled execution once a month. You can start with the defaults Ola provides for the IndexOptimize job.
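Since the statistics job is not created for you, here is a minimal sketch of setting it up with the SQL Server Agent stored procedures in msdb. The job name and the nightly 01:00 schedule are assumptions for illustration; MSSYS is taken from the command above as the database holding Ola's IndexOptimize procedure, so adjust both to your environment.

USE msdb;
GO
-- Hypothetical job name; pick your own convention.
EXEC dbo.sp_add_job
    @job_name = N'Nightly - Update Statistics';

-- CmdExec step carrying the sqlcmd call from above
-- (note the doubled single quotes inside the T-SQL string).
EXEC dbo.sp_add_jobstep
    @job_name  = N'Nightly - Update Statistics',
    @step_name = N'IndexOptimize - statistics only',
    @subsystem = N'CmdExec',
    @command   = N'sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d MSSYS -Q "EXECUTE dbo.IndexOptimize @Databases = ''USER_DATABASES'', @FragmentationLow = NULL, @FragmentationMedium = NULL, @FragmentationHigh = NULL, @UpdateStatistics = ''ALL'', @OnlyModifiedStatistics = ''Y'', @MaxDOP = 0, @LogToTable = ''Y''" -b';

-- Daily at 01:00 (assumed schedule; HHMMSS as an integer)
EXEC dbo.sp_add_schedule
    @schedule_name     = N'Nightly 0100',
    @freq_type         = 4,
    @freq_interval     = 1,
    @active_start_time = 010000;

EXEC dbo.sp_attach_schedule
    @job_name      = N'Nightly - Update Statistics',
    @schedule_name = N'Nightly 0100';

-- Register the job on the local server
EXEC dbo.sp_add_jobserver
    @job_name = N'Nightly - Update Statistics';
GO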

There are several reasons why I suggest running the statistics update as a separate job:

  • An index rebuild will only update the statistics of the columns covered by that index, while an index reorganization does not update statistics at all. Ola separates fragmentation into three categories; by default, only indexes in the high category will be rebuilt.
  • Statistics for columns not covered by an index will only be updated by the IndexOptimize job.
  • To mitigate the Ascending Key Problem.

SQL Server will auto-update statistics if the default is left enabled. The problem with that is the thresholds (less of a problem on your SQL Server 2016). Statistics get updated once a certain number of rows has changed (20% of the table in older versions of SQL Server). If you have large tables, that can be a lot of changes before statistics get updated. See more info on thresholds here.
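To see how stale your statistics actually are before the nightly job is in place, you can inspect the modification counters. A minimal sketch, run in the database you are investigating (sys.dm_db_stats_properties is available from SQL Server 2008 R2 SP2 onwards):

-- Statistics with the most changes since their last update
SELECT  OBJECT_NAME(s.[object_id]) AS table_name,
        s.[name]                   AS stats_name,
        sp.last_updated,
        sp.[rows],
        sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) AS sp
WHERE sp.modification_counter > 0
ORDER BY sp.modification_counter DESC;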

Since you are already doing CHECKDBs, as far as I can tell you can keep doing them as before, or use the maintenance solution for that as well.

For more information on index fragmentation and maintenance have a look at:

SQL Server Index Fragmentation Overview

Stop Worrying About SQL Server Fragmentation

Considering your storage subsystem, I would suggest not fixating too much on "external fragmentation", because the data is not stored in order on your SAN anyway.

Optimize your settings

The sp_Blitz script gives you an excellent list to start with.

Priority 20: File Configuration - TempDB on C Drive: Talk to your storage admin. Ask them whether your C drive is the fastest disk available to your SQL Server. If it is not, move your tempdb to the fastest one... period. Then check how many tempdb files you have. If the answer is one, fix that. If they are not the same size, fix that too.
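Moving and multiplying the tempdb files is plain ALTER DATABASE work (the move only takes effect after a service restart). A minimal sketch; the T:\ path, file names, and sizes are assumptions for illustration, but the files being equally sized is the point:

-- Move the existing tempdb data file to a faster drive (effective after restart)
ALTER DATABASE tempdb MODIFY FILE
    (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf', SIZE = 4096MB, FILEGROWTH = 512MB);

-- Add further equally sized data files, typically one per CPU core up to eight
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);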

Priority 50: Server Info - Instant File Initialization Not Enabled: Follow the link the sp_Blitz script gives you and enable IFI.
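Enabling IFI means granting the service account the "Perform volume maintenance tasks" policy and restarting the service. From SQL Server 2016 SP1 onwards (also 2012 SP4 / 2014 SP3) you can verify the result from T-SQL; a minimal check, assuming that patch level:

-- Shows whether instant file initialization is active for the engine service
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server (%';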

Priority 50: Reliability - Page Verification Not Optimal: You should set this back to the default (CHECKSUM). Follow the link the sp_Blitz script gives you and follow the instructions.
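Setting this back is one statement per affected database; the database name below is a placeholder. Note that existing pages only get stamped with a checksum the next time they are written:

-- Restore the default page verification (YourDatabase is a placeholder)
ALTER DATABASE [YourDatabase] SET PAGE_VERIFY CHECKSUM;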

Priority 100: Performance - Fill Factor Changed: Ask yourself why there are so many objects with a fill factor of 70. If you do not have an answer and no application vendor strictly demands it, set it back to 100%.

This basically means SQL Server will leave 30% empty space on these pages. So to store the same amount of data (compared to 100% full pages), your server has to read roughly 40% more pages, and they will take up correspondingly more space in memory. The reason it is often done is to prevent index fragmentation.

But again, your storage is saving those pages in different chunks anyway. So I would set it back to 100% and take it from there.
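To find the affected indexes and put them back to the default, something like the following works (run per database; the table and index names in the rebuild are hypothetical):

-- List indexes deviating from the default (0 and 100 both mean "full pages")
SELECT  OBJECT_NAME(i.[object_id]) AS table_name,
        i.[name]                   AS index_name,
        i.fill_factor
FROM sys.indexes AS i
WHERE i.fill_factor NOT IN (0, 100);

-- Hypothetical example: rebuild one index back to the default
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REBUILD WITH (FILLFACTOR = 100);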

What to do if everybody is happy:

  • See the rest of the output of sp_Blitz and decide whether to apply the changes it suggests.
  • Execute sp_BlitzIndex and have a look at the indexes you created, if they are used or where there might be an opportunity to add/change one.
  • Take a look at your Query Store data (as suggested by Peter). You can find an introduction here; a starter query follows this list.
  • Enjoy the rock-star life a DBA deserves. ;)
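As a starting point for digging into Query Store, here is a minimal sketch that lists the top CPU consumers of the last 24 hours. It assumes Query Store is already enabled on the database you run it in:

-- Top 10 queries by total CPU time (microseconds) over the last 24 hours
SELECT TOP (10)
       qt.query_sql_text,
       SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_time
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
    ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
    ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON p.plan_id = rs.plan_id
JOIN sys.query_store_runtime_stats_interval AS i
    ON rs.runtime_stats_interval_id = i.runtime_stats_interval_id
WHERE i.start_time > DATEADD(HOUR, -24, SYSUTCDATETIME())
GROUP BY qt.query_sql_text
ORDER BY total_cpu_time DESC;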

Not to disregard all your answers, which were very useful and which I have applied or will apply, but the biggest problem was not easy to find.

The problem got worse in the days after our last messages.

As we are cloud-based, neither I nor the company that manages the infrastructure and supports us has access to the physical hosts.

I started to wonder when I noticed that on some days the processor averaged 20% while on other days it was much higher, over 60%, even though the workload, while never exactly the same, is similar: the same number of people performing more or less the same kind of operations.

Earlier this week, users started to get stuck for several minutes, and the CPU was the only resource under pressure. I asked several users to log out (those consuming the most resources, though still nothing out of the ordinary) and shut down various services linked to the database, but in the end nothing changed. I asked the sysadmin who supports us, and who can communicate with our cloud provider, to remote into my machine to see what I was seeing and help me find something, because I could not get any further on my own.

The technician also did not find anything. He finally accepted that something else had to be causing this problem, and that was when he contacted the cloud provider. On their side they did not notice anything unusual, only that, because load balancing is configured between physical hosts, the VM running our SQL Server had been moved several times that day between hosts. Fortunately, I could tell our technician exactly when the problems began that day, which coincided with the last time the VM was moved, onto a physical host it did not leave for the rest of the day.

If the technician had not followed this problem so closely, this would have been just another of those occasions where we could talk to the cloud team, but when they looked at their performance samples they would find nothing, because once again the cloud side only saw samples with CPU on the order of 40-50%, when in fact it averaged above 80% and was often stuck at 100%.

Now the machine is pinned to a single physical host (not moving between hosts), and although we have not yet achieved perfect performance, everyone is working and giving much more positive feedback, because the average CPU is about 20% with all our users and services.

In the meantime, we also put tempdb on another disk (it was on the operating system disk) and increased the number of files to better match the number of CPU cores.

The number of cores was also adjusted based on the recommendations of sp_Blitz.

There was also an automatic routine that ran all day because it was based on an old date ... Since it had not finished by the morning when we arrived, and we had no way to check whether it was still running, I started it again manually, so two instances were probably running at the same time. We have changed the date to reduce the time it takes, and it now runs late at night. But this was not the solution, as it had already been fixed before many of the problems we had, such as the one described here.

We also managed to get the ERP assistant to schedule a meeting with the manufacturer, so we will show them our system and ask for suggestions, as well as clarify some doubts, since some recommendations in their training videos run contrary to most other recommendations, including Microsoft's own, such as turning Priority Boost on and setting a fill factor of 70%.

Since the application also has a maintenance screen, I will look into how often that maintenance needs to run, and what remains to be done outside the application. My idea is to use Ola Hallengren's plans.

I believe that Thomas Kronawitter's answer is absolutely correct, and I am applying it. However, I think this description can be important to other people who, after following all the good practices, still cannot fix the problem, because it may lie in the physical hosts. Thanks, Thomas.