Slower SQL Server performance after allocating more CPU and RAM

There's a lot going on here, and most of it is pretty broad and vague.

  1. 2008R2 RTM came out on April 21, 2010. It's totally out of support. You'll want to prioritize getting on the latest Service Pack, which came out just about 3 years ago to the day. That way you'll be covered if you're hitting a weird bug or something. Head on over here to figure out what you need to download.

  2. Since you added vCPUs (from 1 to 4) and didn't change any settings, your queries can now go parallel. I know this sounds like they'll all be faster, but hang on!

  3. You may have added RAM, but if you didn't raise Max Server Memory, SQL Server can't actually take advantage of it.
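As a sketch of what raising Max Server Memory looks like with sp_configure — the 28000 MB value here is purely illustrative; size it for your server and leave a few GB of headroom for the OS:

```sql
-- Max Server Memory is an advanced option, so expose those first
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Example value only: pick a number that leaves room for the OS
EXEC sys.sp_configure 'max server memory (MB)', 28000;
RECONFIGURE;
```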

  4. Figure out what your server is waiting on. An open source project I work on provides free scripts to help you measure your SQL Server. Head on over here if you wanna give them a try.

You'll wanna grab sp_BlitzFirst to check out your server's wait stats. You can run it a couple ways.

This will show you what your server has been waiting on since it started up.

EXEC dbo.sp_BlitzFirst @SinceStartup = 1;

This will show you what queries are waiting on now, during a 30 second window.

EXEC dbo.sp_BlitzFirst @Seconds = 30, @ExpertMode = 1;

Once you figure out what queries are waiting on (there's a ton of stuff written about wait stats out there), you can start making changes to get things under control.
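If you'd rather peek at the raw numbers sp_BlitzFirst is reading, they come from the sys.dm_os_wait_stats DMV. A minimal sketch (the excluded wait types here are just a small illustrative sample — the real list of benign waits is much longer, which is part of why the script is handy):

```sql
-- Top waits since the last restart (or since the stats were cleared)
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0 AS wait_time_seconds,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```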

If you see them waiting on CXPACKET, that means your queries are going parallel, and maybe trampling over each other. If you hit this, you'll probably want to consider bumping Cost Threshold for Parallelism up to 50, and maybe dropping MAXDOP down to 2.
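If you decide to try those numbers, here's a sketch of the change — 50 and 2 are the starting points suggested above, not magic values, so test against your own workload:

```sql
-- Both settings are advanced options
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Raise the bar for which queries are allowed to go parallel
EXEC sys.sp_configure 'cost threshold for parallelism', 50;

-- Cap how many cores a single parallel branch can use
EXEC sys.sp_configure 'max degree of parallelism', 2;
RECONFIGURE;
```

Both take effect immediately after RECONFIGURE, no restart needed.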

After this step is when you want to use something like sp_WhoIsActive or sp_BlitzWho (the latter is in the GitHub repo from earlier) to start capturing query plans. Aside from wait stats, they're one of the most important things you can look at to figure out what's going wrong.

You may also want to check out this article by Jonathan Kehayias about which VMware counters to watch in relation to SQL Server.

Update

Reviewing the wait stats, and boy are they weird. There's definitely something up with the CPUs. Your server is mostly sitting around bored, but when activity picks up, things get bad fast. I'll try to break this down simply.

  1. You're hitting a poison wait called THREADPOOL. You don't have a ton of it, but that makes sense because your server isn't terribly active. I'll explain why in a minute.

  2. You have really long average waits on SOS_SCHEDULER_YIELD and CXPACKET. You're on a VM, so you're gonna wanna make sure that the SQL Server has reservations, or that the box isn't horribly oversubscribed. A noisy neighbor can really ruin your day here. You're also going to want to make sure that the server/VM guest/VM host aren't running in Balanced Power mode. This makes your CPUs spin down to unnecessarily low speeds, and they don't immediately spin back up to full speed.

  3. How do they tie in? With 4 CPUs you have 512 worker threads. Keep in mind, you had the same amount with a single CPU, but now that your queries can go parallel, they can consume many more worker threads. In your case, that's 4 threads per parallel branch of a parallel query.
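You can verify that worker thread ceiling yourself — sys.dm_os_sys_info exposes the number SQL Server computed at startup:

```sql
-- Max worker threads as calculated from CPU count and architecture
SELECT max_workers_count
FROM sys.dm_os_sys_info;
```

On 64-bit with 4 or fewer logical processors, this returns 512 by default.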

What's going parallel? Most likely everything. The default Cost Threshold for Parallelism is 5. That number was made the default sometime in the late 90s working on a desktop that looked like this.

[image: a late-90s desktop PC]

Granted, your hardware is smaller than most laptops, but you're still a bit ahead of that thing.

When lots of parallel queries get going, you're running out of those worker threads. When that happens, queries just sit around waiting for threads to get going. That's also where SOS_SCHEDULER_YIELD comes in. Queries are stepping off CPUs and not getting back on for a long time. I don't see any blocking waits, so you're most likely just all stuffed up on intra-query parallelism waits.

What can you do?

  1. Make sure nothing is in Balanced Power mode
  2. Change MAXDOP to 2
  3. Change cost threshold for parallelism to 50
  4. Follow the Jon K. article above to validate VM health
  5. Use the script called sp_BlitzIndex to look for any missing index requests.
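For step 5, a typical sp_BlitzIndex call looks something like this — the database name is a placeholder for your own:

```sql
-- Review index diagnoses, including missing index requests,
-- for one database (swap in your own database name)
EXEC dbo.sp_BlitzIndex @DatabaseName = N'YourDatabase';
```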

For more thorough troubleshooting, check out the whitepaper I wrote for Google on hardware sizing in the cloud.

Hope this helps!


Yes! I've experienced this type of situation on SQL Server VMs in our server farm. Look at the VM host's CPU ready time and the memory balloon driver counters (see CPU READY TIME – BLOG PART I and Understanding VMware Ballooning). Working with my sysadmin was key, but it wasn't easy...


One thing I didn't see pointed out, is that adding vCPUs to a VM can very often slow it down due to scheduling.

The basic idea is that if a VM has 4 vCPUs, then the hypervisor must wait for 4 physical cores to be available so that it can schedule all of the vCPUs, even if 3 of them are idle.

If you don't have a lot of cores in your host, and your other workloads are busy, this can result in extra waiting, and a significant drop in performance.

In VMware ESXi you can see it in the advanced graphs via CPU Ready.

Here's one of many articles with a real world example of this happening and how it was diagnosed.

Adding more RAM can also cause a sudden performance drop if the VM's RAM allocation is larger than a NUMA node.

Additionally, the configuration of your vCPUs (vSockets vs. vCores) can actually affect some applications like SQL Server. This is because SQL Server is itself NUMA-aware (to avoid the same kind of NUMA-spanning performance drop) and because VMware may present virtual NUMA nodes differently.
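If you want to see how SQL Server perceives the NUMA layout the hypervisor is presenting, one way is to query sys.dm_os_nodes:

```sql
-- One row per NUMA node as SQL Server sees it;
-- the DAC node is filtered out since it's not a real workload node
SELECT node_id,
       node_state_desc,
       online_scheduler_count,
       memory_node_id
FROM sys.dm_os_nodes
WHERE node_state_desc <> N'ONLINE DAC';
```

If the scheduler and memory node counts don't line up with what you expect from the guest's vSocket/vCore layout, that's worth a conversation with your VM admin.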

This is covered in a blog post on VMware's own site.


This being said, I'm glad you worked out the issues with Erik's help, but you may want to look at and consider these things as well.