Does CPU utilization affect the cost of foreign NUMA access?

A hefty question :-) I'll outline some of the factors involved. In any given context, these factors and others can vary and produce an interesting result.

Sorry I wasn't able to make this much shorter...

  1. Accumulated CPU ms vs logical IO
  2. SQL Server logical memory node alignment with physical NUMA nodes
  3. Spinlock contention in query workspace memory allocation
  4. Task assignment to schedulers
  5. Relevant data placement in the buffer pool
  6. Physical memory placement

  1. Accumulated CPU ms vs logical IO

    I often graph logical IO (in perfmon terminology, buffer pool "page lookups") against CPU utilization, in order to gauge the CPU efficiency of workloads and to look for spinlock-prone cases.
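    For reference, the same counter perfmon exposes can be pulled from inside SQL Server. A minimal sketch (the counter value is cumulative since instance startup, so two samples have to be differenced to get a rate):

    SELECT object_name,
           counter_name,
           cntr_value   -- cumulative count since startup; difference two samples for a rate
    FROM sys.dm_os_performance_counters
    WHERE counter_name = N'Page lookups/sec'
      AND object_name LIKE N'%Buffer Manager%';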

    But SQL Server accumulates CPU time with lots of activity other than page lookups and spinlocks:

    • Plans are compiled and re-compiled.
    • CLR code is executed.
    • Functions are performed.

    A lot of other activities will chew up significant CPU time without being reflected in the page lookups.

    In the workloads I observe, chief among these "non logical IO intensive but CPU-gobbling" activities is sorting/hashing activity.

    It stands to reason: consider a contrived example of two queries against a heap with no nonclustered indexes. The two queries return identical result sets, but one result set is completely unordered and the other is ordered by more than one of the selected columns. The second query would be expected to consume more CPU time, even though it references the same number of pages in the buffer pool.
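    A minimal sketch of that contrived pair (the table and columns are hypothetical); both statements touch the same pages, but the ORDER BY adds a sort that consumes CPU and workspace memory without adding any logical IO:

    -- hypothetical heap: no clustered or nonclustered indexes
    CREATE TABLE dbo.SalesStage (OrderID INT, CustomerID INT, OrderDate DATE, Amount MONEY);

    -- query 1: unordered result set - cost is dominated by the page lookups themselves
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.SalesStage;

    -- query 2: same pages referenced, but the multi-column ORDER BY adds a sort
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.SalesStage
    ORDER BY CustomerID, OrderDate;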

    More about workspace memory, and how much of granted workspace has been used, in these posts:

    • http://sql-sasquatch.blogspot.com/2015/08/sql-server-grantedreservedstolen_4.html

    • http://sql-sasquatch.blogspot.com/2015/08/sql-server-workspace-memory-with-twist.html

    • http://sql-sasquatch.blogspot.com/2015/03/resource-governor-to-restrict-max-query.html
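    A quick way to see granted versus used workspace memory for currently executing queries is sys.dm_exec_query_memory_grants; a minimal sketch:

    SELECT session_id,
           dop,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb,      -- memory currently stolen against the grant
           max_used_memory_kb
    FROM sys.dm_exec_query_memory_grants
    ORDER BY granted_memory_kb DESC;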


  2. SQL Server logical memory node alignment with physical NUMA nodes

    SQL Server (since incorporating its NUMA-aware strategies) by default creates a SQLOS memory node for each NUMA node on the server. As memory allocations grow, each allocation is controlled by one of the SQLOS memory nodes.

    Ideally, the SQLOS memory nodes are completely aligned with the physical NUMA nodes. That is to say, each SQLOS memory node contains memory from a single NUMA node, with no other SQLOS memory node also containing memory from that same NUMA node.

    However, that ideal situation is not always the case.

    The following CSS SQL Server Engineers blog post (also included in Kin's response) details behavior which can lead to persisting cross-NUMA node memory allocations for the SQLOS memory nodes. When this happens, the performance impact can be devastating.

    • http://blogs.msdn.com/b/psssql/archive/2012/12/13/how-it-works-sql-server-numa-local-foreign-and-away-memory-blocks.aspx

    There have been a few fixes for the particularly painful case of persistent cross-NUMA node references. There are probably others in addition to these two:

    • FIX: Performance problems occur in NUMA environments during foreign page processing in SQL Server 2012 or SQL Server 2014

    • FIX: SQL Server performance issues in NUMA environments
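    A rough way to check for the foreign-memory condition is sys.dm_os_memory_nodes; a minimal sketch, assuming SQL Server 2012 or later (where foreign_committed_kb is exposed). Ideally foreign_committed_kb stays at or near zero on every SQLOS memory node:

    SELECT memory_node_id,
           pages_kb,
           foreign_committed_kb
    FROM sys.dm_os_memory_nodes
    WHERE memory_node_id <> 64;   -- 64 is typically the DAC's dedicated node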


  3. Spinlock contention during allocation of workspace memory

    This is where it starts to get fun. I've already described that sort and hash work in workspace memory consumes CPU but is not reflected in the bpool lookup numbers.

    Spinlock contention is another layer to this particular fun. When memory is stolen from the buffer pool and allocated for use against a query memory grant, the allocation is serialized with a spinlock. By default, the protected resource is partitioned at the NUMA node level, so every query on the same NUMA node that uses workspace memory can potentially experience spinlock contention when stealing memory against its grant. Very important to note: this isn't a "one time per query" contention risk, as it would be if the point of contention were the grant itself. Rather, it's each time memory is stolen against the grant - so a query with a very large memory grant will have many opportunities for spinlock contention if it uses most of its grant.

    Trace flag 8048 does a great job of relieving this contention by further partitioning the resource at the core level.

    Microsoft says "consider trace flag 8048 if 8 or more cores per socket". But... it's not really about how many cores per socket (as long as there are multiple), but rather about how many opportunities for contention there are in the work being done on a single NUMA node.

    On the glued AMD processors (12 cores per socket, 2 NUMA nodes per socket) there were 6 cores per NUMA node. I saw a system with 4 of those CPUs (so eight NUMA nodes, 6 cores each) that was jammed up in a spinlock convoy until trace flag 8048 was enabled.

    I've seen this spinlock contention drag down performance on VMs as small as 4 vCPUs. Trace flag 8048 did what it was supposed to when enabled on those systems.

    Considering that there are still some 4-core, frequency-optimized CPUs out there, with the right workload they'd benefit from trace flag 8048 as well.

    CMEMTHREAD waits accompany the type of spinlock contention that trace flag 8048 relieves. But a word of caution: CMEMTHREAD waits are a corroborating symptom, not root cause for this particular issue. I've seen systems with high CMEMTHREAD "wait starts" where trace flag 8048 and/or 9024 were delayed in deployment because accumulated CMEMTHREAD wait time was fairly low. With spinlocks, accumulated wait time is usually the wrong thing to look at. Rather, you want to look at wasted CPU time - represented primarily by the spins themselves, secondarily by the associated waits which represent potentially unnecessary context switches.
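    A minimal sketch of how I'd corroborate it, using sys.dm_os_wait_stats and sys.dm_os_spinlock_stats (remembering that the spins, not the accumulated wait time, are where the CPU actually goes):

    -- CMEMTHREAD waits corroborate the symptom
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'CMEMTHREAD';

    -- the spins represent the wasted CPU; sort by them rather than by wait time
    SELECT TOP (10) name, collisions, spins, spins_per_collision, backoffs
    FROM sys.dm_os_spinlock_stats
    ORDER BY spins DESC;

    Note that trace flag 8048 is typically enabled as a startup parameter (-T8048), since the memory object partitioning it changes is established at instance startup.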

    • How It Works: CMemThread and Debugging Them

  4. Task assignment to schedulers

    On NUMA systems, connections are distributed to NUMA nodes (well - actually to the SQLOS scheduler groups associated with them) round-robin, assuming there aren't connection endpoints affinitized to particular NUMA nodes. If a session executes a parallel query, there is a strong preference to use workers from a single NUMA node. Hmmm... consider a 4 NUMA node server with default MAXDOP 0 and a complex query whose plan is broken into 4 parallel branches. With MAXDOP 0, the degree of parallelism can equal the total number of logical CPUs across all 4 nodes - so even if the query used only MAXDOP worker threads, placing them all on one node would mean 4 worker threads for each logical CPU on that NUMA node. But there are 4 branches in the complex plan - so each logical CPU on the NUMA node could have 16 workers on it - all for a single query!

    This is why sometimes you'll see one NUMA node working hard while others are loafing.

    There are a few other nuances to task assignment. But the main takeaway is that CPU busy won't necessarily be evenly distributed across the NUMA nodes. (It's also good to realize that bpool page inserts (reads or first-page-writes) will go into the bpool in the SQLOS memory node associated with the scheduler the worker is on, and stolen pages will preferentially come from the "local" SQLOS memory node, too.)
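    A quick way to see how evenly work is spread across the NUMA nodes at a moment in time is to aggregate sys.dm_os_schedulers by parent_node_id; a minimal sketch:

    SELECT parent_node_id,
           SUM(current_tasks_count)  AS current_tasks,
           SUM(runnable_tasks_count) AS runnable_tasks,
           SUM(active_workers_count) AS active_workers
    FROM sys.dm_os_schedulers
    WHERE [status] = N'VISIBLE ONLINE'
    GROUP BY parent_node_id
    ORDER BY parent_node_id;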

    I've found that bringing MAXDOP down from 0 to no more than 8 is helpful. Depending on the workload profile (primarily, in my opinion, on the number of concurrent, potentially long-running queries expected), going all the way to MAXDOP 2 may be warranted.

    Adjusting the cost threshold for parallelism may also be helpful. Systems I work on tend to be consumed with high-cost queries and rarely encounter a plan below 50 or 100, so I've had more traction by adjusting MAXDOP (often at the workload group level) than by adjusting the cost threshold.
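    For reference, a minimal sketch of both knobs - the instance-level settings via sp_configure, and a workload-group-level MAX_DOP via Resource Governor (the values and the group name are examples only, not recommendations):

    EXEC sys.sp_configure N'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure N'max degree of parallelism', 8;
    RECONFIGURE;
    EXEC sys.sp_configure N'cost threshold for parallelism', 50;
    RECONFIGURE;

    -- or scope MAXDOP to a hypothetical Resource Governor workload group
    ALTER WORKLOAD GROUP [rgReportingGroup] WITH (MAX_DOP = 2);
    ALTER RESOURCE GOVERNOR RECONFIGURE;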

    • SQL Server PLE with 8 NUMA nodes

      In this post, a combination of spinlock contention around workspace memory and uneven task distribution is discussed. See - these things really do all weave together :-)

    • 40 concurrent SQL Server parallel queries + (2 sockets * 10 cores per socket) = spinlock convoy


  5. Relevant data placement in the bpool

    This is the condition that I think is most intuitive when dealing with NUMA servers. It's also, most typically, not extremely significant to workload performance.

    What happens if the table is read into the bpool on NUMA node 3, and later a query running on NUMA node 4 scans the table, performing all of its bpool lookups across NUMA nodes?
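    If you want to see where a database's pages actually landed, sys.dm_os_buffer_descriptors exposes a numa_node column; a minimal sketch (the database name is hypothetical, and this DMV can be expensive to scan when the buffer pool is very large):

    SELECT numa_node,
           COUNT_BIG(*) * 8 / 1024 AS cached_mb
    FROM sys.dm_os_buffer_descriptors
    WHERE database_id = DB_ID(N'YourDatabase')
    GROUP BY numa_node
    ORDER BY numa_node;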

    Linchi Shea has a great post on this performance impact:

    • http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx

    Accessing memory across NUMA nodes incurs a small amount of additional memory latency. I'm sure there are some workloads that need to eliminate that additional base memory latency for optimal performance - it's not been an issue on the systems I work with.

    But - cross-node access also brings another point of transfer which can potentially saturate. If there is so much activity that memory bandwidth between NUMA nodes is saturated, memory latency between the nodes will increase. The same work will require additional CPU cycles.

    Again - I'm sure there are workloads such that memory bandwidth is a critical consideration. For my systems, though, the other considerations I am listing have been more significant.


  6. Physical memory placement

    This one is rare, but when it matters it really matters. On most servers, the installed memory will almost naturally be balanced across the NUMA nodes. But in some cases, special attention is needed to balance the memory across the nodes. Performance on some systems can be absolutely trashed if the memory is slotted in such a way that it's not balanced. This is set-it-and-forget-it, though. It's pretty rare to discover a problem like this after months of production service as opposed to after the first really busy day :-)


THE BIG FINISH!

Someone else made the point that poor plan choice, perhaps due to outdated stats, could result in the symptoms you've seen. That hasn't been the case in my experience. Poor plans can easily make a query take longer than expected - but usually because more logical IOs than necessary are being performed. Or due to spill to tempdb. Massive spill to tempdb should be evident when observing the server - and rather than high CPU one would expect measurable wait time for the spill-related disk writes.

Instead, if the situation you observed is NUMA-related, I'd expect it to be a combination of the factors enumerated above, mostly:

  1. use of workspace memory (which won't show up in logical IO counts)

  2. which may be cross-NUMA node due to a persistent foreign memory condition (if this is the case, look for relevant fixes)

  3. and which may incur spinlock contention within the NUMA node each time an allocation is made against a grant (fix with T8048)

  4. and may be performed by workers on logical CPUs overloaded by other parallel query workers (adjust MAXDOP and/or cost threshold for parallelism as necessary)


(Please update your question with the output of coreinfo -v (a Sysinternals utility) to give better context for your CPU/socket and NUMA distribution.)

We looked at overall CPU utilization which was around 60 percent. We did not look at socket specific CPU metrics. I/O metrics were average.

Seems to me that you are barking up the wrong tree. SQL Server is NUMA-aware. The performance penalty for cross-NUMA memory access is much smaller than you might think. You can also use this query to see how many NUMA nodes you have and which CPUs and cores are assigned to which NUMA node:

SELECT parent_node_id, scheduler_id, cpu_id
FROM sys.dm_os_schedulers WITH (NOLOCK) 
WHERE [status] = N'VISIBLE ONLINE';

Or just to see how many NUMA nodes there are:

SELECT COUNT(DISTINCT parent_node_id)
FROM sys.dm_os_schedulers
WHERE [status] = N'VISIBLE ONLINE'
    AND parent_node_id < 64;

We had queries with few logical reads taking more than 1 minute.

This normally happens when you have bad query plans generated due to outdated statistics. Make sure you have your stats updated and your indexes are defragmented properly.

Also, you need to set MAXDOP to a more sensible value to avoid worker thread starvation.

Set your cost threshold for parallelism away from the default of 5 to a good starting value like 45, then monitor that value and adjust it for your environment.

If you are running a lot of ad hoc queries, turn on (set to 1) optimize for ad hoc workloads to prevent plan cache bloat.
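For reference, a minimal sketch of enabling it (it is an advanced option):

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'optimize for ad hoc workloads', 1;
RECONFIGURE;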

Use with caution: you can use T8048 if you are running SQL Server 2008/2008 R2 on newer machines with more than 8 CPUs presented per NUMA node, and there is a hotfix if you are on SQL Server 2012 or 2014.

I highly recommend that you start collecting wait stats information for your database server instance.
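A minimal sketch of one way to start - snapshot the cumulative wait stats on a schedule and difference consecutive snapshots later (the table name is hypothetical):

IF OBJECT_ID(N'dbo.WaitStatsHistory') IS NULL
    CREATE TABLE dbo.WaitStatsHistory
    (
        capture_time        DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL
    );

INSERT dbo.WaitStatsHistory (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;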

Refer: How It Works: SQL Server (NUMA Local, Foreign and Away Memory Blocks)