Enabling NUMA for Intel Core i7

For 64-bit this is recommended if the system is Intel Core i7
(or later), AMD Opteron, or EM64T NUMA.

First, note that Intel Core i7 is just a marketing designation, and the phrase Intel Core i7 (or later) is very vague. So what could it mean?

The Linux kernel Kconfig help text edits that mention an Intel Core 7i, later corrected to Intel Core i7, were made in November 2008. The commit log reads:

x86: update CONFIG_NUMA description
Impact: clarify/update CONFIG_NUMA text

CONFIG_NUMA description talk about a bit old thing.
So, following changes are better.

 o CONFIG_NUMA is no longer EXPERIMENTAL

 o Opteron is not the only processor of NUMA topology on x86_64 no longer,
   but also Intel Core7i has it.

It can reasonably only refer to the Intel Core i7 CPUs released or announced by that time: the Bloomfield processors, based on the Nehalem microarchitecture. Nehalem moved the memory controller from the Northbridge onto the CPU (something AMD had already done in 2003 with the Opteron/AMD64) and introduced QuickPath Interconnect (QPI), the counterpart to AMD's HyperTransport, for CPU-to-CPU and CPU-to-IOH (IO hub, the former Northbridge) interconnection.

The Bloomfield i7 CPUs were the first entries in the new Core i{3,5,7} naming scheme. So when that Linux doc text was written, i7 didn't specifically refer to the Core i7 as opposed to the i5 (first released 09/2009) or i3 (first released 01/2010), but in all likelihood to the new Nehalem microarchitecture with its integrated memory controller and QPI.

There's an Intel press release from 11/2008 on the i7 (Intel Launches Fastest Processor on the Planet) that states that the Core i7 processor more than doubles the memory bandwidth of previous Intel "Extreme" platforms, but doesn't mention NUMA at all.

The reason is, I think, that NUMA doesn't matter for desktop PCs, not even for "extreme" ones.

NUMA matters for expensive servers that have several CPU sockets (not just several cores on one socket) with dedicated physical memory access lanes (not just one memory controller), so that each CPU has its dedicated local memory, which is "closer" to it than the memory of the other CPUs. (Think 8 sockets, 64 cores, 256 GB RAM.) NUMA means that a CPU can also access remote memory (the local memory of another CPU) in addition to its own local memory, albeit at a higher cost. NUMA is the synthesis of a shared memory architecture like SMP, where all memory is equally available to all cores, and a distributed memory architecture like MPP (Massively Parallel Processing), that gives each node a dedicated block of memory. It is MPP, but it looks like SMP to the application.
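
To make the local/remote distinction concrete, here is a minimal sketch using libnuma, the userspace library that ships with numactl (assumption: the libnuma development headers are installed; build with gcc numa_sketch.c -lnuma, where the file name is made up). On a single-node desktop it just reports node 0 and a distance of 10 (local).

/* Minimal libnuma sketch: allocate memory locally vs. on a specific node. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system/kernel\n");
        return EXIT_FAILURE;
    }

    int last = numa_max_node();          /* highest node number; 0 on a single-socket box */
    size_t len = 64UL * 1024 * 1024;     /* 64 MiB */

    /* Local memory: placed on the node of the CPU this thread currently runs on. */
    void *local = numa_alloc_local(len);

    /* Explicitly placed on the last node; on a multi-socket machine this can be
     * another CPU's local memory, reached over QPI/HyperTransport. */
    void *remote = numa_alloc_onnode(len, last);

    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    /* numa_distance() reports the relative access cost; 10 means local. */
    printf("nodes 0..%d, distance(0,%d) = %d\n", last, last, numa_distance(0, last));

    numa_free(local, len);
    numa_free(remote, len);
    return EXIT_SUCCESS;
}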

Desktop motherboards don't have dual sockets, and Intel desktop CPUs, including the Extreme i7 editions, lack the additional QPI link needed for a dual-socket configuration.

Check the Wikipedia QPI article to see how QPI is relevant to NUMA:

In its simplest form on a single-processor motherboard, a single QPI is used to connect the processor to the IO Hub (e.g., to connect an Intel Core i7 to an X58). In more complex instances of the architecture, separate QPI link pairs connect one or more processors and one or more IO hubs or routing hubs in a network on the motherboard, allowing all of the components to access other components via the network. As with HyperTransport, the QuickPath Architecture assumes that the processors will have integrated memory controllers, and enables a non-uniform memory access (NUMA) architecture.

[…]

Although some high-end Core i7 processors expose QPI, other "mainstream" Nehalem desktop and mobile processors intended for single-socket boards (e.g. LGA 1156 Core i3, Core i5, and other Core i7 processors from the Lynnfield/Clarksfield and successor families) do not expose QPI externally, because these processors are not intended to participate in multi-socket systems. However, QPI is used internally on these chips […]

The way an Intel Nehalem CPU on a multi-socket server board accesses non-local memory is via QPI. The Wikipedia article on NUMA also states:

Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs. Both CPU families share a common chipset; the interconnection is called Intel Quick Path Interconnect (QPI). AMD implemented NUMA with its Opteron processor (2003), using HyperTransport.

Check this report from 11/2008 to see that Intel disabled one of the two QPI links on the i7, thus ruling out the dual-socket configuration where NUMA applies:

This first, high-end desktop implementation of Nehalem is code-named Bloomfield, and it's essentially the same silicon that should go into two-socket servers eventually. As a result, Bloomfield chips come with two QPI links onboard, as the die shot above indicates. However, the second QPI link is unused. In 2P servers based on this architecture, that second interconnect will link the two sockets, and over it, the CPUs will share cache coherency messages (using a new protocol) and data (since the memory subsystem will be NUMA)—again, very similar to the Opteron.

So I've been straying from your question while relating my research results … You're asking why the Linux docs started to recommend turning it on in late 2008? I'm not sure this question has a provably correct answer; we would have to ask the doc writer. Turning NUMA on doesn't benefit desktop CPU users, but it doesn't significantly hurt them either, while it does help multi-socket users, so why not? That could have been the rationale. I found it reflected in a discussion about disabling NUMA on the Arch Linux tracker (FS#31187 - [linux] - disable NUMA from config files).

The doc author might also have thought of the NUMA potential of the Nehalem architecture of which, when the doc was written, the 11/2008 Core i7 processors (920, 940, 965) were the only representatives; the first Nehalem chips for which NUMA would have really made sense are probably the Q1/2009 Xeon processors with dual QPI link such as the Xeon E5520.


I think this picture explains enough:

                  [image: diagram of sockets/NUMA nodes, each containing cores with their local memory]

  • socket or NUMA node is a collection of cores with local access to memory. Each socket contains one or more cores. Note that this does not necessarily refer to a physical socket, but rather to the memory architecture of the machine, which will depend on your chip vendor.

  • processor core (cpu core, logical processor) refers to a single processing unit capable of performing computations.

So the above indicates that you would need multiple sockets (physical processors), not just multiple cores, in the machine to leverage the NUMA architecture.
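
As a quick check of how many nodes your machine actually exposes, here is another small libnuma sketch (hypothetical file name topo.c; build with gcc topo.c -lnuma). On a single-socket desktop it prints one node holding all CPUs.

/* Print the NUMA node of every configured CPU. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("kernel reports no NUMA support\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();   /* 1 on a typical desktop */
    int cpus  = numa_num_configured_cpus();
    printf("%d node(s), %d cpu(s)\n", nodes, cpus);

    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %2d -> node %d\n", cpu, numa_node_of_cpu(cpu));

    return 0;
}

The numactl --hardware command prints the same topology (plus per-node memory sizes and the distance matrix) without writing any code.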

You can have NUMA support compiled into the kernel and run it on a single-processor machine. It's similar to SMP support: that is compiled in as well, but when the kernel detects a single processor in the system it will not use it (it disables it). The same holds for NUMA. You can check the dmesg kernel ring buffer or the /var/log/dmesg file for related messages:

NUMA - single processor (or NUMA disabled) vs. multi-processor:

No NUMA configuration found
NUMA: Allocated memnodemap from b000 - b440

SMP - single processor vs. multi-processor:

SMP: Allowing 1 CPUs, 0 hotplug CPUs
SMP: Allowing 32 CPUs, 0 hotplug CPUs
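
If you prefer not to dig through dmesg, the same information is visible under sysfs; the sketch below simply counts the nodeN directories the running kernel created (assumption: the standard /sys/devices/system/node layout, which is only populated when the kernel was built with NUMA support).

/* Count NUMA nodes exposed by the running kernel via sysfs. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *d = opendir("/sys/devices/system/node");
    if (!d) {
        printf("no NUMA node directory in sysfs (kernel built without CONFIG_NUMA?)\n");
        return 1;
    }

    int nodes = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (strncmp(e->d_name, "node", 4) == 0 && isdigit((unsigned char)e->d_name[4]))
            nodes++;
    closedir(d);

    printf("kernel exposes %d NUMA node(s)\n", nodes);  /* 1 on a single-socket machine */
    return 0;
}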

References

  • NUMA (Non-Uniform Memory Access): An Overview

I've been researching the same thing for my desktop PC while building my own kernel. After much research, I decided to disable NUMA. My CPU is a Core i7-3820, which has 4 cores (8 logical processors with HT). This page helped me come to my decision:

disable NUMA from config files

In summary, NUMA is only worthwhile if you have more than one CPU socket (regardless of core count). There is a very small hit to processing power on single-socket machines, even with multiple cores, but it is hardly noticeable, so most distributions leave it enabled, since it provides a huge benefit to servers and machines with more than one socket.