Can the Intel performance monitor counters be used to measure memory bandwidth?

The offcore response performance monitoring facility can be used to count all core-originated requests on the IDI from a particular core. The request type field can be used to count specific types of requests, such as demand data reads. However, to measure per-core memory bandwidth, the number of requests has to somehow be converted into bytes per second. Most requests are the size of a cache line, i.e., 64 bytes. The size of other requests may not be known, and they could add to the memory bandwidth a number of bytes that is smaller or larger than a cache line. These include cache line-split locked requests, WC requests, UC requests, and I/O requests (but these don't contribute to memory bandwidth), and fence requests that require all pending writes to be completed (MFENCE, SFENCE, and serializing instructions).

If you are only interested in cacheable bandwidth, then you can count the number of cacheable requests and multiply that by 64 bytes. This can be very accurate, assuming that cacheable cache line-split locked requests are rare. Unfortunately, writebacks from the L3 (or L4, if available) to memory cannot be counted by the offcore response facility on any of the current microarchitectures. The reason is that these writebacks are not core-originated and usually occur as a consequence of a conflict miss in the L3. So the request that missed in the L3 and caused the writeback can be counted, but the offcore response facility does not let you determine whether any given request to the L3 (or L4) caused a writeback or not. That's why it's impossible to count writebacks to memory "per core."

In addition, offcore response events require a programmable performance counter, and specifically one of counters 0-3 (they cannot be programmed on counters 4-7, which are only available when hyperthreading is disabled).
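
For example, assuming perf exposes an offcore response event for demand data reads on your microarchitecture (event names differ across generations; check perf list for the exact spelling), a rough per-core estimate of cacheable read bandwidth could look like this:

# Count demand data reads from core 0 for one second. The event name below is
# an assumption and may be spelled differently (or be missing) on your CPU.
perf stat -C 0 -e offcore_response.demand_data_rd.any_response sleep 1
# Estimated read bandwidth ≈ reported count * 64 bytes / 1 second.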

Intel Xeon Broadwell processors support a number of Resource Director Technology (RDT) features. In particular, they support Memory Bandwidth Monitoring (MBM), which in general is the only way to measure memory bandwidth accurately per core.

MBM has three advantages over offcore response:

  • It enables you to measure bandwidth of one or more tasks identified with a resource ID, rather than just per core.
  • It does not require one of the general-purpose programmable performance counters.
  • It can accurately measure local or total bandwidth, including writebacks to memory.

The advantage of offcore response is that it supports request type, supplier type, and snoop info fields.

Linux supports MBM starting with kernel version 4.6. On kernels 4.6 through 4.13, the MBM events are exposed in perf under the following event names:

intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
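
For example, on those kernels the counters could be read system-wide with perf stat (a sketch using the event names above; check perf list for the exact PMU and event names exposed by your kernel):

# Count MBM bytes across the whole system for one second.
perf stat -a -e intel_cqm_llc/local_bytes/,intel_cqm_llc/total_bytes/ sleep 1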

The events can also be accessed programmatically.

Starting with kernel 4.14, the implementation of RDT in Linux changed significantly; monitoring is now done through the resctrl filesystem rather than through perf events.

On my BDW-E5 (dual socket) system running kernel version 4.16, I can see the byte counts of MBM using the following sequence of commands:

// Mount the resctrl filesystem.
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

// Print the number of local bytes on the first socket.
cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes

// Print the number of total bytes on the first socket.
cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes

// Print the number of local bytes on the second socket.
cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes

// Print the number of total bytes on the second socket.
cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes

My understanding is that the number of bytes is counted since system reset.
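
To get a bandwidth figure rather than a running total, you can difference two reads of the counter over a known interval. A minimal sketch in plain shell for local bandwidth on the first socket:

# Read the (monotonically increasing) counter twice, one second apart,
# and report the difference as an approximate bytes-per-second figure.
b1=$(cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes)
sleep 1
b2=$(cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes)
echo "approx. local bandwidth on socket 0: $((b2 - b1)) bytes/s"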

Note that by default, the resource being monitored is the whole socket.
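
If you want per-task counts instead, the resctrl interface lets you create a monitoring group and move tasks into it. A sketch based on the kernel's resctrl documentation, where mygroup and <pid> are placeholders:

# Create a monitoring group and assign a task to it; the group then gets its
# own mon_data counters covering only the tasks in that group.
mkdir /sys/fs/resctrl/mon_groups/mygroup
echo <pid> > /sys/fs/resctrl/mon_groups/mygroup/tasks
cat /sys/fs/resctrl/mon_groups/mygroup/mon_data/mon_L3_00/mbm_local_bytes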

Unfortunately, most RDT features, including MBM, turned out to be buggy on the Skylake processors that support them. According to errata SKZ4 and SKX4:

Intel® Resource Director Technology (RDT) Memory Bandwidth Monitoring (MBM) does not count cacheable write-back traffic to local memory. This results in the RDT MBM feature under counting total bandwidth consumed.

That is why MBM is disabled by default on Linux when running on Skylake-X and Skylake-SP (which are the only Skylake processors that support it). You can enable MBM by adding the parameter rdt=mbmtotal,mbmlocal to the kernel command line. There is no flag in any hardware register to enable or disable MBM or any other RDT feature; instead, this is tracked in a data structure in the kernel.
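
For example, on a GRUB-based system this amounts to appending the option to the kernel command line and regenerating the boot configuration (a sketch; the file and update command vary by distribution):

# In /etc/default/grub, append the option to the existing command line:
#   GRUB_CMDLINE_LINUX_DEFAULT="... rdt=mbmtotal,mbmlocal"
# then regenerate the GRUB configuration and reboot:
update-grub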

On the Intel Core 2 microarchitecture, memory bandwidth per core can be measured using the BUS_TRANS_MEM event as discussed here.


On some architectures, you can use perf to access the uncore PMU counters of the memory controller.

$ perf list
[...]
uncore_imc_0/cas_count_read/                       [Kernel PMU event]
uncore_imc_0/cas_count_write/                      [Kernel PMU event]
uncore_imc_0/clockticks/                           [Kernel PMU event]
[...]

Then:

$ perf stat -e "uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/" <program> <arguments>

will report the number of bytes transferred between main memory and the caches by read and write operations through memory controller #0. Divide that number by the elapsed time and you have an approximation of the average memory bandwidth used.


Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.

One approach is to use the programmable memory controller counters, which are accessed via PCI space. A good place to start is by examining Intel's own implementation in pcm-memory at pcm-memory.cpp. This app shows you the per-socket or per-memory-controller throughput, which is suitable for some uses. In particular, that bandwidth is shared among all cores, so on a quiet machine you can assume most of it is associated with the process under test; and if you want to monitor at the socket level, it's exactly what you want.

The other alternative is careful programming of the "offcore response" counters. These, as far as I know, relate to traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get a read and write bandwidth. The events are quite fine-grained, so you can further break the traffic down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, etc.
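
As a sketch, assuming Skylake-style event names that may be spelled differently (or be absent) on other generations, read and write-allocate traffic that missed the L3 could be counted per program and scaled by the 64-byte line size:

# Demand reads and RFOs (stores) that missed the L3; multiply each count by
# 64 bytes to approximate the corresponding DRAM traffic.
perf stat -e offcore_response.demand_data_rd.l3_miss.any_snoop \
          -e offcore_response.demand_rfo.l3_miss.any_snoop \
          <program> <arguments>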

The offcore response counters generally lag behind in support by tools like perf and likwid, but recent versions seem to have reasonable support, even for client parts like SKL.


Yes(ish), indirectly. You can use the relationship between counters (including a time stamp) to infer other numbers. For example, if you sample a 1-second interval and there are N last-level (L3) cache misses, you can be pretty confident you are occupying N*CacheLineSize bytes per second.
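
A sketch of that approach using perf's generic cache events (which map to model-specific counters under the hood):

# Sample the whole system for one second; multiplying the miss counts by the
# 64-byte line size gives a rough bytes-per-second figure.
perf stat -a -e LLC-load-misses,LLC-store-misses sleep 1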

It gets a bit stickier to relate it accurately to program activity, as those misses might reflect CPU prefetching, interrupt activity, etc.

There is also a morass of ‘this CPU doesn’t count (MMX, SSE, AVX, ...) unless this config bit is in this state’; thus rolling your own is cumbersome.