Why does using MFENCE with store instruction block prefetching in L1 cache?

It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:

wrmsr -a 0x1a4 "$((2#1110))"

If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.

First, it is important to note that the L2_RQSTS.RFO_* events count do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events are:

name      umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2

Note that none of the umask values have the 0x10 bit which indicates that events which originate from the L2 streamer should be tracked.

It seems like what happens is that when the L2 streamer is active, many of the events that you might expect to be assigned to one of those events are instead "eaten" by the L2 prefetcher events instead. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments again the umask |= 0x10 version of the event (indeed I get 2,000,000 total references when including that bit), which means that RFO_MISS and RFO_HIT and RFO_ALL will miss it.

It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly, but hit an in-progress load - but the complication here is the load was initiated by the L2 prefetcher.

The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.

I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.

Here are some useful perf commands you can use to see the difference in including the "L2 streamer origin" bit. Here's w/o the L2 streamer events:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/

and with them included:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/

I ran these against this code (with the sleep(1) lining up with the --delay=1000 command passed to perf to exclude the init code):

#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <unistd.h>

typedef struct _object{
  int value;
  char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}

Regarding the case with store operations, I have run the same loop on a Haswell processor in four different configurations:

MFENCE + E: There is an MFENCE instruction after the store. All hardware prefetchers are enabled.
E : There is no MFENCE. All hardware prefetchers are enabled.
MFENCE + D: There is an MFENCE instruction after the store. All hardware prefetchers are disabled.
D : There is no MFENCE. All hardware prefetchers are disabled.

The results are shown below, which are normalized by the number of stores (each store is to a different cache line). They are very deterministic across multiple runs.

                                 | MFENCE + E |      E     | MFENCE + D |      D     |
    L2_RQSTS.ALL_RFO             |    0.90    |    0.62    |    1.00    |    1.00    |
    L2_RQSTS.RFO_HIT             |    0.80    |    0.12    |    0.00    |    0.00    |
    L2_RQSTS.RFO_MISS            |    0.10    |    0.50    |    1.00    |    1.00    |
    OFFCORE_REQUESTS.DEMAND_RFO  |    0.20    |    0.88    |    1.00    |    1.00    |
    PF_L3_RFO                    |    0.00    |    0.00    |    0.00    |    0.00    |
    PF_RFO                       |    0.80    |    0.16    |    0.00    |    0.00    |
    DMND_RFO                     |    0.19    |    0.84    |    1.00    |    1.00    |

The first four events are core events and the last three events are off-core response events:

L2_RQSTS.ALL_RFO: Occurs for each RFO request to the L2. This includes RFO requests from stores that have retired or otherwise, and RFO requests from PREFETCHW. For the cases where the hardware prefetchers are enabled, the event count is less than what is expected, which is a normalized one. One can think of two possible reasons for this: (1) somehow some of the RFOs hit in the L1, and (2) the event is undercounted. We'll try to figure out which is it by examining the counts of the other events and recalling what we know about the L1D prefetchers.
L2_RQSTS.RFO_HIT and L2_RQSTS.RFO_MISS: Occur for an RFO that hits or misses in the L2, respectively. In all configurations, the sum of the counts of these events is exactly equal to L2_RQSTS.ALL_RFO.
OFFCORE_REQUESTS.DEMAND_RFO: The documentation of this event suggests that it should be the same as L2_RQSTS.RFO_MISS. However, observe that the sum of OFFCORE_REQUESTS.DEMAND_RFO and L2_RQSTS.RFO_HIT is actually equal to one. Thus, it's possible that L2_RQSTS.RFO_MISS undercounts (and so L2_RQSTS.ALL_RFO does too). In fact, this is the most likely explanation because the Intel optimization manual (and other Intel documents) say that only the L2 streamer prefetcher can track stores. The Intel performance counter manual mentions "L1D RFO prefetches" in the description of L2_RQSTS.ALL_RFO. These prefetches probably refer to RFOs from stores that have not retired yet (see the last section of the answer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop?).
PF_L3_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L3 only. All counts of this event are zero.
PF_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L2 and possibly the L3 (if the L3 is inclusive, then the line will also be filled into the L3 as well). The count of this event is close to L2_RQSTS.RFO_HIT. In the MFENCE + E case, it seems that 100% of the RFOs have completed on time (before the demand RFO has reached the L2). In the E case, 25% of prefetches did not complete on time or the wrong lines were prefetched. The reason why the number of RFO hits in the L2 is larger in the MFENCE + E case compared to the E case is that the MFENCE instruction delays later RFOs, thereby keeping most of the L2's super queue entries available for the L2 streamer prefetcher. So MFENCE really enables the L2 streamer prefetcher to perform better. Without it, there would be many in-flight demand RFOs at the L2, leaving a small number of super queue entries for prefetching.
DMND_RFO: The same as OFFCORE_REQUESTS.DEMAND_RFO, but it looks like it may undercount a little.

I checked with load operations. without mfence I get up to 2000 L1 hit, whereas with mfence, I have up to 1 million L1 hit (measured with papi MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched in L1 for load instruction.

Regarding the case with load operations, in my experience, MFENCE (or any other fence instruction) has no impact on the behavior of the hardware prefetchers. The true count of the MEM_LOAD_RETIRED.L1_HIT event here is actually very small (< 2000). Most of the events being counted are from MFENCE itself, not the loads. MFENCE (and SFENCE) require sending a fence request all the way to the memory controller to ensure that all pending stores have reached the global observation point. A fence request is not counted as an RFO event, but it may get counted as multiple events, including L1_HIT. For more information on this and similar observations, see my blog post: An Introduction to the Cache Hit and Miss Performance Monitoring Events.

Why does using MFENCE with store instruction block prefetching in L1 cache?

Tags:

Performance

X86

Intel

Prefetch

Memory Barriers

Related

Recent Posts