What is the meaning of "non temporal" memory accesses in x86?

Non-temporal SSE instructions (MOVNTI, MOVNTQ, etc.) don't follow the normal memory-ordering rules: they remain cache-coherent, but they are weakly ordered with respect to other stores. Therefore non-temporal stores must be followed by an SFENCE instruction before any later store (for example, a "data ready" flag) that tells other processors the results can be read.
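As a rough illustration of the SFENCE requirement, here is a minimal C sketch using SSE2 intrinsics. The slot_t type and the publish/try_consume helpers are hypothetical names made up for this example; the intrinsics (_mm_stream_si128, _mm_sfence) and the C11 atomics are real. It assumes an x86 target with SSE2.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical message slot: a 16-byte payload plus a "ready" flag. */
typedef struct {
    __m128i    payload;  /* naturally 16-byte aligned */
    atomic_int ready;
} slot_t;

/* Producer: write the payload with a non-temporal store, then set the flag. */
static void publish(slot_t *s, int32_t value)
{
    _mm_stream_si128(&s->payload, _mm_set1_epi32(value));  /* MOVNTDQ */

    /* On x86 a release store compiles to a plain MOV, which does not order
       earlier non-temporal stores, so an explicit SFENCE is needed here. */
    _mm_sfence();

    atomic_store_explicit(&s->ready, 1, memory_order_release);
}

/* Consumer: returns 1 and fills *out once the flag is visible, else 0. */
static int try_consume(slot_t *s, int32_t *out)
{
    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return 0;
    *out = _mm_cvtsi128_si32(_mm_load_si128(&s->payload));
    return 1;
}
```

Without the _mm_sfence() call, another core could in principle observe ready == 1 before the payload has left the write-combining buffer.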

When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled the sheer size evicts the first elements, making caching of the writes ineffective.

For this and similar situations, processors provide support for non-temporal write operations. Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.
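The "fill a large structure, use it later" case could look roughly like this in C. This is only a sketch: fill_nt is a hypothetical helper, and it assumes an x86 target with SSE2, a 16-byte-aligned destination, and an element count that is a multiple of 4.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `count` 32-bit elements with `value` using
   non-temporal stores, so the written lines are not pulled into the cache.
   Assumes dst is 16-byte aligned and count is a multiple of 4. */
static void fill_nt(int32_t *dst, size_t count, int32_t value)
{
    const __m128i v = _mm_set1_epi32(value);

    for (size_t i = 0; i < count; i += 4) {
        /* MOVNTDQ: no read-for-ownership of the destination cache line. */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }

    /* Make the weakly ordered stores ordered before any later stores. */
    _mm_sfence();
}
```

Whether this beats ordinary stores depends on the buffer size relative to the caches and on the specific CPU, so it is worth measuring before committing to it.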

Source: http://lwn.net/Articles/255364/


Espo is pretty much bang on target. Just wanted to add my two cents:

The phrase "non temporal" means lacking temporal locality. Caches exploit two kinds of locality, spatial and temporal, and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item to be used in the near future.

I am a little skeptical about the hand-coded assembly that uses the cache control instructions. In my experience these things lead to more evil bugs than any effective performance increases.


According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture, "Programming with Intel Streaming SIMD Extensions (Intel SSE)" chapter:

Caching of Temporal vs. Non-Temporal Data

Data referenced by a program can be temporal (data will be used again) or non-temporal (data will be referenced once and not reused in the immediate future). For example, program code is generally temporal, whereas, multimedia data, such as the display list in a 3-D graphics application, is often non-temporal. To make efficient use of the processor’s caches, it is generally desirable to cache temporal data and not cache non-temporal data. Overloading the processor’s caches with non-temporal data is sometimes referred to as "polluting the caches". The SSE and SSE2 cacheability control instructions enable a program to write non-temporal data to memory in a manner that minimizes pollution of caches.

Description of non-temporal load and store instructions. Source: Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2: Instruction Set Reference

LOAD (MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint)

Loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type [...]

[...] the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.

Note that, as Peter Cordes comments, MOVNTDQA is not useful on normal WB (write-back) memory on current processors, because the NT hint is ignored (probably because there are no NT-aware HW prefetchers) and the full strongly-ordered load semantics apply. PREFETCHNTA can be used instead as a pollution-reducing load from WB memory.
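A sketch of the prefetchnta approach in C, for data that is read once and not reused. The sum_once function and the PREFETCH_AHEAD distance are assumptions for illustration only; a good distance is machine-dependent and would need tuning. _mm_prefetch and _MM_HINT_NTA are the real intrinsic and hint from xmmintrin.h.

```c
#include <xmmintrin.h>   /* SSE: _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in bytes; an arbitrary value, not a recommendation. */
enum { PREFETCH_AHEAD = 256 };

/* Hypothetical helper: sum a buffer that will not be read again soon,
   asking the CPU to fetch upcoming lines with the non-temporal hint. */
static uint64_t sum_once(const uint8_t *data, size_t len)
{
    uint64_t total = 0;

    for (size_t i = 0; i < len; ++i) {
        /* Issue one PREFETCHNTA per 64-byte line, a few lines ahead. */
        if ((i % 64) == 0 && i + PREFETCH_AHEAD < len)
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD), _MM_HINT_NTA);

        total += data[i];  /* ordinary load; the prefetch is only a hint */
    }
    return total;
}
```

The loads themselves stay ordinary, strongly ordered loads; only the prefetch carries the NT hint, and the processor is free to ignore it.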

STORE (MOVNTDQ—Store Packed Integers Using Non-Temporal Hint)

Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory.

[...] the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.

Using the terminology defined in Cache Write Policies and Performance, they can be considered as write-around (no-write-allocate, no-fetch-on-write-miss).

Finally, it may be interesting to review John McCalpin's notes about non-temporal stores.
