What is the difference between the __ldg() intrinsic and a normal load?

From the CUDA C Programming Guide:

Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.

...

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
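To make the quoted passage concrete, here is a minimal sketch of the two ways to ask for a read-only load: calling __ldg() explicitly, and marking the pointers const and __restrict__ so the compiler may choose the read-only path itself. The kernel names and the host helper run_scale_demo are made up for illustration. Whether the compiler actually emits a read-only load for the second kernel depends on the compiler version and target architecture, so treat it as a hint, not a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant 1: explicit read-only load.
// __ldg() requires compute capability 3.5 or higher.
__global__ void scale_ldg(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);
}

// Variant 2: const + __restrict__ tells the compiler that `in` is not
// written through any other pointer in this kernel, which makes it more
// likely (not certain) that it picks the read-only load on its own.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Hypothetical host helper: runs both kernels on the same input and
// returns true if both produce the expected result.
bool run_scale_demo()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_ldg<<<blocks, threads>>>(d_in, d_a, 2.0f, n);
    scale_restrict<<<blocks, threads>>>(d_in, d_b, 2.0f, n);

    // Blocking copies on the default stream wait for both kernels to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess) return false;

    // Both variants must compute the same values; only the load path differs.
    for (int i = 0; i < n; ++i)
        if (h_a[i] != 2.0f * h_in[i] || h_b[i] != h_a[i]) return false;
    return true;
}
```

Calling run_scale_demo() from main() and compiling with nvcc for an architecture of sm_35 or later should be enough to try it. To see which load instruction the compiler chose for each kernel, one option is to inspect the generated code with cuobjdump -sass.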

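So the difference is the cache path: on compute capability 3.5 devices a normal global load is cached only in L2, while a load issued through __ldg() can also be cached in the read-only data cache. Read-only cache accesses have much lower latency than ordinary global memory accesses. Because matrix multiplication reads the same values from memory many times, serving those reads from the read-only cache can give a large speedup in memory-bound applications.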
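As a rough illustration of that reuse pattern, here is a sketch of a naive matrix multiplication whose input reads go through __ldg(). The kernel name matmul_ldg, the helper run_matmul_demo, and the row-major square-matrix layout are assumptions made for this example, not taken from any particular codebase. It only shows where the read-only loads go; it makes no claim about how much faster it runs on a given device.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive C = A * B for square n x n row-major matrices.
// A and B are never written by the kernel, so they may be read via __ldg()
// (compute capability 3.5 or higher).
__global__ void matmul_ldg(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < n; ++k) {
        // Each element of A is reused by every thread in the same row of C,
        // and each element of B by every thread in the same column.
        acc += __ldg(&A[row * n + k]) * __ldg(&B[k * n + col]);
    }
    C[row * n + col] = acc;
}

// Hypothetical host helper: multiplies an all-ones matrix by an all-twos
// matrix, so every element of C should equal 2 * n.
bool run_matmul_demo()
{
    const int n = 64;
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> h_A(n * n, 1.0f), h_B(n * n, 2.0f), h_C(n * n, 0.0f);

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);
    dim3 blocks((n + threads.x - 1) / threads.x,
                (n + threads.y - 1) / threads.y);
    matmul_ldg<<<blocks, threads>>>(d_A, d_B, d_C, n);

    // The blocking copy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n * n; ++i)
        if (h_C[i] != 2.0f * n) return false;
    return true;
}
```

A tuned implementation would usually also stage tiles in shared memory; that is left out here to keep the focus on the read-only loads.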

Tags:

Cuda