Why is iterating 2D array row major faster than column major?

Your array is actually a ragged array, so row major isn't entirely a factor.

You're seeing better performance iterating columns-within-rows (row-major order) because each row's memory is laid out linearly, so reading it sequentially is easy for the hardware prefetcher to predict, and you amortize the pointer dereference into the second dimension, since it only needs to be done once per row.

When you iterate rows-within-columns (column-major order) instead, you incur that pointer dereference into the second dimension on every single iteration. Aside from the intrinsic cost of the extra dereference, the access pattern jumps between separately allocated rows, which is bad for prefetching.

If you want a true two-dimensional array, laid out in memory using row-major ordering, you would want...

int A[1000][1000];

This lays out the memory contiguously in row-major order, instead of one array of pointers to arrays (which are not laid out contiguously). Iterating over this array using row-major would still perform faster than iterating column-major because of spatial locality and cache prediction.


It obviously depends on the machine you're on, but very generally speaking:

  1. Your computer stores parts of your program's memory in a cache that has a much smaller latency than main memory (even when compensating for cache hit time).

  2. C arrays are stored contiguously in row-major order. This means that if you access element x, then element x+1 is stored in main memory at the location directly following where x is stored.

  3. It's typical for your computer's cache to "pre-emptively" fill itself with memory addresses that haven't been used yet but that are close to memory your program has used already. Think of your computer as saying: "well, you wanted memory at address X, so I am going to assume that you will shortly want memory at X+1; therefore I will pre-emptively grab that for you and place it in your cache".

When you enumerate your array in row-major order, you're enumerating it in the same order it's stored in memory, and your machine has already taken the liberty of pre-loading those addresses into the cache because it guessed you would want them. You therefore achieve a higher rate of cache hits. When you enumerate the array in some other, non-contiguous order, your machine likely won't predict your memory access pattern, so it won't be able to pre-emptively pull those addresses into the cache for you. You get fewer cache hits, and main memory, which is much slower than cache, has to be accessed more frequently.

Also, this might be better suited for https://cs.stackexchange.com/ because the way your system cache behaves is implemented in hardware, and spatial locality questions seem better suited there.


The short answer is CPU caches. Scott Meyers explains it very clearly here.