Should I prefer stride one memory access for either reading or writing?

The access pattern you call "writes stride one" (y[i]=x[q(i)]) is usually faster.

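If memory is cached and your data elements are smaller than a cache line, this access pattern requires less memory bandwidth. With the usual write-allocate caches, a store that touches only part of a cache line generally forces the whole line to be read in before it is modified and later written back, whereas a scattered load only has to read the line.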

Modern processors usually have more load execution units than store units. Also, Intel's Haswell architecture (AVX2) provides only GATHER instructions; SCATTER arrived only later, with AVX-512. All this also favors the "writes stride one" pattern.

Working in a threaded environment does not change this.
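
To make the two access patterns discussed above concrete, here is a minimal sketch. The function names gather and scatter and the use of std::vector are my own assumptions for illustration; the question only gives the y[i]=x[q(i)] form.

```cpp
#include <cstddef>
#include <vector>

// "Writes stride one": stores to y are contiguous, loads from x are indexed.
// (Hypothetical helper name; q is assumed to hold valid indices into x.)
void gather(std::vector<double>& y, const std::vector<double>& x,
            const std::vector<std::size_t>& q) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = x[q[i]];
}

// "Reads stride one": loads from x are contiguous, stores to y are indexed.
// (Hypothetical helper name; q is assumed to hold valid indices into y.)
void scatter(std::vector<double>& y, const std::vector<double>& x,
             const std::vector<std::size_t>& q) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[q[i]] = x[i];
}
```

When q is a permutation, both functions move the same amount of data; they differ only in which side of the assignment is accessed irregularly.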


I'd like to share the results of my simple benchmarks. Suppose we have two square NxN matrices A and B of doubles, and we want to copy one into the other with a transposition:

A = transpose(B)

Algorithms:

  1. Two nested loops such that reads are contiguous and writes are strided.
  2. Two nested loops such that reads are strided and writes are contiguous.
  3. Sequential MKL's mkl_domatcopy.

Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects.
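
The two loop variants and the baseline could look roughly like this. This is only a sketch, assuming row-major storage in a flat std::vector; the function names are hypothetical and this is not the benchmarked source itself. The MKL variant is omitted because it needs the library to build.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // assumed layout: row-major, n x n

// Variant 1: reads of B are contiguous, writes to A have stride n.
void transpose_strided_writes(Matrix& a, const Matrix& b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[j * n + i] = b[i * n + j];
}

// Variant 2: reads of B have stride n, writes to A are contiguous.
void transpose_strided_reads(Matrix& a, const Matrix& b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[i * n + j] = b[j * n + i];
}

// Baseline: plain copy, both reads and writes contiguous.
void copy_plain(Matrix& a, const Matrix& b, std::size_t n) {
    for (std::size_t k = 0; k < n * n; ++k)
        a[k] = b[k];
}
```

Both transposition variants produce the same A; only the inner-loop access pattern differs, which is what the benchmark isolates.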

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:

[Figure: benchmark results for Intel Core i7-4770]

Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:

[Figure: benchmark results for Intel Xeon E5-2650 v3]

Numbers and C++ source code