Faster way to convert a vector of vectors to a single contiguous vector with opposite storage order

First of all, I cringe whenever something is qualified with "obviously". That word is often used to cover up a shortcoming in one's deductions.

"But obviously a similar approach is very slow for row-wise conversion, because of all of the cache misses."

I'm not sure which is supposed to be obvious: that the row-wise conversion would be slow, or that it's slow because of cache misses. In either case, I find it not obvious. After all, there are two caching considerations here, aren't there? One for reading and one for writing? Let's look at the code from the reading perspective:
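(For reference, here is the setup the snippets below assume. This is my sketch of it, with the names the snippets use and the sizes from the question's 50-vectors-of-4000-elements example.)

size_t n_col = 50;    // number of inner vectors
size_t n_row = 4000;  // length of each inner vector
std::vector<std::vector<double>> vec(n_col, std::vector<double>(n_row));
std::vector<double> out_vec(n_row * n_col);  // flat, row-major output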

row major naive

// Read each inner vector straight through; write with a stride of n_col.
for (size_t i = 0; i < n_col; ++i)
    for (size_t j = 0; j < n_row; ++j)
        out_vec[j * n_col + i] = vec[i][j];

Successive reads from vec are reads of contiguous memory: vec[i][0] followed by vec[i][1], etc. Very good for caching. So... cache misses? Slow? :) Maybe not so obvious.

Still, there is something to be gleaned from this. The claim is wrong only in calling this "obvious". There are non-locality issues, but they occur on the writing end. (Successive writes are offset by the space for 50 double values, a 400-byte stride, so each write lands on a different cache line.) And empirical testing confirms the slowness. So maybe a solution is to flip what is considered "obvious"?

row major flipped

// Same assignment, loops reversed: reads now hop between vectors, writes are sequential.
for (size_t j = 0; j < n_row; ++j)
    for (size_t i = 0; i < n_col; ++i)
        out_vec[j * n_col + i] = vec[i][j];

All I did here was reverse the loops. Literally swap the order of those two lines of code, then adjust the indentation. Now successive reads are potentially all over the place, as they read from different vectors. However, successive writes are now to contiguous blocks of memory. In one sense, we are in the same situation as before. But just like before, one should measure performance before assuming "fast" or "slow".

Results (the question's shape, 50 vectors of 4000 elements):

NaiveColumnMajor: 3.4 seconds
NaiveRowMajor: 7.7 seconds
FlippedRowMajor: 4.2 seconds
BlockingRowMajor: 4.4 seconds
BlockingColumnMajor: 3.9 seconds

Still slower than the naive column major conversion. However, this approach is not only faster than naive row major, but it's also faster than blocking row major. At least on my computer (using gcc -O3 and obviously :P iterating thousands of times). Mileage may vary. I don't know what the fancy profiling tools would say. The point is that sometimes simpler is better.
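(To make "iterating thousands of times" concrete: a minimal harness along the lines below would do it. This is a sketch I'm adding for illustration, not the exact benchmark behind the numbers above; the repetition count and the clock choice are just reasonable defaults.)

#include &lt;chrono&gt;
#include &lt;cstddef&gt;
#include &lt;iostream&gt;
#include &lt;vector&gt;

int main()
{
    const std::size_t n_col = 50, n_row = 4000;  // the question's shape
    std::vector&lt;std::vector&lt;double&gt;&gt; vec(n_col, std::vector&lt;double&gt;(n_row, 1.0));
    std::vector&lt;double&gt; out_vec(n_row * n_col);

    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 10000; ++rep)        // iterate thousands of times
        for (std::size_t j = 0; j < n_row; ++j)  // flipped row major, from above
            for (std::size_t i = 0; i < n_col; ++i)
                out_vec[j * n_col + i] = vec[i][j];
    const auto stop = std::chrono::steady_clock::now();

    std::cout << out_vec.back() << ' '  // touch the output so the work is not optimized away
              << std::chrono::duration&lt;double&gt;(stop - start).count() << " s\n";
}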

For funsies I did a test where the dimensions are swapped (changing from 50 vectors of 4000 elements to 4000 vectors of 50 elements). All methods got hurt this way, but "NaiveRowMajor" took the biggest hit. Worth noting is that "flipped row major" fell behind the blocking version. So, as one might expect, the best tool for the job depends on what exactly the job is.

NaiveColumnMajor: 3.7 seconds
NaiveRowMajor: 16 seconds
FlippedRowMajor: 5.6 seconds
BlockingRowMajor: 4.9 seconds
BlockingColumnMajor: 4.5 seconds

(By the way, I also tried the flipping trick on the blocking version. The change was small -- around 0.2 seconds -- and in the opposite direction from flipping the naive version. That is, "flipped blocking" was slower than "blocking" for the question's 50-of-4000 vectors, but faster for my 4000-of-50 variant. Fine tuning might improve the results.)


Update: I did a little more testing with the flipping trick on the blocking version. This version has four loops, so "flipping" is not as straightforward as when there are only two loops. It looks like swapping the order of the outer two loops is bad for performance, while swapping the inner two loops is good. (Initially, I had done both and gotten mixed results.) When I swapped just the inner loops, I measured 3.8 seconds (and 4.1 seconds in the 4000-of-50 scenario), making this the best row-major option in my tests.
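(For reference, a four-loop tiled version like the following is what I have in mind -- a sketch using the same index convention as the hybrid below, not necessarily the question's exact code. "Swapping the inner loops" then means exchanging the j and i loops.)

for (size_t l = 0; l < n_col; l += block_side)
    for (size_t k = 0; k < n_row; k += block_side)
        for (size_t j = l; j < l + block_side && j < n_col; ++j)
            for (size_t i = k; i < k + block_side && i < n_row; ++i)
                out_vec[i * n_col + j] = vec[j][i];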

row major hybrid

// Walk one block of columns at a time; within a block, write each row's chunk contiguously.
for (size_t l = 0; l < n_col; l += block_side)
    for (size_t i = 0; i < n_row; ++i)
        for (size_t j = l; j < l + block_side && j < n_col; ++j)
            out_vec[i * n_col + j] = vec[j][i];

(After swapping the inner loops, the two middle loops both walked over rows, so I merged them into the single loop over i.)

As for the theory behind this, I would guess that this amounts to trying to write one cache block at a time. Once a block is written, try to reuse the source vectors (the vec[j]) before they get evicted from the cache. After you exhaust those source vectors, move on to a new group of source vectors, again writing full blocks at a time.
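(If that guess holds, a natural starting point for fine tuning would be to size block_side so that one chunk of writes fills exactly one cache line. The 64-byte line size below is a typical value, not something I measured.)

constexpr size_t cache_line_bytes = 64;                           // common on current x86-64 and ARM
constexpr size_t block_side = cache_line_bytes / sizeof(double);  // 8 doubles per line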