Confusion about different running times of two algorithms in C

On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).

By "skipping" data (reading far from the previous read), the CPU has to read the memory again.

That's why your first snippet is faster.

For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.

To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.

for (i=0; i < ROWS; i++) {
    const long *row = m[i];
    for (j=0; j < COLS; j++) {
        sum += row[j];

If you cannot do this, that means that your data is wrongly oriented.

C uses row-major ordering to store multidimensional arrays, as documented in § Array subscripting, paragraph 3 of the C Standard:

Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).

Emphasis mine.

Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:

row and column major ordering

The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.

There are some other languages that use column-major ordering for multidimensional arrays; among them are R, FORTRAN and MATLAB. If you wrote equivalent code in these languages you would observe faster output with sum2.

Computers generally use cache to help speed up access to main memory.

The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.

Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.

The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.

So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.

Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.

Related Resource: Memory layout of multi-dimensional arrays

This is an issue with the cache.

The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.