Why is NumPy sometimes slower than NumPy + plain Python loop?

It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from memory in blocks, and these blocks are cached; if an algorithm manages to use data from a block that is already in the cache, there is no need to read from main memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
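As a quick illustration of the effect (the array size here is my own choice, not from the question): summing a contiguous row of a large C-ordered array is typically much faster than summing a column of the same length, because the column touches a new cache line for every element.

import timeit
import numpy as np

a = np.zeros((4000, 4000), dtype=np.int64)  # 128 MB, far larger than typical CPU caches

# A row is 32 KB of contiguous memory; a column is one 8-byte element every 32 KB.
row_time = timeit.timeit(lambda: a[0, :].sum(), number=1000)
col_time = timeit.timeit(lambda: a[:, 0].sum(), number=1000)
print(f"row: {row_time:.3f} s   column: {col_time:.3f} s")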

Consider these variations, which differ only in which axis we iterate over:

import numpy as np

def f_2_0(arr):
    # Slice along axis 0: each arr[val, :, :] is one contiguous block (C order, the NumPy default).
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    # Slice along axis 1: each slice is arr.shape[0] contiguous runs of arr.shape[2] elements.
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    # Slice along axis 2: consecutive elements are arr.shape[2] elements apart in memory,
    # so each typically lands on a different cache line.
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans
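The question's data isn't reproduced above; an array along these lines (the shape and dtype are my guess, not from the question) shows the same ordering of timings:

# Hypothetical input: non-negative integers, ~128 MB (int64 on most platforms),
# much larger than any CPU cache.
data = np.random.randint(0, 100, size=(256, 256, 256))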

And the results on my laptop:

%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can see that f_2_1 is almost as fast as f_1, which makes me think that NumPy is not using the optimal access pattern (the one used by f_2_0). The explanation of exactly how caching affects the timings is in the other answer.


This is due to memory access and caching. Each of these functions does two things. Taking the first code as an example:

np.sum(arr > 0)

It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array with the same shape as arr. Then it sums this array.
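Written out explicitly, those two steps are:

mask = arr > 0        # step 1: allocate a boolean array with the same shape as arr
total = np.sum(mask)  # step 2: sum the intermediate (each True counts as 1)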

Straightforward, right? Well, when using np.sum(arr > 0) on the whole input, this intermediate is a large array. When it's large enough not to fit in the cache, performance decreases: by the time the processor starts to execute the sum, most of the array's elements will have been evicted from the cache and need to be reloaded from main memory.
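You can check how big that intermediate is directly. With the guessed (256, 256, 256) shape from above it is already in L3-cache territory, and the question's real data may well be larger:

mask = data > 0
print(mask.nbytes)  # 16_777_216 bytes (~16 MB): one byte per element for bool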

Since f_2_0 iterates over the first dimension, it deals with smaller sub-arrays. The same comparison and sum are done, but this time the intermediate array fits in the cache. It's created, used, and destroyed without ever leaving the cache. This is much faster.
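The per-slice intermediate, by contrast, is tiny (again using the guessed shape):

mask_slice = data[0, :, :] > 0
print(mask_slice.nbytes)  # 65_536 bytes (64 KB): fits comfortably in L1/L2 cache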

Now, you might think that f_3 would be the fastest (using a built-in method and all), but looking at the source code shows that it performs the following operations:

a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)

a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.
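So the built-in pays the same full-size intermediate cost. If f_3 wraps np.count_nonzero (my assumption; the question's code isn't shown here), a chunked variant in the style of f_2_0 avoids the large intermediate:

def f_3_chunked(arr):
    # One cache-sized boolean intermediate per slice instead of one huge one.
    # Non-zero equals "greater than zero" here because arr is non-negative.
    return sum(np.count_nonzero(arr[i]) for i in range(arr.shape[0]))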

Conclusions

Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).

Python does some things very well; where its built-in machinery is already optimized, it can beat NumPy. Don't be afraid to use plain old Python code or datatypes in combination with NumPy.

If you frequently find yourself writing manual for loops for better performance, you may want to take a look at numexpr; it does some of this automatically. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
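A sketch of what that could look like here (not benchmarked against the timings above; the where(arr > 0, 1, 0) form sidesteps summing booleans directly, which numexpr may not support):

import numexpr as ne

def f_ne(arr):
    # numexpr evaluates the expression in cache-sized blocks,
    # so the full boolean intermediate is never materialized.
    return int(ne.evaluate("sum(where(arr > 0, 1, 0))"))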