Why is a `for` loop so much faster to count True values?

@MarkusMeskanen's answer has the right bits – function calls are slow, and both genexprs and listcomps are basically function calls.

Anyway, to be pragmatic:

Using str.count(c) is faster, and this related answer of mine about strpbrk() in Python could make things faster still.

def count_even_digits_spyr03_count(n):
    s = str(n)
    return sum(s.count(c) for c in "02468")


def count_even_digits_spyr03_count_unrolled(n):
    s = str(n)
    return s.count("0") + s.count("2") + s.count("4") + s.count("6") + s.count("8")

Results:

string length: 502
count_even_digits_spyr03_list 0.04157966522
count_even_digits_spyr03_sum 0.05678154459
count_even_digits_spyr03_for 0.036128606150000006
count_even_digits_spyr03_count 0.010441866129999991
count_even_digits_spyr03_count_unrolled 0.009662931009999999
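
For anyone who wants to reproduce this kind of comparison locally, here is a minimal sketch of a timing harness. It is not the code that produced the numbers above: the test input, the repeat counts, and the `timings` dict are my own assumptions, and only the functions whose definitions appear on this page are included.

```python
import timeit


def count_even_digits_spyr03_for(n):
    count = 0
    for c in str(n):
        if c in "02468":
            count += 1
    return count


def count_even_digits_spyr03_sum(n):
    return sum(c in "02468" for c in str(n))


def count_even_digits_spyr03_count(n):
    s = str(n)
    return sum(s.count(c) for c in "02468")


def count_even_digits_spyr03_count_unrolled(n):
    s = str(n)
    return s.count("0") + s.count("2") + s.count("4") + s.count("6") + s.count("8")


# Assumed test input: a 500-digit number (the original input is not shown).
n = int("1234567890" * 50)

candidates = [
    count_even_digits_spyr03_sum,
    count_even_digits_spyr03_for,
    count_even_digits_spyr03_count,
    count_even_digits_spyr03_count_unrolled,
]

timings = {}
for func in candidates:
    # Keep the best of a few repeats to reduce noise from other processes.
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(n), number=1000, repeat=3)
    )

for name, seconds in timings.items():
    print(name, seconds)
```

Absolute numbers will differ by machine and Python version; only the relative ordering is meaningful.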

If we use dis.dis(), we can see how the functions actually behave.

count_even_digits_spyr03_for():

  7           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (count)

  8           6 SETUP_LOOP              42 (to 51)
              9 LOAD_GLOBAL              0 (str)
             12 LOAD_GLOBAL              1 (n)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
        >>   19 FOR_ITER                28 (to 50)
             22 STORE_FAST               1 (c)

  9          25 LOAD_FAST                1 (c)
             28 LOAD_CONST               2 ('02468')
             31 COMPARE_OP               6 (in)
             34 POP_JUMP_IF_FALSE       19

 10          37 LOAD_FAST                0 (count)
             40 LOAD_CONST               3 (1)
             43 INPLACE_ADD
             44 STORE_FAST               0 (count)
             47 JUMP_ABSOLUTE           19
        >>   50 POP_BLOCK

 11     >>   51 LOAD_FAST                0 (count)
             54 RETURN_VALUE

We can see that there's only one function call, the call to str() at the beginning:

9 LOAD_GLOBAL              0 (str)
...
15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)

The rest is cheap, simple bytecode: jumps, stores, and in-place addition.

As for count_even_digits_spyr03_sum():

 14           0 LOAD_GLOBAL              0 (sum)
              3 LOAD_CONST               1 (<code object <genexpr> at 0x10dcc8c90, file "test.py", line 14>)
              6 LOAD_CONST               2 ('count2.<locals>.<genexpr>')
              9 MAKE_FUNCTION            0
             12 LOAD_GLOBAL              1 (str)
             15 LOAD_GLOBAL              2 (n)
             18 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             21 GET_ITER
             22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             25 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             28 RETURN_VALUE

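While I can't perfectly explain the differences, we can clearly see that there are more function calls: besides str(), there is one call that creates the generator and one call to sum(). The per-character work (the in test) doesn't appear here at all; it lives in the separate <genexpr> code object, which sum() has to resume once for every character. That makes the code run much slower than executing the loop's bytecode instructions inline.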
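
If you'd rather not count the call instructions by eye, something like the following sketch can do it. The names `for_version`, `sum_version`, and `count_calls` are made up for this example, and the opcode names are version-dependent (CALL_FUNCTION on older CPython, CALL on newer releases), so treat the check as an assumption about your interpreter rather than a guarantee.

```python
import dis


def for_version(n):
    count = 0
    for c in str(n):
        if c in "02468":
            count += 1
    return count


def sum_version(n):
    return sum(c in "02468" for c in str(n))


def count_calls(func):
    """Count call instructions in func's own bytecode.

    Nested code objects (such as the <genexpr> body) are not included,
    because dis.get_instructions() does not recurse into them.
    """
    return sum(
        1
        for ins in dis.get_instructions(func)
        if ins.opname in ("CALL", "CALL_FUNCTION")
    )


for_calls = count_calls(for_version)
sum_calls = count_calls(sum_version)
print("for loop:      ", for_calls, "call instruction(s)")
print("sum + genexpr: ", sum_calls, "call instruction(s)")
```

Note that this only counts calls visible in the outer function; the repeated resuming of the generator happens inside sum() and never shows up as a call instruction.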


sum is quite fast, but sum isn't the cause of the slowdown. Three primary factors contribute to the slowdown:

  • The use of a generator expression causes overhead for constantly pausing and resuming the generator.
  • Your generator version adds unconditionally instead of only when the digit is even. This is more expensive when the digit is odd.
  • Adding booleans instead of ints prevents sum from using its integer fast path.

Generators offer two primary advantages over list comprehensions: they take a lot less memory, and they can terminate early if not all elements are needed. They are not designed to offer a time advantage in the case where all elements are needed. Suspending and resuming a generator once per element is pretty expensive.
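
Both advantages are easy to observe. The snippet below is only an illustration; the `record` helper and the variable names are invented for it, and the exact sizes reported by sys.getsizeof vary between Python builds.

```python
import sys

digits = "1234567890" * 50

# Memory: the generator object stays small regardless of input length,
# while the list has to hold a reference for every element.
gen = (c in "02468" for c in digits)
lst = [c in "02468" for c in digits]
gen_size = sys.getsizeof(gen)
list_size = sys.getsizeof(lst)
print("generator:", gen_size, "bytes; list:", list_size, "bytes")

# Early termination: any() stops pulling from the generator at the first
# True, so only the characters up to the first even digit are examined.
seen = []


def record(c):
    """Hypothetical helper that logs which characters were tested."""
    seen.append(c)
    return c in "02468"


found = any(record(c) for c in digits)
print(found, seen)
```

Neither advantage helps when counting, since counting needs every element.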

If we replace the genexp with a list comprehension:

In [66]: def f1(x):
   ....:     return sum(c in '02468' for c in str(x))
   ....: 
In [67]: def f2(x):
   ....:     return sum([c in '02468' for c in str(x)])
   ....: 
In [68]: x = int('1234567890'*50)
In [69]: %timeit f1(x)
10000 loops, best of 5: 52.2 µs per loop
In [70]: %timeit f2(x)
10000 loops, best of 5: 40.5 µs per loop

we see an immediate speedup, at the cost of wasting a bunch of memory on a list.


If you look at your genexp version:

def count_even_digits_spyr03_sum(n):
    return sum(c in "02468" for c in str(n))

you'll see it has no if. It just throws booleans into sum. In contrast, your loop:

def count_even_digits_spyr03_for(n):
    count = 0
    for c in str(n):
        if c in "02468":
            count += 1
    return count

only adds anything if the digit is even.

If we change the f2 defined earlier to also incorporate an if, we see another speedup:

In [71]: def f3(x):
   ....:     return sum([True for c in str(x) if c in '02468'])
   ....: 
In [72]: %timeit f3(x)
10000 loops, best of 5: 34.9 µs per loop

f1, identical to your original code, took 52.2 µs, and f2, with just the list comprehension change, took 40.5 µs.


It probably looked pretty awkward using True instead of 1 in f3. That's because changing it to 1 activates one final speedup. sum has a fast path for integers, but the fast path only activates for objects whose type is exactly int. bool doesn't count. This is the line that checks that items are of type int:

if (PyLong_CheckExact(item)) {

Once we make the final change, changing True to 1:

In [73]: def f4(x):
   ....:     return sum([1 for c in str(x) if c in '02468'])
   ....: 
In [74]: %timeit f4(x)
10000 loops, best of 5: 33.3 µs per loop

we see one last small speedup.
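
As a rough Python-level analogy for that exact-type check (the `is_exact_int` helper is hypothetical and only mirrors the idea; it is not how sum is implemented):

```python
# bool is a subclass of int, so summing booleans still gives the right answer...
assert isinstance(True, int)
assert sum([True, False, True]) == 2


def is_exact_int(item):
    """Hypothetical stand-in for the idea behind PyLong_CheckExact."""
    # An exact-type check accepts int itself but rejects subclasses like bool.
    return type(item) is int


print(is_exact_int(1))     # True
print(is_exact_int(True))  # False
```

So booleans are summed correctly either way; they just go through the slower general-purpose addition.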


So after all that, do we beat the explicit loop?

In [75]: def explicit_loop(x):
   ....:     count = 0
   ....:     for c in str(x):
   ....:         if c in '02468':
   ....:             count += 1
   ....:     return count
   ....: 
In [76]: %timeit explicit_loop(x)
10000 loops, best of 5: 32.7 µs per loop

Nope. We've roughly broken even, but we're not beating it. The big remaining problem is the list. Building it is expensive, and sum has to go through the list iterator to retrieve elements, which has its own cost (though I think that part is pretty cheap). Unfortunately, as long as we're going through the test-digits-and-call-sum approach, we don't have any good way to get rid of the list. The explicit loop wins.

Can we go further anyway? Well, we've been trying to bring the sum closer to the explicit loop so far, but if we're stuck with this dumb list, we could diverge from the explicit loop and just call len instead of sum:

def f5(x):
    return len([1 for c in str(x) if c in '02468'])

Testing digits individually isn't the only way we can try to beat the loop, either. Diverging even further from the explicit loop, we can also try str.count. str.count iterates over a string's buffer directly in C, avoiding a lot of wrapper objects and indirection. We need to call it 5 times, making 5 passes over the string, but it still pays off:

def f6(x):
    s = str(x)
    return sum(s.count(c) for c in '02468')
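
Since these variants diverge quite a bit from the original loop, it's worth a quick sanity check that they still agree. This is a small sketch reusing the definitions from above; the `results` dict is just for the comparison.

```python
def f1(x):
    return sum(c in '02468' for c in str(x))


def f4(x):
    return sum([1 for c in str(x) if c in '02468'])


def f5(x):
    return len([1 for c in str(x) if c in '02468'])


def f6(x):
    s = str(x)
    return sum(s.count(c) for c in '02468')


def explicit_loop(x):
    count = 0
    for c in str(x):
        if c in '02468':
            count += 1
    return count


x = int('1234567890' * 50)

# Every variant should report the same number of even digits.
results = {f.__name__: f(x) for f in (f1, f4, f5, f6, explicit_loop)}
print(results)
assert len(set(results.values())) == 1
```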

Unfortunately, this is the point when the site I was using for timing stuck me in the "tarpit" for using too many resources, so I had to switch sites. The following timings are not directly comparable with the timings above:

>>> import timeit
>>> def f(x):
...     return sum([1 for c in str(x) if c in '02468'])
... 
>>> def g(x):
...     return len([1 for c in str(x) if c in '02468'])
... 
>>> def h(x):
...     s = str(x)
...     return sum(s.count(c) for c in '02468')
... 
>>> x = int('1234567890'*50)
>>> timeit.timeit(lambda: f(x), number=10000)
0.331528635986615
>>> timeit.timeit(lambda: g(x), number=10000)
0.30292080697836354
>>> timeit.timeit(lambda: h(x), number=10000)
0.15950968803372234
>>> def explicit_loop(x):
...     count = 0
...     for c in str(x):
...         if c in '02468':
...             count += 1
...     return count
... 
>>> timeit.timeit(lambda: explicit_loop(x), number=10000)
0.3305045129964128