Why is my Sieve of Eratosthenes so slow?

The Sieve of Eratosthenes looks like this:

def sieve(n):
    primality_flags = [True]*(n+1)
    primality_flags[0] = primality_flags[1] = False
    primes = []
    for i, flag in enumerate(primality_flags):
        if flag:
            primes.append(i)
            for j in xrange(2*i, n+1, i):
                primality_flags[i] = False
    return primes

It processes each number once when the outer loop reaches it, and once for every prime that divides it. About 1/2 the numbers are divisible by 2, about 1/3 are divisible by 3, and so on; asymptotically speaking, the average number of times each number will be processed is 1 + the sum of the reciprocals of the primes up to n. This sum is about log(log(n)), so the sieve has asymptotic time complexity O(n*log(log(n))), assuming arithmetic is constant time. This is really good.


Your function doesn't do that. Your filter goes over every element in numbers, regardless of whether it's divisible by prime. Each element is processed for every prime up until the first prime that divides it, and processing the prime p removes about 1/p of the elements of numbers. Letting the sequence of primes be p[0], p[1], p[2], etc., and letting the sequence of sizes of numbers be n[0], n[1], n[2], etc., we have the following approximate recurrence:

n[0] = upperBound - 1
n[1] = n[0] * (p[0]-1)/p[0]
n[2] = n[1] * (p[1]-1)/p[1]
...
n[k+1] = n[k] * (p[k]-1)/p[k]

and your algorithm takes time roughly proportional to the sum of the n values up until numbers is empty. I haven't analyzed the behavior of that series, but calculations show the growth is much worse than O(n*log(log(n))). (EDIT: An analysis I didn't come up with when composing this answer says it's O((n/log(n))^2).)


Running cProfile shows that most of the time is spent in the filter. Replacing the filter with a list comprehension speeds things up a by about a factor of 2.

numbers = [n for n in numbers if n%prime != 0]

But this doesn't really fix the main problem, which is that you are are recreating the list of numbers with each iteration, and that is slow. The faster implementations http://groups.google.com/group/comp.lang.python/msg/f1f10ced88c68c2d just mark the non primes by replacing them with 0 or similar.