Why is readline() much slower than readlines() in Python?

Just for fun, I wrote a bunch of functions that iterate over a file and put each line into a list:

#!/usr/bin/python

def readlines():
    with open("sorted_output.txt") as f:
        lines = f.readlines()

def readline():
    with open("sorted_output.txt") as f:
        line = f.readline()
        lines = []
        while line:
            lines.append(line)
            line = f.readline()

def iterate():
    with open("sorted_output.txt") as f:
        lines = []
        for line in f:
            lines.append(line)

def comprehension():
    with open("sorted_output.txt") as f:
        lines = [line for line in f]

Here is how each of them performed on a file with 69,073 lines, using Python 2.6 (these results may differ on newer versions of Python):

dano@hostname:~> python -mtimeit -s 'import test' 'test.readline()'
10 loops, best of 3: 78.3 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.readlines()'
10 loops, best of 3: 21.6 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.comprehension()'
10 loops, best of 3: 23.6 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.iterate()'
10 loops, best of 3: 33.3 msec per loop

So readlines() is the fastest here, though building the list with a list comprehension nearly matches it (23.6 msec versus 21.6 msec), while calling readline() in a loop is more than three times slower (78.3 msec). My guess is that the speed differences between the approaches are mostly the result of the high overhead of function calls in Python (the more function calls required, the slower the approach), but there may be other factors as well. Hopefully someone more knowledgeable than me can comment on that.
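If you want to repeat the comparison on a newer Python version without my sorted_output.txt, a rough sketch along these lines should work. The helper names (make_sample_file, benchmark, and the with_* functions) and the generated file contents are assumptions made up for this example, not part of the test module above, and the numbers will vary by machine and Python version.

```python
import os
import tempfile
import timeit

def make_sample_file(num_lines=1000):
    """Write a throwaway text file and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for i in range(num_lines):
            f.write("line %d\n" % i)
    return path

def with_readlines(path):
    # One method call returns the whole list.
    with open(path) as f:
        return f.readlines()

def with_readline(path):
    # One explicit method call per line.
    lines = []
    with open(path) as f:
        line = f.readline()
        while line:
            lines.append(line)
            line = f.readline()
    return lines

def with_iteration(path):
    # Iterator protocol plus one append() call per line.
    lines = []
    with open(path) as f:
        for line in f:
            lines.append(line)
    return lines

def with_comprehension(path):
    # Iterator protocol, with the appends handled by the comprehension.
    with open(path) as f:
        return [line for line in f]

def benchmark(path, repeat=3, number=5):
    """Return {function name: best seconds per call} for each approach."""
    approaches = [with_readline, with_readlines, with_comprehension, with_iteration]
    results = {}
    for func in approaches:
        timings = timeit.repeat(lambda: func(path), repeat=repeat, number=number)
        results[func.__name__] = min(timings) / number
    return results

if __name__ == "__main__":
    sample = make_sample_file(69073)  # same line count as the file timed above
    try:
        for name, seconds in sorted(benchmark(sample).items(), key=lambda kv: kv[1]):
            print("%-20s %.2f msec per call" % (name, seconds * 1000))
    finally:
        os.remove(sample)
```

All four functions return the same list, so any difference in the timings comes from how the lines are fetched and collected rather than from the result.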

In addition to performance, one other important consideration when deciding which of these methods to use is memory cost. readlines() reads the entire file into memory at once. With a huge file, that can cause serious performance problems or crash the program altogether. In those cases, you'd want to loop over the file object as iterate() does (without appending every line to a list), since that only holds one line in memory at a time. When you're just doing some kind of processing on each line and then throwing it away, this is usually the way to go, even if it is slightly slower than readlines(), because you don't take the same memory hit. Of course, if your goal in the end is to store the entire file in a Python list, you're going to pay that memory cost anyway, so readlines() will work fine.
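As a minimal sketch of the "process each line and throw it away" case, something like the following keeps only the current line and a counter in memory. The function name count_matching_lines and the sample data are hypothetical, chosen just for illustration.

```python
import os
import tempfile

def count_matching_lines(path, needle):
    """Count lines containing `needle` without storing the lines in a list."""
    count = 0
    with open(path) as f:
        for line in f:  # the file object hands out one line at a time
            if needle in line:
                count += 1
    return count

if __name__ == "__main__":
    # Build a small throwaway file so the example runs on its own.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("apple\nbanana\napple pie\n")
    try:
        print(count_matching_lines(path, "apple"))  # prints 2
    finally:
        os.remove(path)
```

Because nothing is accumulated, memory use does not grow with the number of lines, only with the length of the longest single line.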
