Merging and sorting log files in Python

As for the critical sorting function:

def sort_key(line):
    # parse the bracketed timestamp at the start of the line into a datetime
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

Use this as the key argument to sort() or sorted(), not as cmp: a key function is called once per line, while a cmp function is called for every comparison, so key is faster.

Oh, and you should have

from datetime import datetime

in your code to make this work.
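A minimal sketch of how that might be wired together, assuming the log file names are passed on the command line and every line starts with a bracketed timestamp in the format above:

import sys
from datetime import datetime

def sort_key(line):
    # parse the leading "[...]" timestamp into a datetime for sorting
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

lines = []
for name in sys.argv[1:]:   # log file names given as command-line arguments
    with open(name) as f:
        lines.extend(f.readlines())

for line in sorted(lines, key=sort_key):
    print line,   # trailing comma: each line already ends with a newline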


First off, you will want to use the fileinput module for getting data from multiple files, like:

import fileinput

data = fileinput.FileInput()
for line in data:   # FileInput objects are iterated line by line; they have no readlines() method
    print line,

This prints all of the lines from all of the files, one after another. You also want to sort, which you can do with the built-in sorted() function.

If your lines started with a timestamp like [2011-07-20 19:20:12], you would be golden, since that format sorts correctly as plain text and needs nothing beyond an alphanumeric sort, so you could just do:

data = fileinput.FileInput()
for line in sorted(data):
    print line,

Since your timestamps are in a more complex format, however, you need to parse them yourself, along these lines:

from datetime import datetime

def compareDates(line1, line2):
    # parse the bracketed timestamps into datetime objects
    fmt = '[%a %b %d %H:%M:%S %Y'
    date1 = datetime.strptime(line1.split(']')[0], fmt)
    date2 = datetime.strptime(line2.split(']')[0], fmt)
    # then use those for the sorting
    return cmp(date1, date2)

data = fileinput.FileInput()
for line in sorted(data, cmp=compareDates):
    print line,

For bonus points, you can even do

data = fileinput.FileInput(openhook=fileinput.hook_compressed)

which will enable you to read in gzipped log files.

The usage would then be:

$ python yourscript.py access.log.1 access.log.*.gz

or similar.
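A rough sketch of how these pieces could be combined, using a key function to parse the bracketed timestamps (the name sort_key and the timestamp format are assumptions carried over from above):

import fileinput
from datetime import datetime

def sort_key(line):
    # assumed format, e.g. "[Wed Jul 20 19:20:12 2011] message"
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

# hook_compressed transparently opens .gz and .bz2 files named on the command line
data = fileinput.FileInput(openhook=fileinput.hook_compressed)
for line in sorted(data, key=sort_key):
    print line,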


You can do this:

import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log'] # names of log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y' # format of time stamps
t_pat = re.compile(r'\[(.+?)\]') # pattern to extract timestamp
for l in sorted(lines, key=lambda line: strptime(t_pat.search(line).group(1), t_fmt)):
    print l,
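Assuming log lines of the hypothetical form [Wed Jul 20 19:20:12 2011] some message spread across 1.log and 2.log, this prints every line from both files in timestamp order.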