How to parse this custom log file in Python

Using @Joran Beasley's answer I came up with the following solution and it seems to work:

Main Points:

  • My log files ALWAYS follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed them. For example, the {Date} is always 23 characters and I only want the first 19 characters.
  • Using line.startswith("2015") is crazy, as dates will change eventually, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could get specific.
  • The file is passed to the first function, "generateDicts()", which calls the "matchDate()" function to see IF the line being processed starts with the {Date} format I am looking for.
  • A NEW dict is created every time a valid {Date} format is found, and every following line is appended to that entry's text until the NEXT valid {Date} is encountered.

Function to split up the log files.

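def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        # matchDate() returns the leading date string, or "NONE" if there is none
        if line.startswith(matchDate(line)):
            if currentDict:
                yield currentDict
            # The date itself contains two "-", so the type is field 3 and the
            # message is everything after the 4th "-" (hyphens in the message survive)
            fields = line.split("-", 4)
            currentDict = {"date": line[:19], "type": fields[3], "text": fields[-1]}
        else:
            # continuation line: append it to the current entry's text
            currentDict["text"] += line
    yield currentDict

# matchDate() must already be defined when this runs
with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))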

Function to see if the line being processed starts with a {Date} that matches the format I am looking for:

import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" date if the line starts with one,
    # otherwise the sentinel string "NONE"
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        return matched.group()
    return "NONE"
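
For anyone who wants to try the two functions together without the real log file, here is a minimal, self-contained sketch of the same idea. It is a variant rather than the exact code above: the names generate_dicts and DATE_RE are my own, the sample lines are made up, and it assumes every dated line contains at least two " - " separators.

```python
import io
import re

# Same date pattern as matchDate() above, compiled once.
DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def generate_dicts(log_fh):
    """Yield one dict per log entry; continuation lines are appended to "text"."""
    current = None
    for line in log_fh:
        if DATE_RE.match(line):
            if current is not None:
                yield current
            # {Date} - {Type} - {Message}: split on the first two " - " only,
            # so hyphens inside the message are preserved.
            date, log_type, text = line.split(" - ", 2)
            current = {"date": date[:19], "type": log_type, "text": text}
        elif current is not None:
            # Line without a leading date: part of the previous entry.
            current["text"] += line
    if current is not None:
        yield current

# Made-up sample data, not taken from the real log file.
sample = io.StringIO(
    "2015-05-22 16:46:36,123 - INFO - Starting report\n"
    "2015-05-22 16:46:37,456 - ERROR - Something failed\n"
    "Traceback (most recent call last):\n"
    "  File \"report.py\", line 1, in <module>\n"
    "2015-05-22 16:46:38,789 - INFO - Done - exiting\n"
)
entries = list(generate_dicts(sample))
for entry in entries:
    print(entry["date"], entry["type"], repr(entry["text"]))
```

Starting from None instead of an empty dict means lines that appear before the first dated line are skipped rather than raising a KeyError, and an empty file yields nothing.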

Create a generator (I'm on a generator bend today):

def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # adjust the indices to your format: split("-") also splits on any "-" inside the date
            currentDict = {"date": line.split("-")[0], "type": line.split("-")[2], "text": line.split("-")[-1]}
        else:
            currentDict["text"] += line
    yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))

There may be a few minor typos; I didn't actually run this.


You can get the fields you are looking for directly from the regex using groups. You can even name them:

>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
... 
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'
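
The same named-group approach could probably be stretched over the whole log line, so the date, type and message come out of one match. This is only a sketch: it assumes the "{Date} - {Type} - {Message}" layout described earlier in the thread, a type without spaces, and an optional ",123" or ".123" fraction after the seconds. The names LINE_RE and parse_line are made up for illustration.

```python
import re

# Assumed layout: "{Date} - {Type} - {Message}", where the date looks like
# "2015-05-22 16:46:36,123". The group names are illustrative.
LINE_RE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[,.]?\d*'
    r' - (?P<type>\S+) - (?P<text>.*)'
)

def parse_line(line):
    """Return a dict of the named groups, or None if the line has no date prefix."""
    found = LINE_RE.match(line)
    return found.groupdict() if found else None

parsed = parse_line("2015-05-22 16:46:36,123 - INFO - Starting report")
print(parsed)
print(parse_line("  continuation line without a date"))
```

A None result marks a continuation line, which is the same signal the generator in the other answers needs to decide whether to start a new dict.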

Then create a date class whose constructor's keyword arguments match the group names. Use a little ** magic to create an instance of the object directly from the regex groupdict, and you are cooking with gas. In the constructor you can then figure out whether 2016 is a leap year and Feb 29 exists.
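
A rough sketch of what such a class might look like. The class name LogDate and its attribute names are hypothetical, and it assumes a four-digit year; the day check leans on calendar.monthrange from the standard library, which already knows about leap years.

```python
import calendar
import re

DATE_RE = re.compile(
    r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) '
    r'(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)'
)

class LogDate(object):
    """Hypothetical date class whose keyword arguments mirror the group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        self.year = int(a_year)
        self.month = int(a_month)
        self.day = int(a_day)
        self.hour = int(an_hour)
        self.minute = int(a_minute)
        self.second = float(a_second)
        # calendar.monthrange returns (weekday of first day, days in month),
        # which accounts for leap years.
        days_in_month = calendar.monthrange(self.year, self.month)[1]
        if not 1 <= self.day <= days_in_month:
            raise ValueError("day %d does not exist in %04d-%02d"
                             % (self.day, self.year, self.month))

found = DATE_RE.match('2016-02-29 12:34:56.789')
# ** unpacks the groupdict into the matching keyword arguments.
stamp = LogDate(**found.groupdict())
print(stamp.year, stamp.month, stamp.day, stamp.second)
```

With this sketch, '2015-02-29 00:00:00' still matches the regex but raises ValueError in the constructor, because 2015 is not a leap year.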

-lrm