Converting a text document with special format to Pandas DataFrame

Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.

import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>> 
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object

Walkthrough

SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.

After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

For instance, it assumes that each each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters such as \w.


You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ), 
    columns=["Id", "Term", "weight"]
)

print(df)
#  Id     Term weight
#0  4    frack  0.733
#1  4    shale  0.700
#2  4    space  0.645
#3  4  station  0.327
#4  4     nasa  0.258
#5  4   celebr  0.262
#6  4    bahar  0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: The * tuple unpacking is a python 3 feature.

Tags:

Python

Pandas