Use lxml to parse text file with bad header in Python

Given that there's a standard for these files, it's possible to write a proper parser rather than guessing at things or hoping BeautifulSoup gets things right. That doesn't mean it's the best answer for you, but it's certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf, what you've got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, "edgar.dtd".

The first thing I'd do is install SP and use its tools to check that the documents really are valid and parseable with that DTD, so you don't waste a bunch of time on something that isn't going to pan out.

Python comes with an SGML parser, sgmllib. Unfortunately, it was never quite finished (it only handles SGML as far as HTML uses it, and it doesn't validate against a DTD), and it's deprecated in 2.6-2.7 (and removed in 3.x). But that doesn't mean it won't work. So, try it and see if it works.
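On Python 3, where sgmllib no longer exists, one stand-in is html.parser.HTMLParser from the standard library. This is a swapped-in technique, not sgmllib itself: it is an event-driven, non-validating parser that knows nothing about the EDGAR DTD. A minimal sketch, assuming a document header shaped like the sample below (the sample text and the class name DocumentHeaderParser are made up for illustration):

```python
from html.parser import HTMLParser


class DocumentHeaderParser(HTMLParser):
    """Collect the text that follows a few unclosed header tags."""

    FIELDS = ("type", "sequence", "filename")

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        self._current = tag if tag in self.FIELDS else None

    def handle_data(self, data):
        # Keep only the first non-blank text after each wanted tag.
        if self._current and data.strip():
            self.metadata.setdefault(self._current, data.strip())
            self._current = None


# Hypothetical sample, not taken from a real filing.
sample = (
    "<DOCUMENT>\n"
    "<TYPE>10-Q\n"
    "<SEQUENCE>1\n"
    "<FILENAME>d371464d10q.htm\n"
    "<TEXT>\n"
    "<html><body>hi</body></html>\n"
    "</TEXT>\n"
    "</DOCUMENT>\n"
)

parser = DocumentHeaderParser()
parser.feed(sample)
parser.close()
print(parser.metadata)
```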

If not, I don't know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap up any C or C++ library (I'd start with SP) pretty easily, as long as you're comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap up the top-level functions, not build a complete set of bindings. But if you've never done anything like this before, it's probably not the best time to learn.

Alternatively, you can wrap up a command-line tool. SP comes with nsgmls. There's another good tool written in Perl with the same name (I think it's part of http://savannah.nongnu.org/projects/perlsgml/ but I'm not positive), and there are dozens of other tools.
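Wrapping a command-line tool from Python can be as small as the sketch below. The helper name run_sgml_tool is made up, and the example command in the comment assumes that nsgmls is on your PATH and that the file carries a DOCTYPE declaration pointing at your edgar.dtd; check the documentation of whichever tool you install.

```python
import subprocess


def run_sgml_tool(command, sgml_path):
    """Run an external SGML tool on a file.

    Returns (ok, messages): ok is True when the tool exits with status 0,
    and messages is whatever the tool wrote to stderr.
    """
    try:
        proc = subprocess.run(
            list(command) + [sgml_path],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        return False, "command not found: %s" % command[0]
    return proc.returncode == 0, proc.stderr


# Example call (assumption: SP's nsgmls, where -s suppresses the normal
# output so that only error messages are printed):
#
#     ok, messages = run_sgml_tool(["nsgmls", "-s"], "filing.sgml")
#     if not ok:
#         print(messages)
```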

Or, of course, you could write the whole thing, or just the parsing layer, in Perl (or C++) instead of Python.


You can easily get to the encapsulated text of the PEM (Privacy-Enhanced Message, specified in RFC 1421) by stripping the encapsulation boundaries and splitting everything in between into header and encapsulated text at the first blank line.

The SGML parsing is much more difficult. Here's an attempt that seems to work with a document from EDGAR:

from lxml import html

PRE_EB = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
POST_EB = "-----END PRIVACY-ENHANCED MESSAGE-----"

def unpack_pem(pem_string):
    """Takes a PEM encapsulated message and returns a tuple
    consisting of the header and encapsulated text.  
    """

    # Strip once so both boundary checks see the same text.
    pem_string = pem_string.strip()
    if not pem_string.startswith(PRE_EB):
        raise ValueError("Invalid PEM encoding; must start with %s"
                         % PRE_EB)
    if not pem_string.endswith(POST_EB):
        raise ValueError("Invalid PEM encoding; must end with %s"
                         % POST_EB)
    msg = pem_string[len(PRE_EB):-len(POST_EB)]
    # The header ends at the first blank line.
    header, encapsulated_text = msg.split('\n\n', 1)
    return (header, encapsulated_text)


filename = 'secdoc_htm.txt'
with open(filename, 'r') as f:
    data = f.read()

header, encapsulated_text = unpack_pem(data)

# Now parse the SGML
root = html.fromstring(encapsulated_text)
document = root.xpath('//document')[0]

metadata = {}
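# Use relative paths ('.//') so the lookups stay inside this <DOCUMENT>;
# a leading '//' would search the whole tree from the root.
metadata['type'] = document.xpath('.//type')[0].text.strip()
metadata['sequence'] = document.xpath('.//sequence')[0].text.strip()
metadata['filename'] = document.xpath('.//filename')[0].text.strip()

inner_html = document.xpath('.//text')[0]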

print(metadata)
print(inner_html)

Result:

{'filename': 'd371464d10q.htm', 'type': '10-Q', 'sequence': '1'}

<Element text at 80d250c>
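The header that unpack_pem returns isn't used above. PEM header fields are written in the RFC 822 "Name: value" style, so the standard library's email parser can probably split them for you. A minimal sketch, assuming a header like the sample below (the sample text and the helper name parse_pem_header are illustrative, not taken from a real filing):

```python
from email.parser import HeaderParser

# Hypothetical header text, shaped like what unpack_pem might return.
sample_header = (
    "\n"
    "Proc-Type: 2001,MIC-CLEAR\n"
    "Originator-Name: webmaster@www.sec.gov\n"
)


def parse_pem_header(header):
    """Return the header fields as a list of (name, value) pairs."""
    # A leading blank line would be read as "end of headers", so strip first.
    message = HeaderParser().parsestr(header.strip())
    return message.items()


# Note: a dict keeps only the last value if a field name repeats.
fields = dict(parse_pem_header(sample_header))
print(fields["Proc-Type"])
```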
