How to parse broken HTML with LXML

Don't just construct that parser, use it (as per the example you link to):

>>> from io import StringIO  # Python 3; on Python 2 use StringIO.StringIO
>>> tree = etree.parse(StringIO(broken_html), parser=parser)
>>> tree
<lxml.etree._ElementTree object at 0x2fd8e60>
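
For reference, here is a self-contained sketch of the same call, assuming Python 3 and that `parser` in the question was an `etree.HTMLParser()`. The `broken_html` string is the one used further down; `root` and `serialized` are names chosen only for illustration. How exactly the markup gets repaired can differ between libxml2 versions, so no output is shown.

```python
from io import StringIO

from lxml import etree

# Sample input with unclosed and mismatched tags.
broken_html = "<html><head><title>test<body><h1>page title</h3>"

# Assumption: the parser from the question is a plain HTMLParser.
parser = etree.HTMLParser()

# The parser only takes effect once it is passed to the parse call.
tree = etree.parse(StringIO(broken_html), parser=parser)
root = tree.getroot()

# Serialize the repaired tree to inspect what the parser made of the input.
serialized = etree.tostring(root, pretty_print=True, method="html").decode()
print(serialized)
```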

Or use lxml.html as a shortcut:

>>> from lxml import html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> html.fromstring(broken_html)
<Element html at 0x2dde650>
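
To check what the shortcut actually recovered, you can walk the returned element or serialize it again. This is a rough sketch, assuming Python 3; the `messy` sample string and the variable names are made up for illustration and are not from the question.

```python
from lxml import html

# Illustrative input (hypothetical): neither <p> tag is closed.
messy = "<div><p>first<p>second</div>"

# html.fromstring() uses lxml's lenient HTML parser, so this does not raise.
root = html.fromstring(messy)

# Collect the text of every <p> element the parser produced.
paragraphs = [p.text for p in root.iter("p")]

# Serialize the repaired tree back to markup (bytes by default).
repaired = html.tostring(root)

print(root.tag, paragraphs)
print(repaired)
```

For a fragment like this one, `html.fromstring` returns the single top-level element rather than a full `<html>` document.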

lxml lets you load broken XML or HTML by creating a parser instance with recover=True:

etree.HTMLParser(recover=True)

You can pass the same option when you create your own parser, then hand that parser to the parse call so it is actually used.
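
A small sketch of the difference `recover=True` makes for XML, assuming Python 3. The `broken_xml` string and the names `strict_ok` and `lenient` are hypothetical and only serve the illustration. Note that `etree.HTMLParser` already enables recovery by default, so the flag matters most for `etree.XMLParser`.

```python
from lxml import etree

# Illustrative broken XML (hypothetical): <a> is never closed.
broken_xml = "<root><a>unclosed</root>"

# The default XML parser is strict and rejects the input.
try:
    etree.fromstring(broken_xml)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A parser created with recover=True keeps whatever it could parse.
lenient = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser=lenient)

print(strict_ok, root.tag, [child.tag for child in root])
```

Recovery is best effort: the tree you get back reflects the parser's guess at the intended structure, so it is worth inspecting before relying on it.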
