How to use Cleaner, lxml.html without returning div tag?

lxml expects your html to have a tree structure, ie a single root node. If it does not have one, it adds it.


Cleaner always wraps the result in an element. A good solution is to parse the HTML manually and send the resulting document object to cleaner- then the result is also a document object, and you can use text_content to extract the text from the root.

from lxml.html import document_fromstring
from lxml.html.clean import Cleaner
evil = "<script>malignus script</script><b>bold text</b><i>italic 
text</i>"
doc = document_fromstring(evil)
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
              page_structure=True)
print cleaner.clean_html(doc).text_content()

This can also be done as a one liner