How to remove tags that have no content

If your goal is to keep only the textual elements, how about the following approach? It removes every element that contains no text, which would include images, so add any tags that must not be removed, such as br or img, to the list of excluded names.

Which tags to exclude really depends on what structure you want to remain.

from bs4 import BeautifulSoup

html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
<i>hello world</i>
"""
soup = BeautifulSoup(html_object, "lxml")

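# find_all() returns a list, so removing tags while looping over it is safe
for x in soup.find_all():
    # Drop tags with no text, except the names that are legitimately empty
    if len(x.get_text(strip=True)) == 0 and x.name not in ['br', 'img']:
        x.extract()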

print(soup)

Giving:

<html><body>
<i>hello world</i>
</body></html>

Here is a way to remove every tag whose .string is None, which covers empty tags as well as tags with more than one child:

>>> html = soup.findAll(lambda tag: tag.string is None)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())

The output is an empty string for your example, since none of its tags has any content.
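To see what the tag.string filter actually matches, here is a small sketch. The sample markup and the variable names are made up for illustration, and it assumes the built-in "html.parser" rather than lxml. A tag's .string is None both when the tag is empty and when it has several children, so the filter can catch tags that do contain text.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one paragraph with mixed children, one empty paragraph
html = "<div><p>some <b>bold</b> text</p><p></p></div>"
soup = BeautifulSoup(html, "html.parser")

mixed, empty = soup.find_all("p")

print(repr(mixed.string))    # None: the tag has several children
print(repr(mixed.b.string))  # 'bold': exactly one string child
print(repr(empty.string))    # None: the tag has no children
print(mixed.get_text())      # some bold text

# Names of the tags the filter would select for removal
matches = [tag.name for tag in soup.find_all(lambda tag: tag.string is None)]
print(matches)               # ['div', 'p', 'p']
```

Because the first paragraph and the enclosing div are matched as well, this filter is only a good fit when every tag you want to keep holds a single piece of text.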


If you only want to remove tags that are completely empty, like <o:p></o:p>, and keep the styled tags around them, there's another way:

>>> html = soup.findAll(lambda tag: not tag.contents)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())

Output:

<i style="mso-bidi-font-style:normal">
 <span style="font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial">
 </span>
</i>

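The span and i tags are kept even though they end up with no content: when the search ran, each of them still contained a child tag, so tag.contents was not empty. Their attributes play no part in the test.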
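If the wrapper tags should go as well once they have been emptied, one option is to repeat the search until nothing matches. This is only a sketch: remove_childless_tags and its keep parameter are hypothetical names, the markup is a shortened version of the example, and it assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup


def remove_childless_tags(soup, keep=("br",)):
    """Repeatedly remove tags that have no children until none are left."""
    while True:
        # Each search sees the tree as it is after the previous removals
        childless = soup.find_all(
            lambda tag: not tag.contents and tag.name not in keep
        )
        if not childless:
            break
        for tag in childless:
            tag.extract()
    return soup


html = (
    "<div><i style='mso-bidi-font-style:normal'>"
    "<span style='font-size:11.0pt'><o:p></o:p></span></i>"
    "<i>hello world</i></div>"
)
soup = remove_childless_tags(BeautifulSoup(html, "html.parser"))
result = str(soup)
print(result)  # <div><i>hello world</i></div>
```

The first pass removes o:p, the second removes the now-empty span, and the third removes the outer i. A tag that still holds a whitespace string is not childless, so it survives this version.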


Filters that only test for empty contents have a slight problem: they also remove the <br> element, which is always empty but crucial for the structure of the HTML.

Keep all breaks

for x in soup.find_all(lambda tag: not tag.contents and tag.name != 'br'):
    x.decompose()

Source

<p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>

Also remove elements that contain only whitespace

If you also want to remove tags that contain nothing but whitespace, you can do something like

# A tag with no contents has no text either, so one test covers both cases
for x in soup.find_all(lambda tag: not tag.get_text(strip=True) and tag.name != 'br'):
    x.decompose()

Source

<p><p>    </p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>
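One thing to watch with text-based filters: a wrapper such as <p><img></p> has no text, so it is removed together with the image even if img itself is excluded by name. A possible sketch that guards against this follows. VOID_ELEMENTS and is_blank are hypothetical names, the set lists the HTML void elements, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# HTML void elements: always empty by definition, so never treated as blank
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def is_blank(tag):
    """True for a non-void tag with no visible text and no void descendants."""
    if tag.name in VOID_ELEMENTS:
        return False
    # Keep wrappers such as <p><img></p>, which have no text but real content
    if tag.find(lambda t: t.name in VOID_ELEMENTS) is not None:
        return False
    return not tag.get_text(strip=True)


html = (
    "<div><p>    </p><p></p><p><img src='a.png'/></p>"
    "<strong>some<br/>text</strong></div>"
)
soup = BeautifulSoup(html, "html.parser")

# extract() only detaches a tag, so nested matches can be handled in turn
for tag in soup.find_all(is_blank):
    tag.extract()

result = str(soup)
print(result)
# <div><p><img src="a.png"/></p><strong>some<br/>text</strong></div>
```

The two blank paragraphs are dropped, while the paragraph that wraps the image and the br inside strong are left alone.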