How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4?

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'

See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

Yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be ordinary space characters instead, you'll have to do a Unicode string replacement.
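
To make that concrete, here is a minimal sketch of the conversion and the replacement. It assumes the bs4 package is installed and names the "html.parser" parser explicitly, which is my choice and not something the answer specifies. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# "html.parser" ships with Python; it is an assumption here, any parser
# supported by BeautifulSoup should give the same text content.
soup = BeautifulSoup("<div>a&nbsp;b</div>", "html.parser")

# The entity has already been converted to U+00A0 in the parsed text.
text = soup.div.get_text()
has_nbsp = u"\xa0" in text

# An ordinary string replacement turns it into a regular space.
cleaned = text.replace(u"\xa0", u" ")
```

After this runs, text still contains the non-breaking space while cleaned contains a regular space in its place.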


You can simply replace the non-breaking space character with a regular space.

nonBreakSpace = u'\xa0'
# .replace() is a string method, so take the text out of the soup first
text = soup.get_text()
text = text.replace(nonBreakSpace, ' ')

A benefit is that this is plain string replacement: it works on any Unicode string, so the replacement itself does not depend on BeautifulSoup.
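
As a rough illustration of doing this without BeautifulSoup, the sketch below uses html.unescape from the Python 3 standard library to decode the entity and then applies the same replacement. Using html.unescape is my substitution, not something the answer mentions. Note that it decodes every entity in the string, including ones like &lt;, so it suits text fragments better than whole documents.

```python
import html

raw = "<div>a&nbsp;b</div>"

# html.unescape (Python 3.4+) converts entities to Unicode characters,
# so &nbsp; becomes U+00A0.
decoded = html.unescape(raw)

# The same plain string replacement as above, with no parser involved.
normalized = decoded.replace("\xa0", " ")
```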