How to remove all HTML tags from a downloaded page

I can also recommend BeautifulSoup, which is an easy-to-use HTML parser. With it you would do something like:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

# html is the page source as a string
soup = BeautifulSoup(html, 'html.parser')
all_text = soup.get_text()

This way you get all the text from an HTML document.


Try this:

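import re

# Non-greedy match for anything tag-like: "<", any characters, then ">".
# re.DOTALL lets "." match newlines, so tags that span lines are removed too.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def remove_html_tags(data):
    # Replace every tag-like span with an empty string.
    return TAG_RE.sub('', data)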
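The regular expression above removes anything that looks like a tag, but it keeps the text inside script and style elements and leaves entities such as &amp; undecoded. If that matters, one possible alternative is the standard library's html.parser module. The following is only a minimal sketch under those assumptions; the names TextExtractor and strip_tags are hypothetical and made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

For example, strip_tags("<p>Hello <b>world</b></p>") returns "Hello world". Whitespace is kept exactly as it appears in the source, so you may still want to normalize it afterwards.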

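There's a great Python library called bleach. The call below removes all HTML tags and leaves everything else in place. Note that it keeps the text inside elements that are not displayed, such as script and style; only the tags themselves are stripped. On bleach versions before 5.0 you can also pass styles=[]; newer releases no longer accept that argument.

import bleach

bleach.clean(thestring, tags=[], attributes={}, strip=True)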

Tags:

Python