BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Short answer: soup.findAll(text=True)

This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green

The accepted answer is great but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
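
The snippet above joins every string in the whole document. If you want one result per paragraph, as in the older findAll('p') loop, something like the following should work. This is a minimal sketch; the variable name paragraphs is just for illustration.

```python
from bs4 import BeautifulSoup

txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# One entry per <p>: .strings yields every text node below the tag,
# no matter how many enclosing tags there are, and join() glues them together.
paragraphs = ["".join(p.strings) for p in soup.find_all("p")]

for text in paragraphs:
    print(text)
```

Each <p> ends up as a single string, so nested <i> or <b> tags do not split the text.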

I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.

# import the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# set up the BeautifulSoup object (the "lxml" parser requires the lxml package)
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")
p_tags = soup.find_all('p')

# get_text() already returns a string, so no str() call is needed
for each in p_tags:
    print(each.get_text())

Note that we loop over the list of <p> tags and call get_text() on each one before printing it. get_text() returns the tag's text with all nested markup removed, so only the text is printed.

Also:

  • in bs4, prefer find_all() over the older findAll() spelling, which is kept only for backward compatibility
  • in Python 3, urllib2 was split into urllib.request and urllib.error
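
Since urllib.error is where the exceptions now live, here is a rough sketch of how the fetch could be wrapped so that a failed request does not crash the script. The function name fetch_paragraph_texts and the 10-second timeout are my own choices for illustration, not part of any library, and I use the built-in "html.parser" so that lxml is not required.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_paragraph_texts(url, timeout=10):
    """Return the text of every <p> on the page, or [] if the fetch fails.

    Hypothetical helper for this sketch; the name and timeout are assumptions.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            html = response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("server returned an error status:", err.code)
        return []
    except URLError as err:
        print("could not reach the server:", err.reason)
        return []

    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]
```

You would call it as fetch_paragraph_texts("https://insertyourwebpage.com") and loop over the returned list.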

If the page contains the four sample paragraphs used in the answers above, your output should be:

  • Red
  • Blue
  • Yellow
  • Light green
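
If you want to check this without fetching a real page, you can feed the sample markup straight into BeautifulSoup. This is only a sketch: the html string stands in for the downloaded page, and the second part shows the optional separator and strip arguments of get_text(), which can help when nested tags have no whitespace between them.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the downloaded page (an assumption for this sketch).
html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node below the tag, however deeply nested.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)

# When nested tags touch each other, plain get_text() runs the words together.
tight = BeautifulSoup("<p>Light<b>green</b></p>", "html.parser").p
joined_plain = tight.get_text()

# With a separator and strip=True, each text node is stripped and then joined.
joined_spaced = tight.get_text(" ", strip=True)
```

Here texts holds the full text of each paragraph, including the part inside <b>.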

Hope this helps someone looking for an updated solution.