How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

TL;DR: No, this is currently not easily possible in BeautifulSoup (you would need to modify the BeautifulSoup and SoupStrainer objects).

Explanation:

The problem is that the function you pass to the strainer gets called from the handle_starttag() method. At that point only the contents of the opening tag are available (i.e. the element name and its attrs).

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

As you can see, if your strainer function returns False, the element is discarded immediately, with no chance to take its inner text into consideration (unfortunately).
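To see why the inner text is out of reach at that point, here is a minimal sketch using Python's standard html.parser.HTMLParser, the event-driven parser behind BeautifulSoup's "html.parser" builder. The EventLogger class and the sample markup are made up for illustration; this is not BeautifulSoup's own code, only a demonstration of the order in which parse events arrive.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and its attributes are known here.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The inner text arrives later, as a separate event.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<span class="score">my text</span>')
logger.close()

for event in logger.events:
    print(event)
```

The start-tag event fires before the data event, so any keep-or-discard decision made in the start-tag handler cannot depend on the text.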

On the other hand, if you pass a text argument to the strainer:

SoupStrainer(text="my text")

it will search the text inside the tags, but that search has no access to the element or its attributes. You can see the irony :/

Combining the two will just find nothing. And you can't even access the parent, as is done in the find function shown here: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on elements and attrs. You would need to change a lot of Beautiful Soup code to get this working.

If you really need this, I suggest subclassing BeautifulSoup and SoupStrainer and modifying their behavior.

It looks like you are trying to loop over the soup elements inside the my_custom_strainer method.

To do that, parse the document first and then call the function on the result:

soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)

Then modify my_custom_strainer slightly, along these lines:

import re

def my_custom_strainer(soup, attrs):
  # Print the attributes that were passed in
  for attr in attrs:
    print("attr:" + attr + "=" + str(attrs[attr]))
  # Walk the parsed tree, where the text inside each tag is available
  for d in soup.find_all(['div', 'span']):
    if d.name == 'span' and 'class' in attrs and attrs['class'] == "score":
      return d.text # meet your needs here
    elif d.name == 'span' and re.search("my text", d.text):
      return d.text # meet your needs here

This way you can access the soup objects iteratively.
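As a self-contained illustration of this parse-then-filter approach, here is a minimal sketch. The sample markup, the "score" class, and the definition of article_stat_page_strainer as SoupStrainer("span") are assumptions made for the example; the original strainer is not shown in the question.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Sample markup, made up for illustration.
html = (
    '<p>skipped by the strainer</p>'
    '<div><span class="score">42</span></div>'
    '<div><span class="label">contains my text here</span></div>'
    '<div><span class="label">something else</span></div>'
)

# Step 1: the strainer filters on the element name only,
# which is all it can see at parse time (assumed definition).
article_stat_page_strainer = SoupStrainer("span")
soup = BeautifulSoup(html, features="html.parser",
                     parse_only=article_stat_page_strainer)

# Step 2: apply the class and text conditions after parsing,
# when the inner text of each tag is available.
pattern = re.compile("my text")
matches = [
    span.get_text()
    for span in soup.find_all("span")
    if "score" in span.get("class", []) or pattern.search(span.get_text())
]
print(matches)
```

The strainer still saves work by skipping everything that is not a span, while the text test runs on the much smaller parsed result.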

I recently created an lxml / BeautifulSoup parser for HTML files, which also searches between specific tags.

The function I wrote opens a file dialog and lets you select the specific HTML file to parse.

def openFile(self):
    # Assumes QFileDialog (e.g. from PyQt5.QtWidgets) and BeautifulSoup (from bs4) are already imported
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        # Read the selected file and close it afterwards
        with open(fileName) as file:
            data = file.read()
        soup = BeautifulSoup(data, "lxml")
        # Collect the numeric text of every <strong> tag
        results = []
        for item in soup.find_all('strong'):
            results.append(float(item.text))
        print('Score =', results[1])
        print('Fps =', results[0])

You can see that the tag I specified was 'strong', and I was looking for the text inside that tag.
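To try the parsing step on its own, without the file dialog, a minimal sketch could look like this. The extract_strong_values helper and the sample markup are hypothetical, and "html.parser" is used instead of "lxml" only so that the example has no extra dependency.

```python
from bs4 import BeautifulSoup


def extract_strong_values(data):
    """Hypothetical helper: return the text of every <strong> tag as a float."""
    soup = BeautifulSoup(data, "html.parser")
    values = []
    for item in soup.find_all("strong"):
        try:
            values.append(float(item.get_text(strip=True)))
        except ValueError:
            # Skip <strong> tags whose text is not a number.
            continue
    return values


# Sample markup, made up for illustration.
sample = (
    "<p>Fps: <strong>59.9</strong></p>"
    "<p>Score: <strong>1234</strong></p>"
    "<strong>n/a</strong>"
)
results = extract_strong_values(sample)
print("Fps =", results[0])
print("Score =", results[1])
```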

Hope this helps in some way.
