Searching The Parse Tree Using BeautifulSoup

Introduction

HTML (Hypertext Markup Language) consists of numerous tags and the data we need to extract lies inside those tags. Thus we need to find the right tags to extract what we need. Now, how do we find the right tags? We can do so with the help of BeautifulSoup's search methods.

Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:

  1.  find()
  2.  find_all()

The other methods are quite similar in terms of their usage. Therefore, we will be focusing on the find() and find_all() methods in this article.

The following example will be used throughout this document to demonstrate the concepts:

html_doc = """

<html><head><title>Searching Tree</title></head>
<body>
<h1>Searching Parse Tree In BeautifulSoup</h1>

<p class="Main">Learning 
<a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>

<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

Types Of Filters

Several kinds of filters can be passed into the find() and find_all() methods. It is crucial to understand them clearly, as they are used again and again throughout the search mechanism. A filter can match on:

  • a tag's name,
  • a tag's attributes,
  • the text of a string,
  • or a mix of these.

❖ A String

When we pass a string to a search method then Beautiful Soup performs a match against that passed string. Let us have a look at an example and find the <h1> tags in the HTML document:

print(soup.find_all('h1'))

Output:

[<h1>Searching Parse Tree In BeautifulSoup</h1>]

❖ A Regular Expression

Passing a regular expression object allows Beautiful Soup to filter results according to that regular expression. If you want to master Python's re module, please refer to our regex tutorial.

Note:

  • We need to import the re module to use a regular expression.
  • To get just the name of the tag instead of the entire content (tag + content within the tag), use the .name attribute.

Example: The following code finds all instances of the tags starting with the letter “b”.

import re

# find all tags whose names start with "b"
for regular in soup.find_all(re.compile("^b")):
    print(regular.name)

Output:

body
b
b
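A regular expression can also be matched against attribute values instead of tag names. The following is a minimal sketch, assuming a shortened copy of the example document; the pattern and the variable names docs_links and link_ids are chosen for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning
<a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the tags whose href attribute contains "docs."
docs_links = soup.find_all(href=re.compile(r"docs\."))
link_ids = [tag["id"] for tag in docs_links]
print(link_ids)
```

Only the Python and Java links are kept here, because the Golang URL does not contain "docs.".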

❖ A List

Multiple tags can be passed into the search functions using a list, as shown in the example below:

Example: The following code finds all the <a> and <b> tags in the HTML document.

for tag in soup.find_all(['a','b']):
    print(tag)

Output:

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>
<b>Please subscribe!</b>
<b>copyright - FINXTER</b>

❖ A Function

We can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.

Example: The following code defines a function which returns True for all tags that have both a class and an id attribute in the HTML document. We then pass this function to the find_all() method to get the desired output.

def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')


for tag in soup.find_all(func):
    print(tag)

Output:

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
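A function filter can also inspect a tag's name and its children. Here is a small sketch, assuming a shortened copy of the example document; paragraph_with_bold is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning <a href="https://docs.python.org/3/" id="python">Python</a></p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def paragraph_with_bold(tag):
    # Hypothetical helper: True only for <p> tags that contain a <b> tag
    return tag.name == "p" and tag.find("b") is not None


bold_paragraphs = soup.find_all(paragraph_with_bold)
for tag in bold_paragraphs:
    print(tag)
```

The first paragraph is skipped because it contains a link but no bold text.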

➠ Now that we have gone through the different kinds of filters that we use with the search methods, we are well equipped to dive deep into the find() and find_all() methods.

✨ The find() Method

The find() method searches the parse tree and returns the first tag that matches the given filters.

Syntax:

find(name, attrs, recursive, string, **kwargs)

When a tag matches, find() returns an object of type bs4.element.Tag; when nothing matches, it returns None.

Example:

print(soup.find('h1'), "\n")
print("RETURN TYPE OF find(): ",type(soup.find('h1')), "\n")
# note that only the first instance of the tag is returned
print(soup.find('a'))

Output:

<h1>Searching Parse Tree In BeautifulSoup</h1> 

RETURN TYPE OF find():  <class 'bs4.element.Tag'> 

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>

➠ The above operation is the same as soup.h1 or soup.a, which also return the first instance of the given tag. So, what’s the difference? The find() method lets us pick out a particular instance of a tag using key-value pairs, as shown in the example below:

print(soup.find('a',id='golang'))

Output:

<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>
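Because find() returns None when no tag matches, it is usually safer to check the result before using it. A minimal sketch, assuming a one-line document made up for this illustration:

```python
from bs4 import BeautifulSoup

# One-line document made up for this sketch
html_doc = '<p class="Main"><a href="https://docs.python.org/3/" id="python">Python</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# There is no <table> tag in the document, so find() returns None
missing = soup.find("table")
print(missing)

# Guard against None before reading attributes or text
link = soup.find("a", id="python")
if link is not None:
    print(link["href"])
    print(link.get_text())
```

Skipping the check and calling something like missing["href"] would raise a TypeError, since None does not support indexing.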

✨ The find_all() Method

We saw that the find() method searches for the first matching tag. What if we want every instance of a tag within the HTML document? The find_all() method searches for all tags that match the given filters and returns them in a list-like object of type bs4.element.ResultSet. Since the items are returned in a list, they can be accessed by index.

Syntax:

find_all(name, attrs, recursive, string, limit, **kwargs)

Example: Searching all instances of the ‘a’ tag in the HTML document.

for tag in soup.find_all('a'):
    print(tag)

Output:

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>
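Since the result of find_all() behaves like a list, we can index it, take its length, or use it in a list comprehension. A short sketch, assuming a shortened copy of the example document:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning
<a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")
print(len(links))        # number of matching tags
print(links[0]["id"])    # first match
print(links[-1]["id"])   # last match

# Collect one attribute from every match
hrefs = [a["href"] for a in links]
print(hrefs)
```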

Now, there are several other arguments apart from the filters that we already discussed earlier. Let us have a look at them one by one.

❖ The name Argument

As we saw with the filters, the name argument can be a string, a regular expression, a list, a function, or the value True.

Example:

for tag in soup.find_all('p'):
    print(tag)

Output:

<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
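The value True deserves a quick illustration: passed as the name, it matches every tag but no text strings, and passed as an attribute value, it matches every tag that has that attribute. A minimal sketch, assuming a shortened copy of the example document:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning
<a href="https://docs.python.org/3/" class="language" id="python">Python</a> and
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a></p>
<p class="Secondary"><b>Please subscribe!</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# True as the name matches every tag in the document
all_tag_names = [tag.name for tag in soup.find_all(True)]
print(all_tag_names)

# True as an attribute value matches every tag that has that attribute
tags_with_id = [tag["id"] for tag in soup.find_all(id=True)]
print(tags_with_id)
```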

❖ The keyword Arguments

Just like the find() method, find_all() also allows us to find particular instances of a tag. For example, if the id argument is passed, Beautiful Soup filters against each tag’s ‘id’ attribute and returns the result accordingly.

Example:

print(soup.find_all('a',id='java'))

Output:

[<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>]

You can also pass the attributes as dictionary key-value pairs using the attrs argument.

Example:

print(soup.find_all('a', attrs={'id': 'java'}))

Output:

[<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>]

❖ Search Using CSS Class

Often we need to find a tag that has a certain CSS class, but class is a reserved keyword in Python, so using class as a keyword argument raises a syntax error. As of Beautiful Soup 4.1.2, we can search by CSS class using the keyword argument class_ instead.

Example:

print(soup.find_all('p', class_='Secondary'))

Output:

[<p class="Secondary"><b>Please subscribe!</b></p>, <p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>]

❖ Note: The above search finds all instances of the p tag with the class “Secondary”. You can also filter on multiple attributes at once by passing a dictionary.

Example:

print(soup.find_all('p', attrs={'class': 'Secondary', 'id': 'finxter'}))

Output:

[<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>]

❖ The string Argument

The string argument allows us to search for strings instead of tags.

Example:

print(soup.find_all(string=["Python", "Java", "Golang"]))

Output:

['Python', 'Java', 'Golang']
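The string argument also accepts a regular expression, and it can be combined with a tag name to get the tags whose text matches. The following is a sketch, assuming a shortened copy of the example document; the variable names are chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning
<a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Tag name + string: returns the <a> tags whose text is exactly "Java"
java_links = soup.find_all("a", string="Java")
print(java_links)

# Regular expression: returns the text strings that contain "fun"
fun_strings = soup.find_all(string=re.compile("fun"))
print(fun_strings)
```

Note the difference in what comes back: the first call returns tags, while the second returns the matching strings themselves.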

❖ The limit Argument

The find_all() method scans through the entire HTML document and returns all the matching tags and strings. This can take a lot of time if the document is large. If you do not need every result, you can pass in the limit argument, which makes find_all() stop searching once it has found that many matches.

Example: There are three links in the example HTML document, but this code only finds the first two:

print(soup.find_all("a", limit=2))

Output:

[<a class="language" href="https://docs.python.org/3/" id="python">Python</a>, <a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a>]
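The syntax of find() and find_all() also lists a recursive argument. By default, the search covers all descendants of a tag; with recursive=False, only the direct children are considered. A minimal sketch, assuming a shortened copy of the example document:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """<body>
<h1>Searching Parse Tree In BeautifulSoup</h1>
<p class="Main">Learning
<a href="https://docs.python.org/3/" id="python">Python</a> and
<a href="https://docs.oracle.com/en/java/" id="java">Java</a></p>
<p class="Secondary"><b>Please subscribe!</b></p>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

# Default search: looks at all descendants of <body>
all_links = body.find_all("a")

# recursive=False: looks only at the direct children of <body>
direct_links = body.find_all("a", recursive=False)
direct_paragraphs = body.find_all("p", recursive=False)

print(len(all_links), len(direct_links), len(direct_paragraphs))
```

The links are nested inside a p tag, so they are not direct children of body and the non-recursive search for them comes back empty.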

✨ Other Search Methods

We have successfully explored the most commonly used search methods, i.e., find() and find_all(). Beautiful Soup also has other methods for searching the parse tree, and they take the same kinds of arguments as the ones discussed above. The only difference is the part of the tree they search. Let us have a quick look at these methods.

  • find_parents() and find_parent(): these methods are used to traverse the parse tree upwards and look for a tag’s/string’s parent(s).
  • find_next_siblings() and find_next_sibling(): these methods are used to find the next sibling(s) of an element in the HTML document.
  • find_previous_siblings() and find_previous_sibling(): these methods are used to find and iterate over the sibling(s) that appear before the current element.
  • find_all_next() and find_next(): these methods are used to find and iterate over the tags and strings that appear after the current element in the HTML document.
  • find_all_previous() and find_previous(): these methods are used to find and iterate over the tags and strings that appear before the current element in the HTML document.

Example:

current = soup.find('a', id='java')
print(current.find_parent())
print()
print(current.find_parents())
print()
print(current.find_previous_sibling())
print()
print(current.find_previous_siblings())
print()
print(current.find_next())
print()
print(current.find_all_next())
print()

Output:

<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>

[<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>, <body>
<h1>Searching Parse Tree In BeautifulSoup</h1>
<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
</body>, <html><head><title>Searching Tree</title></head>
<body>
<h1>Searching Parse Tree In BeautifulSoup</h1>
<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
</body></html>, 
<html><head><title>Searching Tree</title></head>
<body>
<h1>Searching Parse Tree In BeautifulSoup</h1>
<p class="Main">Learning 
<a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
</body></html>]

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>

[<a class="language" href="https://docs.python.org/3/" id="python">Python</a>]

<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>

[<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>, <p class="Secondary"><b>Please subscribe!</b></p>, <b>Please subscribe!</b>, <p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>, <b>copyright - FINXTER</b>]
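The remaining methods from the list accept the same filters as find() and find_all(). Here is a short sketch of find_next_sibling(), find_next_siblings(), find_previous() and find_all_previous() with a tag name passed in, assuming a shortened copy of the example document:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for this sketch)
html_doc = """
<p class="Main">Learning
<a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
current = soup.find("a", id="java")

# Siblings that come after the current element
next_link = current.find_next_sibling("a")
later_links = current.find_next_siblings("a")

# Elements that were parsed before the current element
previous_link = current.find_previous("a")
enclosing_paragraphs = current.find_all_previous("p")

print(next_link["id"])
print([a["id"] for a in later_links])
print(previous_link["id"])
print([p["class"] for p in enclosing_paragraphs])
```

Note that find_all_previous("p") returns the paragraph that contains the current link, because the opening p tag appears earlier in the document than the link itself.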

Conclusion

With that we come to the end of this article; I hope that after reading this article you can search elements within a parse tree with ease! Please subscribe and stay tuned for more interesting articles.
