BeautifulSoup Find *

Preparation

This article assumes you have the following libraries installed:

  • requests
  • BeautifulSoup (bs4)

and a basic understanding of:

  • HTML
  • Python

Add the following code to the top of each code snippet. This snippet allows the code in this article to run error-free.

# Starter Code for Initialization:
from bs4 import BeautifulSoup
import requests

res = requests.get('https://scrapesite.com')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
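
If you would rather not send a request to a live site while following along, you can feed any of the HTML snippets in this article to BeautifulSoup directly as a string; the URL above is only a placeholder. A minimal sketch:

from bs4 import BeautifulSoup

html = '<div id="page"><h1>First ID</h1></div>'  # any HTML snippet from this article
soup = BeautifulSoup(html, 'html.parser')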

BeautifulSoup Find by ID

If the HTML code contains one or more IDs, the find() method on line [1] of the Python code below returns the first (or only) occurrence of the specified ID.

HTML

<div id="page">
    <h1>First ID</h1>
</div>

Python Code

one_div = soup.find(id='page')
print(one_div.text.strip())
  • Line [1] locates the first occurrence of the HTML id attribute page and saves it to one_div.
  • Line [2] removes the HTML tags and outputs the text without leading and trailing spaces using strip().

Output

First ID
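
If no element carries the specified ID, find() returns None, and calling .text on the result raises an AttributeError. A small, hedged variant of the code above that guards against this (the id missing is hypothetical):

one_div = soup.find(id='missing')   # no such id in the HTML
if one_div is not None:
    print(one_div.text.strip())
else:
    print('No element with that id was found.')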

If there are multiple occurrences, modify line [1] to use the find_all() method.

HTML

<div id="page">
    <h1>First ID</h1>
</div>
<div id="page">
    <h1>Second ID</h1>
</div>

Python Code

all_divs = soup.find_all(id='page')
for d in all_divs:
    print(d.text.strip())
  • Line [1] searches for all occurrences of the id attribute page.
  • Line [2] initializes an iterator.
    • Line [3] removes the HTML tags and outputs each <h1> text in the loop without leading and trailing spaces (strip()).

Output

First ID
Second ID
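
find_all() also accepts an optional limit argument when only the first few matches are needed. As a sketch, limit=1 returns a list containing just the first <div> with that id:

first_only = soup.find_all(id='page', limit=1)
print(first_only[0].text.strip())   # First ID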

BeautifulSoup Find Tag

Running the code locates all elements that match the tag and attribute specified on line [1]. These matches save to all_tags.

HTML

<span style="color: #FF0000">
Hello World!
</span>

Python Code

all_tags = soup.find_all('span', style='color: #FF0000')

for s in all_tags:
    print(s.get_text().strip())
  • Line [1] searches for all occurrences of the tag and style specified inside find_all(). The output saves to all_tags.
  • Line [2] initializes an iterator.
    • Line [3] removes the HTML tags and outputs the text using the get_text() method without leading and trailing spaces using the strip() method.

Output

Hello World!
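
The same search can also be written with the attrs dictionary. This form is handy when an attribute name clashes with a Python keyword or is not a valid keyword argument; the sketch below is equivalent to the code above:

all_tags = soup.find_all('span', attrs={'style': 'color: #FF0000'})

for s in all_tags:
    print(s.get_text().strip())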

BeautifulSoup Find by Class

In the HTML, we have two <article> tags. To access them, the class attribute is used.

HTML

      <article class="book">
        <a href="../the-secret-garden/index.html">
        <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
        <p class="star-rating Four"></p>
       </article>
…

Python Code

books = soup.find_all(class_='book')
print(books)

💡 Note: the underscore (_) on Line [1] directly after the word class is required; without it, the code will not run correctly. Line [2] returns and prints the contents as a list.

Output

[<article class="book">
<a href="../the-secret-garden/index.html">
<img alt="The Secret Garden" src="../c5465a06182ed6ebfa40d049258a2f58.jpg"/></a>
<p class="star-rating Four"></p>
…]
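
A CSS selector gives the same result. As a sketch, select() with the class selector article.book returns the same list of <article> tags:

books = soup.select('article.book')
print(books)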

BeautifulSoup Find href

For this example, the href of each HTML <a> tag will be output to the terminal.

HTML

[<article class="book">
<a href="../the-secret-garden/index.html">
…
</article>
<article class="book">
<a href="../gone-with-wind/index.html">
…
</article>]

Python Code

links = soup.find_all('a')
for l in links:
    print(l['href'])
  • Line [1] saves all the <a> tags found to the links variable.
  • Line [2] initializes an iterator.
    • Line [3] retrieves and outputs the value of the href attribute.

Output

../the-secret-garden/index.html
../gone-with-wind/index.html
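
Indexing with l['href'] raises a KeyError for any <a> tag that has no href attribute. A hedged variant that skips such tags by using get(), which returns None instead of raising an error:

links = soup.find_all('a')
for l in links:
    href = l.get('href')
    if href:
        print(href)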

BeautifulSoup Find Attribute

In this HTML example, each book has a rating. This example extracts the rating value from the tag's attributes.

HTML

[<article class="book">
    <a href="../the-secret-garden/index.html">
  <p class="star-rating Four">
</article>
  <article class="book">
    <a href="../gone-with-wind/index.html">
   <p class="star-rating Three">
</article>]

Python Code

ratings = soup.find_all('p', class_="star-rating")

for r in ratings:
    print(r.attrs.get("class")[1])
  • Line [1] saves all the <p> tags with a specified class to the ratings variable.
  • Line [2] initializes an iterator.
    • Line [3] retrieves the class attribute with the get() method and outputs its second value (the rating).

Output

Four
Three
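
The get("class") call returns the full list of classes, for example ['star-rating', 'Four'], and index [1] picks out the rating word. As a sketch, the same lookup can be written with subscript notation:

for r in ratings:
    print(r['class'])      # e.g. ['star-rating', 'Four']
    print(r['class'][1])   # just the rating word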

BeautifulSoup Nested Tags

To access nested tags, use the select() method. In this case, we have two paragraphs, with five <i> tags nested below the initial <p> tag.

HTML

     <article class="book">
        <a href="../the-secret-garden/index.html">
        <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
        <p class="star-rating Four">
          <i class="icon-star">1</i>
          <i class="icon-star">2</i>
          <i class="icon-star">3</i>
          <i class="icon-star">4</i>
          <i class="icon-star">5</i>
        </p>
        </article>
        ...

Python Code

nested = soup.select('p i')
for n in nested:
    print(n.text)
  • Line [1] saves all the <i> tags nested inside <p> tags to the nested variable.
  • Line [2] initializes an iterator.
    • Line [3] removes the HTML tags and outputs the text.
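
The same nesting can be expressed with chained find_all() calls, stepping from each <p> tag into its <i> children. A sketch equivalent to the select() call above:

for p in soup.find_all('p', class_='star-rating'):
    for i in p.find_all('i'):
        print(i.text)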

BeautifulSoup Find Text

This example looks for the occurrence of the string 'Finxter'. When the code below runs, the output returns as a list.

HTML

...
<a href="https://app.finxter.com/learn/computer/science/" class="blog">Finxter</a>
 ...

Python Code

strings = soup.find_all('a', string='Finxter')
print(strings[0].text)
  • Line [1] finds all the occurrences and saves them to a list.
  • Line [2] accesses index [0] and the text attribute and outputs the anchor text.

OR

for s in strings:
   print(s.text)
  • Line [1] initializes an iterator.
    • Line [2] removes the HTML tags and outputs the text.

Output

Finxter
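
string= matches the anchor text exactly. For a partial match, a compiled regular expression can be passed instead; a minimal sketch (the pattern Finx is only an example):

import re

partial = soup.find_all('a', string=re.compile('Finx'))
for s in partial:
    print(s.text)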

BeautifulSoup XPath

BeautifulSoup, by itself, does not support XPath expressions. The lxml library is needed to run XPath queries against the HTML.

Install the Library

To install the lxml library, open a terminal in your IDE. At the command prompt ($), enter the command below. The command prompt ($) on your terminal may be different.

$ pip install lxml

Hit the <enter> key to start the installation.

If successful, a message is displayed on the terminal indicating this.
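
One way to confirm the installation is to ask pip for the package details; if lxml is installed, its version and location are printed:

$ pip show lxml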


XPath Example

Below is a code example that will run on its own to show how to use XPath to locate HTML nodes.

from bs4 import BeautifulSoup
import requests
from lxml import etree

htext = """
<!doctype html>
<html lang="en">
…
   <body>
    <div id="page"> 
      <div class="row">
        <a href="https://app.finxter.com" class="signup">Join</a> 
       </div>
    </div>
   </body>
</html>
"""

result    = etree.HTML(htext)
href_text = result.xpath('//div/div/a/text()')
print(href_text)
  • Lines [1-2] import the two libraries shown in the Required Starter code above; they are kept for consistency, although only the etree import below is used in this example.
  • Line [3] imports the etree module from the lxml library. etree parses the HTML into a tree of nodes; its xpath() method then locates elements (by id, tag, attribute, etc.) by following the nested relationships of those nodes, similar to a file path.
  • Line [4] is the web page in a string variable (htext).
  • Lines [5-6] parse the HTML string and run an XPath query that retrieves the <a> tag text.

To accomplish this, you need to drill down to reach this tag. In this example, there are:

  • two <div> tags in the HTML code
  • one <a> tag

From the <a> tag, the text is retrieved by referencing the text() method.

  • Line [7] outputs the text.

Output

['Join']
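
XPath can also filter on attributes and return attribute values directly. As a sketch against the same htext, the expression below selects the href of the <a> tag whose class is signup and prints ['https://app.finxter.com']:

signup_href = result.xpath('//a[@class="signup"]/@href')
print(signup_href)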