BeautifulSoup Find *

Getting Started

This article assumes you have the Requests and Beautiful Soup (bs4) libraries installed and a basic understanding of HTML.

This article also assumes you are familiar with the following lines of code. Append this code to the top of each script for Examples 1-7. Modify the URL as needed.

# Starter Code for Initialization:
from bs4 import BeautifulSoup
import requests

res = requests.get('https://scrapesite.com')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
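
If the request might fail, you can stop early on HTTP errors before parsing. A minimal sketch, assuming the same placeholder URL as above:

res = requests.get('https://scrapesite.com')
res.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx response
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')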

BeautifulSoup Find by ID

If the HTML code contains one or more IDs, the find() method on line [1] returns the first (or only) occurrence of the specified ID.

HTML:

<div id="page">
    <h1>First ID</h1>
</div>

Code:

one_div = soup.find(id='page')
print(one_div.text.strip())
  • Lines [1-2] remove the HTML tags and output the text without leading and trailing spaces using strip().

Output:

First ID
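
If no element with the given ID exists, find() returns None, and calling .text on the result raises an AttributeError. A minimal defensive sketch against the same soup object:

one_div = soup.find(id='page')
if one_div is not None:
    print(one_div.text.strip())
else:
    print('No element with id="page" found.')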

If there are multiple occurrences, modify line [1] to use the find_all() method.

HTML:

<div id="page">
    <h1>First ID</h1>
</div>
<div id="page">
    <h1>Second ID</h1>
</div>

Code:

all_divs = soup.find_all(id='page')
for d in all_divs:
    print(d.text.strip())
  • Line [1] saves all matching tags to the all_divs variable.
  • Lines [2-3] loop through the list and output each <h1> text without leading and trailing spaces (strip()).

Output:

First ID
Second ID

BeautifulSoup Find Tag

Running the code below locates all matches for the tag name and style attribute specified on line [1]. These matches are saved to all_tags.

HTML:

<span style="color: #FF0000">
Hello World!
</span>

Code:

all_tags = soup.find_all('span', style='color: #FF0000')

for s in all_tags:
    print(s.get_text().strip())
  • Line [1] saves all matching <span> tags to the all_tags variable.
  • Lines [2-3] loop through the list and output the text using the get_text() method, without leading and trailing spaces (strip()).

Output:

Hello World!
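
The same match can also be written by passing the attributes as a dictionary to the attrs parameter, which helps when an attribute name clashes with a Python keyword or is not a valid keyword argument. A minimal sketch against the same soup object:

all_tags = soup.find_all('span', attrs={'style': 'color: #FF0000'})

for s in all_tags:
    print(s.get_text().strip())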

BeautifulSoup Find by Class

In the HTML below, each book is wrapped in an <article> tag. To access these <article> tags, the class attribute is used.

HTML:

      <article class="book">
        <a href="../the-secret-garden/index.html">
        <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
        <p class="star-rating Four"></p>
       </article>
…

Code:

books = soup.find_all(class_='book')
print(books)

Note the use of the underscore (_) on line [1] directly after the word class. This character is required because class is a reserved keyword in Python; without it, the code will not run correctly. Line [2] returns and prints the contents as a list.

Output:

[<article class="book">
<a href="../the-secret-garden/index.html">
<img alt="The Secret Garden" src="../c5465a06182ed6ebfa40d049258a2f58.jpg"/></a>
<p class="star-rating Four"></p>
…]
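
If you prefer to avoid the trailing underscore, the same lookup can be written with the attrs dictionary or with a CSS selector via select(). A minimal sketch against the same soup object:

books = soup.find_all(attrs={'class': 'book'})  # dictionary form, no underscore needed
# or, equivalently, using a CSS class selector:
books = soup.select('.book')
print(books)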

BeautifulSoup Find href

For this example, the href of each HTML <a> tag will be output to the terminal.

HTML:

[<article class="book">
<a href="../the-secret-garden/index.html">
…
</article>
<article class="book">
<a href="../gone-with-wind/index.html">
…
</article>]

Code:

links = soup.find_all('a')
for link in links:
    print(link['href'])
  • Line [1] saves all the <a> tags found to the links variable.
  • Line [2] initializes an iterator.
  • Line [3] outputs the value of each tag's href attribute.

Output:

../the-secret-garden/index.html
../gone-with-wind/index.html
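
If a page contains <a> tags without an href attribute, indexing with ['href'] raises a KeyError. The get() method returns None instead, which can be filtered out. A minimal sketch against the same soup object:

links = soup.find_all('a')
for link in links:
    href = link.get('href')  # None if the tag has no href attribute
    if href:
        print(href)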

BeautifulSoup Find Attribute

In this HTML example, each book has a rating. This example extracts the rating value via the tag's attributes.

HTML:

[<article class="book">
    <a href="../the-secret-garden/index.html">
  <p class="star-rating Four">
</article>
  <article class="book">
    <a href="../gone-with-wind/index.html">
   <p class="star-rating Three">
</article>]

Code:

ratings = soup.find_all('p', class_="star-rating")

for r in ratings:
    print(r.attrs.get("class")[1])
  • Line [1] saves all the <p> tags with a specified class to the ratings variable.
  • Line [2] initializes an iterator.
  • Line [3] outputs the second class value (the rating) retrieved with the get() method.

Output:

Four
Three
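
Tag objects also support dictionary-style access, so r['class'] returns the same list as r.attrs.get('class'). The sketch below pairs that with a hypothetical word_to_number mapping (not part of the original example) to turn the rating word into an integer:

word_to_number = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}  # hypothetical helper mapping

for r in ratings:
    rating_word = r['class'][1]  # e.g. 'Four'
    print(rating_word, word_to_number.get(rating_word))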

BeautifulSoup Nested Tags

To access nested tags, use the select() method. In this case, we have two paragraphs, with five <i> tags nested below the initial <p> tag.

HTML:

     <article class="book">
        <a href="../the-secret-garden/index.html">
        <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
        <p class="star-rating Four">
          <i class="icon-star">1</i>
          <i class="icon-star">2</i>
          <i class="icon-star">3</i>
          <i class="icon-star">4</i>
          <i class="icon-star">5</i>
        </p>
        </article>
        ...

Code:

nested = soup.select('p i')
for n in nested:
    print(n.text)
  • Line [1] saves all the <i> tags nested inside <p> tags to the nested variable.
  • Lines [2-3] loop through the list and output each tag's text.
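
The same nesting can be expressed with find_all() calls instead of a CSS selector. A minimal sketch against the same soup object:

for p in soup.find_all('p', class_='star-rating'):
    for i in p.find_all('i'):
        print(i.text)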

BeautifulSoup Find Text

This example looks for the occurrence of the string 'Finxter'. When the code below runs, the output returns as a list.

HTML:

...
<a href="https://app.finxter.com/learn/computer/science/" class="blog">Finxter</a>
...

Code:

strings = soup.find_all('a', string='Finxter')
print(strings[0].text)
  • Line [1] finds all the occurrences of 'Finxter' and saves them to a list.
  • Line [2] accesses index [0] and outputs the anchor text.

OR

for s in strings:
    print(s.text)
  • Lines [1-2] initialize an iterator and loop through the list to display the anchor text.

Output:

Finxter
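
The string argument matches the anchor text exactly. To match on a partial string, pass a compiled regular expression instead. A minimal sketch against the same soup object:

import re

partial = soup.find_all('a', string=re.compile('Finx'))
for s in partial:
    print(s.text)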

BeautifulSoup XPath

BeautifulSoup, by itself, does not support XPath expressions. The lxml library is needed to parse data with an XPath expression.

Install the Library

To install the lxml library, navigate to the terminal in your IDE. At the command prompt ($), enter the command below. The command prompt ($) on your terminal may differ.

Code:

$ pip install lxml

Hit the <enter> key to start the installation.

If successful, a message is displayed on the terminal indicating this.
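
To confirm the installation from Python, the etree module exposes its version as a tuple:

from lxml import etree
print(etree.LXML_VERSION)  # prints the installed lxml version as a tuple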

XPath Example

Below is a code example that will run on its own to show how to use XPath to locate HTML nodes.

  • Lines [1-2] import the BeautifulSoup and Requests libraries shown in the initialization code above.
  • Line [3] imports the etree module from the lxml library. The etree module looks for HTML elements, such as an id, CSS selectors, tags, etc. The etree XPath method scans through nested relationships of HTML nodes, similar to a file path.
  • Line [4] assigns the web page to a string variable (htext).
  • Lines [5-6] parse the HTML string and retrieve the <a> tag text with an XPath expression.

To accomplish this, you need to drill down to reach this tag. In this example, there are:

  • two <div> tags in the HTML code
  • one <a> tag

From the <a> tag, the text is retrieved by referencing text() in the XPath expression.

  • Line [7] outputs the text.

Code:

from bs4 import BeautifulSoup
import requests
from lxml import etree

htext = """
<!doctype html>
<html lang="en">
…
   <body>
    <div id="page"> 
      <div class="row">
        <a href="https://app.finxter.com" class="signup">Join</a> 
       </div>
    </div>
   </body>
</html>
"""

result    = etree.HTML(htext)
href_text = result.xpath('//div/div/a/text()')
print(href_text)

Output:

['Join']
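
The same XPath approach can also return attribute values instead of text. Replacing text() with @href at the end of the expression retrieves the link target from the same parsed document:

href_value = result.xpath('//div/div/a/@href')
print(href_value)  # ['https://app.finxter.com']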