Preparation
This article assumes you have the following libraries installed:
- Beautiful Soup (bs4)
- Requests
and a basic understanding of:
- HTML
- CSS
- Python
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
# Starter Code for Initialization:
from bs4 import BeautifulSoup
import requests

res = requests.get('https://scrapesite.com')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
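To try the snippets in this article without a live site, the same soup object can be built from an HTML string. The following is a minimal sketch; the sample_html string is a made-up stand-in for the downloaded page and is not part of the original starter code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<html>
  <body>
    <div id="page">
      <h1>First ID</h1>
    </div>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser, as in the starter code.
soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.h1.text)
```

Swapping sample_html for res.text gives the same soup object as the starter code.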
Beautifulsoup Find by ID
If the HTML code contains one or more IDs, the find() method on line [1] returns the first (or only) occurrence of the specified ID.
HTML
<div id="page">
  <h1>First ID</h1>
</div>
Python Code
one_div = soup.find(id='page')
print(one_div.text.strip())
- Line [1] locates the first element whose id attribute is page and saves it to one_div.
- Line [2] removes the HTML tags and outputs the text without leading and trailing spaces using strip().
Output
First ID
If there are multiple occurrences, modify line [1] to use the find_all() method.
HTML
<div id="page">
  <h1>First ID</h1>
</div>
<div id="page">
  <h1>Second ID</h1>
</div>
Python Code
all_divs = soup.find_all(id='page')
for d in all_divs:
    print(d.text.strip())
- Line [1] searches for all elements whose id attribute is page.
- Line [2] initializes an iterator.
- Line [3] removes the HTML tags and outputs each <h1> text in the loop without leading and trailing spaces (strip()).
Output
First ID
Second ID
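When the requested ID is missing, find() returns None, so reading .text from the result fails. The sketch below shows one way to guard against that and also uses the CSS selector form select_one('#page'). The id sidebar is a made-up value chosen because it does not exist in the sample HTML.

```python
from bs4 import BeautifulSoup

html = '<div id="page"><h1>First ID</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when no element carries the requested id.
missing = soup.find(id='sidebar')   # 'sidebar' is a made-up id
missing_text = missing.text.strip() if missing is not None else ''

# select_one() accepts a CSS selector; '#page' targets id="page".
page = soup.select_one('#page')
page_text = page.text.strip()

# find_all() returns an empty list rather than None.
no_matches = soup.find_all(id='sidebar')

print(repr(missing_text), page_text, len(no_matches))
```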
Beautifulsoup Find Tag
Running the code locates all tags that match the description passed to find_all() on line [1]. These matches are saved to all_tags.
HTML
<span style="color: #FF0000"> Hello World! </span>
Python Code
all_tags = soup.find_all('span', style='color: #FF0000')
for s in all_tags:
    print(s.get_text().strip())
- Line [1] searches for all occurrences of the tag and attribute passed to find_all(). The output is saved to all_tags.
- Line [2] initializes an iterator.
- Line [3] removes the HTML tags and outputs the text using the get_text() method, without leading and trailing spaces using the strip() method.
Output
Hello World!
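Attributes can also be passed to find_all() as a dictionary through the attrs argument, and passing True as an attribute value matches any tag that has that attribute. The sketch below illustrates both; the second, blue <span> is a made-up addition to the sample HTML.

```python
from bs4 import BeautifulSoup

# The second <span> is a hypothetical extra element for comparison.
html = '''
<span style="color: #FF0000"> Hello World! </span>
<span style="color: #0000FF"> Goodbye! </span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Pass attributes as a dictionary instead of keyword arguments.
red_spans = soup.find_all('span', attrs={'style': 'color: #FF0000'})
red_text = [s.get_text().strip() for s in red_spans]

# style=True matches any <span> that has a style attribute at all.
styled_spans = soup.find_all('span', style=True)

print(red_text, len(styled_spans))
```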
Beautifulsoup Find by Class
The HTML contains two <article> tags. To access them, the example searches by their class, book.
HTML
<article class="book">
  <a href="../the-secret-garden/index.html">
    <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
  <p class="star-rating Four"></p>
</article>
…
Python Code
books = soup.find_all(class_='book')
print(books)
💡 Note: Notice the underscore (_) on Line [1] directly after the word class. This character is required because class is a reserved keyword in Python; without it, the code raises a SyntaxError. Line [2] prints the results as a list.
Output
[<article class="book">
<a href="../the-secret-garden/index.html">
<img alt="The Secret Garden" src="../c5465a06182ed6ebfa40d049258a2f58.jpg"/></a>
<p class="star-rating Four"></p>
…]
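A class search matches an element when any one of its classes equals the given value, and the same elements can be reached with a CSS selector through select(). The following sketch uses a trimmed-down version of the HTML above to show both forms.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the article markup.
html = '''
<article class="book"><p class="star-rating Four"></p></article>
<article class="book"><p class="star-rating Three"></p></article>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ matches an element if any one of its classes equals the value.
ratings = soup.find_all('p', class_='star-rating')

# The CSS selector 'article.book' is another way to find the articles.
books = soup.select('article.book')

print(len(ratings), len(books))
```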
Beautifulsoup Find href
For this example, the href for the HTML <a> tag will be output to the terminal.
HTML
[<article class="book">
  <a href="../the-secret-garden/index.html">
  …
</article>
<article class="book">
  <a href="../gone-with-wind/index.html">
  …
</article>]
Python Code
links = soup.find_all('a')
for link in links:
    print(link['href'])
- Line [1] saves all the <a> tags found to the links variable.
- Line [2] initializes an iterator.
- Line [3] outputs the value of each tag's href attribute.
Output
../the-secret-garden/index.html
../gone-with-wind/index.html
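Indexing a tag with ['href'] raises a KeyError if an <a> tag has no href attribute. The sketch below shows two ways around this: the get() method, which returns None for a missing attribute, and the href=True filter. The link text and the anchor without an href are made-up additions to the sample HTML.

```python
from bs4 import BeautifulSoup

# The link text and the middle anchor are hypothetical additions.
html = '''
<a href="../the-secret-garden/index.html">Book 1</a>
<a name="top">Anchor without href</a>
<a href="../gone-with-wind/index.html">Book 2</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# get() returns None when the attribute is absent instead of raising KeyError.
all_hrefs = [link.get('href') for link in soup.find_all('a')]

# href=True keeps only the <a> tags that actually have an href attribute.
valid_hrefs = [link['href'] for link in soup.find_all('a', href=True)]

print(valid_hrefs)
```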
Beautifulsoup Find Attribute
In this HTML example, each book has a Rating. This example extracts the value via Attributes.
HTML
[<article class="book">
  <a href="../the-secret-garden/index.html">
  <p class="star-rating Four">
</article>
<article class="book">
  <a href="../gone-with-wind/index.html">
  <p class="star-rating Three">
</article>]
Python Code
ratings = soup.find_all('p', class_="star-rating")
for r in ratings:
    print(r.attrs.get("class")[1])
- Line [1] saves all the <p> tags with the specified class to the ratings variable.
- Line [2] initializes an iterator.
- Line [3] reads the class attribute using the get() method and outputs its second value, the rating.
Output
Four
Three
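The index [1] works because Beautiful Soup returns the class attribute as a list of values, for example ['star-rating', 'Four']. The following sketch makes that explicit and converts the rating word to a number. The WORD_TO_NUMBER lookup table is a hypothetical helper, not part of the original example.

```python
from bs4 import BeautifulSoup

html = '<p class="star-rating Four"></p><p class="star-rating Three"></p>'
soup = BeautifulSoup(html, 'html.parser')

# Hypothetical lookup table for converting the rating word to a number.
WORD_TO_NUMBER = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

scores = []
for p in soup.find_all('p', class_='star-rating'):
    classes = p['class']            # a list such as ['star-rating', 'Four']
    # Guard against a tag that carries only the 'star-rating' class.
    word = classes[1] if len(classes) > 1 else None
    scores.append(WORD_TO_NUMBER.get(word))

print(scores)
```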
Beautifulsoup Nested Tags
To access nested tags, use the select() method, which accepts a CSS selector. In this case, we have two paragraphs, with five <i> tags nested below the initial <p> tag.
HTML
<article class="book">
  <a href="../the-secret-garden/index.html">
    <img src="../c5465a06182ed6ebfa40d049258a2f58.jpg" alt="The Secret Garden"></a>
  <p class="star-rating Four">
    <i class="icon-star">1</i>
    <i class="icon-star">2</i>
    <i class="icon-star">3</i>
    <i class="icon-star">4</i>
    <i class="icon-star">5</i>
  </p>
</article>
...
Python Code
nested = soup.select('p i')
for n in nested:
    print(n.text)
- Line [1] saves all the <i> tags nested inside <p> tags to the nested variable.
- Line [2] initializes an iterator.
- Line [3] removes the HTML tags and outputs the text.
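The selector 'p i' matches <i> tags at any depth inside a <p> tag, while 'p > i' matches direct children only. The sketch below compares the two on a trimmed-down copy of the HTML above and also combines tag and class selectors.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the article markup.
html = '''
<p class="star-rating Four">
  <i class="icon-star">1</i>
  <i class="icon-star">2</i>
  <i class="icon-star">3</i>
  <i class="icon-star">4</i>
  <i class="icon-star">5</i>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'p i' matches <i> at any depth inside <p>; 'p > i' matches direct children only.
descendants = [i.text for i in soup.select('p i')]
children = [i.text for i in soup.select('p > i')]

# Combine tag and class selectors to narrow the match.
stars = soup.select('p.star-rating i.icon-star')

print(descendants, len(stars))
```

In this sample both selectors return the same five tags, because every <i> is a direct child of the <p> tag.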
Beautifulsoup Find Text
This example searches for <a> tags whose text is the string 'Finxter'. The find_all() method returns the matches as a list.
HTML
... <a href="https://app.finxter.com/learn/computer/science/" class="blog">Finxter</a> ...
Python Code
strings = soup.find_all('a', string='Finxter')
print(strings[0].text)
- Line [1] finds all the occurrences and saves them to a list.
- Line [2] accesses the first list item by its index and outputs the anchor text using the text attribute.
OR
for s in strings:
    print(s.text)
- Line [1] initializes an iterator.
- Line [2] removes the HTML tags and outputs the text.
Output
Finxter
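A plain string passed to the string argument must equal the whole text of the tag. To match part of the text, a compiled regular expression can be passed instead. The sketch below contrasts the two; the second link, with the text 'Finxter Academy' and the example.com URL, is a made-up addition.

```python
import re
from bs4 import BeautifulSoup

# The second link is a hypothetical addition for comparison.
html = '''
<a href="https://app.finxter.com/learn/computer/science/" class="blog">Finxter</a>
<a href="https://example.com" class="blog">Finxter Academy</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# A plain string must equal the tag's whole text to match.
exact = soup.find_all('a', string='Finxter')

# A compiled pattern matches wherever the regex search succeeds.
partial = soup.find_all('a', string=re.compile('Finxter'))

print(len(exact), len(partial))
```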
Beautifulsoup XPath
Beautifulsoup, by itself, does not support XPath expressions. The lxml library is needed to evaluate XPath expressions against the HTML.
Install the Library
To install the lxml library, navigate to the terminal in your IDE. At the command prompt ($), enter the command below. The command prompt ($) on your terminal may be different.
$ pip install lxml
Hit the <enter> key to start the installation.
If successful, a message is displayed on the terminal indicating this.
XPath Example
Below is a code example that will run on its own to show how to use XPath to locate HTML nodes.
from bs4 import BeautifulSoup
import requests
from lxml import etree
htext = """
<!doctype html>
<html lang="en">
…
<body>
  <div id="page">
    <div class="row">
      <a href="https://app.finxter.com" class="signup">Join</a>
    </div>
  </div>
</body>
</html>
"""
result = etree.HTML(htext)
href_text = result.xpath('//div/div/a/text()')
print(href_text)
- Lines [1-2] import the two libraries shown in the starter code above. This example does not call them, since lxml does the parsing here.
- Line [3] imports the etree module from the lxml library. etree parses the HTML into a tree of nodes, and its xpath() method walks the nested relationships between those nodes, similar to a file path.
- Line [4] stores the web page in a string variable (htext).
- Lines [5-6] parse the HTML and retrieve the text of the <a> tag.
To accomplish this, you need to drill down to reach this tag. In this example, there are:
- two <div> tags in the HTML code
- one <a> tag
From the <a> tag, the text is retrieved by appending text() to the path.
- Line [7] outputs the text.
Output
['Join']
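XPath can also return attribute values and filter elements by attribute. The following is a minimal sketch using the same sample page, assuming lxml is installed. The two expressions are illustrative choices, not part of the original example.

```python
from lxml import etree

htext = """
<html>
  <body>
    <div id="page">
      <div class="row">
        <a href="https://app.finxter.com" class="signup">Join</a>
      </div>
    </div>
  </body>
</html>
"""

result = etree.HTML(htext)

# '@href' selects the attribute value instead of the element text.
hrefs = result.xpath('//div[@id="page"]//a/@href')

# A predicate such as [@class="signup"] filters elements by attribute.
labels = result.xpath('//a[@class="signup"]/text()')

print(hrefs, labels)
```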