Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML Tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the `requests` library.
- Retrieving Total Pages to Scrape.
- Closing the Open Connection.
Part 2 focuses on:
- Configuring a page URL for scraping.
- Setting a delay: `time.sleep()` to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
Part 3 focuses on:
- Locating Book details.
- Writing code to retrieve this information for all Books.
- Saving Book details to a List.
Part 4 focuses on:
- Cleaning up the scraped code.
- Saving the output to a CSV file.
Preparation
This article assumes you have completed the following from Part 1:
- Installed the required libraries.
- Successfully connected to the Books to Scrape website.
- Retrieved the Total Number of pages to scrape.
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer
```
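If any of these imports fail, the third-party packages can typically be installed with `pip` (the package names below are the usual PyPI names; `time`, `urllib`, and `csv` ship with Python):

```shell
pip install requests beautifulsoup4 pandas
```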
Configure Page URL
The next step is to determine how to properly navigate from page to page while performing the scrape operation.
When you first navigate to the Books to Scrape site, the URL in the address bar is the following:
https://books.toscrape.com/index.html
Let’s see what happens when we click `next` in the footer area.

We are forwarded to page 2 of the website, and the URL format in the address bar changes to the following:
https://books.toscrape.com/catalogue/page-2.html
Now, let’s navigate to the footer area and click the `previous` button.

We are forwarded to page 1 of the website, and the URL format in the address bar changes to:
https://books.toscrape.com/catalogue/page-1.html
Notice how the original URL format changes. The following appends to the original URL:
- a subdirectory: `/catalogue/`
- a file name: `page-x.html`, where `x` is the page you are currently on.
💡 Note: Click `next` and `previous` in the footer area to confirm this.
We can work with this!
Let’s move to an IDE and write Python code to configure this changing URL.
💡 Note: The code below has been brought forward from Part 1. The lines in yellow are either new or modified.
At this point, we recommend you not loop through all 50 pages of the website. Instead, let’s change the While Loop to navigate through just two (2) pages.
```python
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")

    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
```
- Line [1] creates a new variable `cur_page` to keep track of the page we are currently on. The initial value is one (1).
- Line [2] initiates a While Loop which repeats while `cur_page` is less than or equal to two (2). The variable `total_pgs` has been commented out while in test mode.
- Line [3] creates a new variable `pg_url` by combining the variables `web_url` and `cur_page`. Example: `https://books.toscrape.com/catalogue/page-1.html`
- Line [4] attempts to connect to the `pg_url` stated on Line [3]. If successful, an HTTP Status Code of 200 returns and saves to `res1`.
- Line [5] initiates an if statement. If Line [4] was successful, the code below executes.
- Line [6] retrieves the HTML code from `pg_url`. This output saves to the `soup` variable.
- Line [7] outputs a message to the terminal.
- Line [8] increases the value of `cur_page` by one (1).
- Line [9] pauses the code for two (2) seconds between pages using `time.sleep()`.
- Lines [10-11] execute if the `res1` variable returns a value other than 200 (success).
- Lines [12-13] close the open connections.
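As a variant sketch (not the article's code), the same control flow can be written as a for loop with the page-fetching step injected as a function, which makes it easy to dry-run without touching the network. The names `scrape_pages` and `fetch` are our own:

```python
import time

def scrape_pages(base_url, last_page, fetch, delay=0.0):
    """Visit page-1.html through page-<last_page>.html, collecting
    whatever fetch(url) returns for each page, and pause `delay`
    seconds between pages."""
    pages = {}
    for n in range(1, last_page + 1):
        url = f"{base_url}/catalogue/page-{n}.html"
        pages[url] = fetch(url)
        time.sleep(delay)
    return pages

# With requests installed, fetch could be:
#   fetch = lambda url: requests.get(url).text
# For a dry run, any callable works:
result = scrape_pages("https://books.toscrape.com", 2,
                      fetch=lambda url: f"<html>{url}</html>")
print(len(result))  # 2 pages visited
```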
💡 Note: To comment out code in Python, use the # character. This prevents everything after it on the current line from executing.
The modified code executes twice, as depicted by the output below:
Output
Scraping: https://books.toscrape.com/catalogue/page-1.html
Scraping: https://books.toscrape.com/catalogue/page-2.html
Summary
In this article, you learned how to:
- Configure a page URL for scraping.
- Set a delay: `time.sleep()` to pause between page scrapes.
- Loop through two (2) pages for testing purposes.
What’s Next
In Part 3 of this series, you will learn how to identify additional elements/tags inside the HTML code.