Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into the competitor's pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the requests library.
- Retrieving Total Pages to Scrape.
- Closing the Open Connection.
Part 2 focused on:
- Configuring a page URL for scraping
- Setting a delay: time.sleep() to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
Part 3 focuses on:
- Locating Book details.
- Writing code to retrieve this information for all Books.
- Saving Book details to a List.
Part 4 focuses on:
- Cleaning up the scraped code.
- Saving the output to a CSV file.
Preparation
This article assumes you have completed the following from Part 1 and Part 2:
- Installed the required libraries.
- Successfully connected to the Books to Scrape website.
- Retrieved the Total Number of pages to scrape.
- Configured the page URL for scraping.
- Set a time delay to pause between page scrapes.
- Successfully looped through two (2) test pages.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer
Overview
Each Book on the top-level pages of the Books to Scrape website contains a:
- Thumbnail image.
- Book Title hyperlink.
- Price.
- In stock reference.
- Add to basket Button.
This section will scrape two (2) of these top-level pages.
Locate Book Details
Navigating through the site shows us that the setup for each Book is identical across all pages.

To view the HTML code associated with each Book, perform the following steps:
- Open a browser and navigate to the Books to Scrape website.
- With the mouse, hover over any thumbnail.
- Right-mouse click to display a pop-up menu.
- Click to select the Inspect menu item. This option opens the HTML code window to the right of the browser window.

Upon reviewing the HTML code, we notice that the highlighted <img> tag is wrapped inside <article class="product_pod"></article> tags.

Let’s confirm this by using our mouse to hover over the <article class="product_pod"> tag in the HTML code.
If correct, the selected Book on the left highlights.

Great! We can work with this!
Let’s move back to an IDE and write some Python Code!
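Before the full script, the browser check above can also be sketched in code. The snippet below is a minimal illustration, not part of the series code: sample_html is a hand-written stand-in for two Books (an assumption; the live page carries more markup inside each article), and the file names in it are made up.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two Book entries (an assumption for illustration;
# the live page carries more markup inside each <article>).
sample_html = """
<ol class="row">
  <li><article class="product_pod">
    <a href="book-one_1/index.html"><img src="one.jpg" alt="Book One"></a>
  </article></li>
  <li><article class="product_pod">
    <a href="book-two_2/index.html"><img src="two.jpg" alt="Book Two"></a>
  </article></li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Check that every thumbnail has an <article> ancestor.
thumbnails = soup.find_all("img")
wrapped = [img.find_parent("article") is not None for img in thumbnails]

# Count the <article> tags: there should be one per Book.
articles = soup.find_all("article")
print(len(articles), all(wrapped))
```

If each Book really sits inside its own article tag, looping over the result of find_all('article') visits one Book at a time.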
💡 Note: The code below has been brought forward from Part 2. The lines in yellow are new or modified.
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []  # [1] one row per scraped Book

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')  # [2]
            for article in all_articles:  # [3]
                b_href = article.find('a')['href']  # [4]
                b_src = article.find('img')['src']  # [5]
                b_title = article.find('img')['alt']  # [6]
                b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]  # [7]
                b_price = article.find('p', class_='price_color').text  # [8]
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])  # [9]
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break  # stop here; otherwise the same page is requested forever
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
print(all_books)  # [10]

- Line [1] declares the list variable all_books.
- Line [2] locates all <article> tags on the current web page. This output saves to all_articles.
- Line [3] initiates a for loop to traverse through each <article></article> tag on the current page.
  - Line [4] retrieves and saves the href value to the b_href variable.
  - Line [5] retrieves and saves the image source to the b_src variable.
  - Line [6] retrieves and saves the title to the b_title variable.
  - Line [7] retrieves and saves the rating to the b_rtg variable.
  - Line [8] retrieves and saves the price to the b_price variable.
  - Line [9] appends this information to the all_books list created earlier.
- Line [10] outputs the contents of all_books to the terminal.
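To see how the five lookups behave without touching the network, they can be run against a single Book. The sketch below makes two assumptions: sample_html is a trimmed, hand-written approximation of one Book's markup, and parse_article is a hypothetical helper name that is not part of the series code. The href, image source, title, rating, and price values are taken from the output snippet in this article.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written approximation of one Book's markup (an assumption).
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html">
      <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
           alt="A Light in the Attic" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three"></p>
  <p class="price_color">£51.77</p>
</article>
"""


def parse_article(article):
    """Hypothetical helper: return [href, src, title, rating, price] for one Book."""
    img = article.find("img")
    # The class attribute is multi-valued, so it comes back as a list.
    rating_classes = article.find("p", class_="star-rating")["class"]
    return [
        article.find("a")["href"],   # first link inside the article
        img["src"],                  # thumbnail location
        img["alt"],                  # the alt text holds the Book title
        rating_classes[1],           # second class name is the rating word
        article.find("p", class_="price_color").text,
    ]


article = BeautifulSoup(sample_html, "html.parser").find("article")
rating_classes = article.find("p", class_="star-rating")["class"]
book_row = parse_article(article)

print(rating_classes)
print(book_row)
```

The first print shows why index [1] is used for the rating: Beautiful Soup returns the class attribute as the list ['star-rating', 'Three'], so the second item is the rating word.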
Output (Snippet)
The all_books list should now contain 40 rows: 20 Books from each of the two (2) pages scraped.
[['catalogue/a-light-in-the-attic_1000/index.html', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'A Light in the Attic', 'Three', '£51.77'], ['catalogue/tipping-the-velvet_999/index.html', 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'Tipping the Velvet', 'One', '£53.74'], .....]
💡 Note: You may want to remove Line [10] before continuing.
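Because each row is a plain list of five strings, the result is easy to spot-check without printing the whole list. The sketch below is one possible check rather than part of the series code: sample_books is an illustrative name holding only the two rows shown in the output snippet, standing in for the full 40-row list.

```python
# Two rows copied from the output snippet; the real all_books list holds 40.
sample_books = [
    ['catalogue/a-light-in-the-attic_1000/index.html',
     'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
     'A Light in the Attic', 'Three', '£51.77'],
    ['catalogue/tipping-the-velvet_999/index.html',
     'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
     'Tipping the Velvet', 'One', '£53.74'],
]

# Unpack each row in the same order the fields were appended.
for b_href, b_src, b_title, b_rtg, b_price in sample_books:
    print(f"{b_title}: rating={b_rtg}, price={b_price}")

# Sanity check: collect the distinct row lengths; a single value of 5 is expected.
row_lengths = {len(row) for row in sample_books}
print(row_lengths)
```

Running the same loop over all_books after a test scrape gives a quick view of titles, ratings, and prices.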
Summary
In this article, you learned how to:
- Locate Book details.
- Write code to retrieve this information.
- Save Book details to a List.
What’s Next
In Part 4 of this series, we will clean up the code and save the results to a CSV file.