Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the requests library.
- Retrieving Total Pages to Scrape.
- Closing the Open Connection.
Part 2 focused on:
- Configuring a page URL for scraping.
- Setting a delay: time.sleep() to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
Part 3 focuses on:
- Locating Book details.
- Writing code to retrieve this information for all Books.
- Saving Book details to a List.
Part 4 focuses on:
- Cleaning up the scraped code.
- Saving the output to a CSV file.
Preparation
This article assumes you have completed the following from Part 1 and Part 2:
- Installed the required libraries.
- Successfully connected to the Books to Scrape website.
- Retrieved the Total Number of pages to scrape.
- Configured the page URL for scraping.
- Set a time delay to pause between page scrapes.
- Successfully looped through two (2) test pages.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer
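As a quick refresher on two of the steps listed above, the sketch below shows one way the total page count and the page URLs can be derived. The HTML fragment is a hand-written stand-in for the site's pager (an assumption about its shape, not data fetched from the site), so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the pager markup (assumed shape, not live data).
sample_html = """
<ul class="pager">
    <li class="current">
        Page 1 of 50
    </li>
</ul>
"""

web_url = "https://books.toscrape.com"
soup = BeautifulSoup(sample_html, "html.parser")

# "Page 1 of 50" splits into ['Page', '1', 'of', '50']; index 3 is the total.
pager_text = soup.find("li", class_="current").text.strip()
total_pgs = int(pager_text.split(" ")[3])

# Build the URLs for the two test pages.
page_urls = [f"{web_url}/catalogue/page-{pg}.html" for pg in range(1, 3)]

print(total_pgs)
print(page_urls)
```

Running against the live site would replace sample_html with the text of a requests response, as in the code brought forward from Part 2.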
Overview
Each Book on the top-level pages of the Books to Scrape website contains a:
- Thumbnail image.
- Book Title hyperlink.
- Price.
- In stock reference.
- Add to basket Button.
This section will scrape two (2) of these top-level pages.
Locate Book Details
Navigating through the site shows us that the setup for each Book is identical across all pages.
To view the HTML code associated with each Book, perform the following steps:
- Open a browser and navigate to the Books to Scrape website.
- With the mouse, hover over any thumbnail.
- Right-mouse click to display a pop-up menu.
- Click to select the Inspect menu item. This option opens the HTML code window to the right of the browser window.
Upon reviewing the HTML code, we notice that the highlighted <img> tag is wrapped inside <article class="product_pod"></article> tags.
Let’s confirm this by using our mouse to hover over the <article class="product_pod"> tag in the HTML code.
If correct, the selected Book on the left highlights.
Great! We can work with this!
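The same parent-child relationship can also be checked in code. The sketch below is a minimal, offline illustration: the HTML fragment is hand-written to resemble one Book entry (trimmed, and assuming the class value product_pod), and find_parent() is used to walk up from the thumbnail to its enclosing tag.

```python
from bs4 import BeautifulSoup

# Hand-written fragment shaped like one Book entry (assumed, trimmed markup).
sample_html = """
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html">
            <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
                 alt="A Light in the Attic" class="thumbnail">
        </a>
    </div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
img = soup.find("img")

# find_parent() walks up the tree to the nearest enclosing <article> tag.
wrapper = img.find_parent("article")

print(wrapper.name, wrapper["class"])
```

If the thumbnail were not wrapped in an <article> tag, find_parent() would return None instead.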
Let’s move back to an IDE and write some Python Code!
💡 Note: The code below has been brought forward from Part 2. The lines in yellow are new or modified.
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []                                        # [1]

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # replace 2 with total_pgs to scrape every page
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')   # [2]
            for article in all_articles:              # [3]
                b_href  = article.find('a')['href']   # [4]
                b_src   = article.find('img')['src']  # [5]
                b_title = article.find('img')['alt']  # [6]
                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]  # [7]
                b_price = article.find('p', class_='price_color').text                   # [8]
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])               # [9]
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break  # stop here; otherwise the same page is requested forever
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
print(all_books)                                      # [10]
- Line [1] declares the list variable all_books.
- Line [2] locates all <article> tags on the current web page. This output saves to all_articles.
- Line [3] initiates a for loop to traverse through each <article></article> tag on the current page.
  - Line [4] retrieves and saves the href value to the b_href variable.
  - Line [5] retrieves and saves the image source to the b_src variable.
  - Line [6] retrieves and saves the title to the b_title variable.
  - Line [7] retrieves and saves the rating to the b_rtg variable.
  - Line [8] retrieves and saves the price to the b_price variable.
  - Line [9] appends this information to the all_books list created earlier.
- Line [10] outputs the contents of all_books to the terminal.
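To see what each lookup returns without contacting the site, the sketch below runs the same five lookups on a single hand-written <article> fragment. The markup is an assumption, trimmed to only the tags the code touches. Note how the rating comes from the second entry of the class attribute, which BeautifulSoup returns as a list.

```python
from bs4 import BeautifulSoup

# Hand-written fragment shaped like one Book entry (assumed, trimmed markup).
sample_html = """
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html">
            <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
                 alt="A Light in the Attic" class="thumbnail">
        </a>
    </div>
    <p class="star-rating Three"></p>
    <div class="product_price">
        <p class="price_color">£51.77</p>
    </div>
</article>
"""

article = BeautifulSoup(sample_html, "html.parser").find("article")

b_href = article.find("a")["href"]      # first <a> tag: link to the Book page
b_src = article.find("img")["src"]      # thumbnail location
b_title = article.find("img")["alt"]    # alt text holds the Book title
# class is multi-valued: ['star-rating', 'Three']; index 1 is the rating word.
b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
b_price = article.find("p", class_="price_color").text

print([b_href, b_src, b_title, b_rtg, b_price])
```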
Output (Snippet)
The all_books list should now contain 40 rows (two pages of 20 Books each).
[['catalogue/a-light-in-the-attic_1000/index.html', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'A Light in the Attic', 'Three', '£51.77'],
 ['catalogue/tipping-the-velvet_999/index.html', 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'Tipping the Velvet', 'One', '£53.74'],
 .....]
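A quick shape check can confirm the list looks as expected before moving on. The sketch below is only an illustration: check_rows is a hypothetical helper (not part of the series code), and the two rows are copied from the snippet above rather than scraped.

```python
# Two rows copied from the output snippet; a full test run would hold 40.
all_books = [
    ['catalogue/a-light-in-the-attic_1000/index.html',
     'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
     'A Light in the Attic', 'Three', '£51.77'],
    ['catalogue/tipping-the-velvet_999/index.html',
     'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
     'Tipping the Velvet', 'One', '£53.74'],
]


def check_rows(rows, expected_rows, expected_fields=5):
    """Return True when rows has the expected length and every row has
    the expected number of fields (hypothetical helper for illustration)."""
    return len(rows) == expected_rows and all(
        len(row) == expected_fields for row in rows
    )


print(check_rows(all_books, expected_rows=2))
```

After a real two-page run, the call would be check_rows(all_books, expected_rows=40).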
💡 Note: You may want to remove Line [10] before continuing.
Summary
In this article, you learned how to:
- Locate Book details.
- Write code to retrieve this information.
- Save Book details to a List.
What’s Next
In Part 4 of this series, we will clean up the code and save the results to a CSV file.