Scraping a Bookstore – Part 2

Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focused on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape.
  • Closing the Open Connection.

Part 2 focuses on:

  • Configuring a page URL for scraping.
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

Part 3 focuses on:

  • Locating Book details.
  • Writing code to retrieve this information for all Books.
  • Saving Book details to a List.

Part 4 focuses on:

  • Cleaning up the scraped code.
  • Saving the output to a CSV file.

Preparation

This article assumes you have completed the following from Part 1:

  • Installed the required libraries.
  • Successfully connected to the Books to Scrape website.
  • Retrieved the Total Number of pages to scrape.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer

Configure Page URL

The next step is to determine how to properly navigate from page to page while performing the scrape operation.

When you first navigate to the Books to Scrape site, the URL in the address bar is the following:

https://books.toscrape.com/index.html

Let’s see what happens when we click next in the footer area.

We are taken to page 2 of the website, and the URL in the address bar changes to the following:

https://books.toscrape.com/catalogue/page-2.html

Now, let’s navigate to the footer area and click the previous button.

We are taken back to page 1 of the website, and the URL in the address bar changes to:

https://books.toscrape.com/catalogue/page-1.html

Notice how this URL differs from the one we started with.

The following are appended to the base URL, https://books.toscrape.com:

  • a subdirectory: /catalogue/
  • a file name: page-x.html, where x is the number of the page you are currently on.

💡 Note: Click next and previous in the footer area to confirm this.
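The URL pattern described above can be captured in a small helper. This is only a minimal sketch: the names BASE_URL and build_page_url are our own and do not appear in the article's code, and it assumes the /catalogue/page-x.html pattern holds for every page.

```python
# Base address of the site, as shown in the address bar (without index.html).
BASE_URL = "https://books.toscrape.com"


def build_page_url(page: int, base_url: str = BASE_URL) -> str:
    """Return the catalogue URL for a given page number (hypothetical helper)."""
    if page < 1:
        # Page numbering on the site starts at 1.
        raise ValueError("page must be 1 or greater")
    return f"{base_url}/catalogue/page-{page}.html"


# Build the URLs for the first two pages.
for page in (1, 2):
    print(build_page_url(page))
```

Running the snippet prints the same two URLs you saw in the address bar while clicking next and previous.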

We can work with this!

Let’s move to an IDE and write Python code to configure this changing URL.

💡 Note: The code below has been brought forward from Part 1. The lines in yellow are either new or modified.

At this point, we recommend you not loop through all 50 pages of the website. Instead, let’s change the While Loop to navigate through just two (2) pages.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break  # stop here; otherwise the same page is requested forever
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
  • Line [1] creates a new variable cur_page to keep track of the page we are currently on. The initial value is one (1).
  • Line [2] initiates a While Loop which repeats as long as cur_page is less than or equal to 2. The variable total_pgs has been commented out while in test mode.
    • Line [3] creates a new variable pg_url by combining the variables web_url and cur_page.
      Example: https://books.toscrape.com/catalogue/page-1.html
    • Line [4] attempts to connect to the pg_url stated on Line [3]. The response, including its HTTP Status Code (200 if successful), saves to res1.
    • Line [5] initiates an if statement. If the request on Line [4] was successful, the code below executes.
      • Line [6] parses the HTML code retrieved from pg_url. This output saves to the soup variable.
      • Line [7] outputs a message to the terminal.
      • Line [8] increases the value of cur_page by one (1).
      • Line [9] pauses the code for two (2) seconds between pages using time.sleep().
    • The else branch executes if the request was not successful, that is, if res1 holds an error status code (400 or higher).
  • The res.close() and res1.close() lines close the open connections.

💡 Note: To comment out code in Python, use the # character. This prevents everything else on the current Line from executing.
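As an alternative to commenting out total_pgs, the page limit could be switched with a flag. The snippet below is a rough sketch only: TEST_MODE, last_page, and visited are names we made up for illustration, and total_pgs is hard-coded to 50 here as a placeholder for the value the article retrieves from the website.

```python
TEST_MODE = True  # hypothetical flag: set to False to scrape every page
total_pgs = 50    # placeholder; the article reads this value from the website

# Everything after the # character on a line is ignored by Python.
last_page = 2 if TEST_MODE else total_pgs

cur_page = 1
visited = []
while cur_page <= last_page:
    visited.append(cur_page)  # stand-in for the actual scraping work
    cur_page += 1

print(visited)  # [1, 2]
```

With this approach, switching from test mode to a full scrape means changing one value instead of editing the loop condition.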

The modified code executes twice, as depicted by the output below:

Output

Scraping: https://books.toscrape.com/catalogue/page-1.html
Scraping: https://books.toscrape.com/catalogue/page-2.html
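For comparison, the same paging logic could also be written with a for loop over range(). The following is a sketch under our own assumptions, not the article's code: scrape_pages, fetch, and fake_fetch are hypothetical names, and the request itself is passed in as a function so the example runs without a network connection.

```python
import time
from typing import Callable, List

BASE_URL = "https://books.toscrape.com"


def scrape_pages(fetch: Callable[[str], bool], last_page: int,
                 delay: float = 2.0) -> List[str]:
    """Visit pages 1..last_page, pausing between requests.

    fetch is any function that takes a URL and returns True on success.
    The loop stops at the first page that fails.
    """
    scraped = []
    for page in range(1, last_page + 1):
        pg_url = f"{BASE_URL}/catalogue/page-{page}.html"
        if not fetch(pg_url):
            print(f"Request failed: {pg_url}")
            break
        print(f"Scraping: {pg_url}")
        scraped.append(pg_url)
        if page < last_page:
            time.sleep(delay)  # be polite: wait before the next request
    return scraped


def fake_fetch(url: str) -> bool:
    """Offline stand-in for a real request; always reports success."""
    return True


urls = scrape_pages(fake_fetch, last_page=2, delay=0)
```

To try this against the live website, you could pass something like lambda url: bool(requests.get(url)) as fetch and keep the default two-second delay.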

Summary

In this article, you learned how to:

  • Configure a page URL for scraping.
  • Set a delay: time.sleep() to pause between page scrapes.
  • Loop through two (2) pages for testing purposes.

What’s Next

In Part 3 of this series, you will learn to identify additional elements/tags inside the HTML code.