Scraping a Bookstore – Part 1


Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into the competitor's pricing structure.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focuses on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape
  • Closing the Open Connection.

Part 2 focuses on:

  • Configuring a page URL for scraping
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

Part 3 focuses on:

  • Locating Book details.
  • Writing code to retrieve this information for all Books.
  • Saving Book details to a List.

Part 4 focuses on:

  • Cleaning up the scraped code.
  • Saving the output to a CSV file.

Preparation

Before any data manipulation can occur, three (3) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The Requests library provides access to the HTTP requests in Python.
  • The Beautiful Soup library enables data extraction from HTML and XML files.

To install these libraries, navigate to an IDE terminal and execute the commands below. The command prompt used in this example is a dollar sign ($); your terminal prompt may be different.

💡 Note: The time library is built-in with Python and does not require installation. This library contains time.sleep() and is used to set a delay between page scrapes. This code is in Part 2.

💡 Note: The urllib library is built-in with Python and does not require installation. This library contains urllib.request and is used to save images. This code is in Part 4.

💡 Note: The csv library is built-in with Python and does not require installation. This library contains reader and writer methods to save data to a CSV file. This code is in Part 4.

$ pip install pandas

$ pip install requests

$ pip install beautifulsoup4

Press the <Enter> key after each command to start the installation process. If an installation is successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guides for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer

Website Review

Let’s navigate to Books to Scrape and review the format.

At first glance, you will notice:

  • Book categories display on the left-hand side.
  • There are, in total, 1,000 books listed on the website.
  • Each web page shows 20 Books.
  • Each price is in £ (in this instance, the UK pound).
  • Each Book displays minimum details.
  • To view complete details for a book, click on the image or the Book Title hyperlink. This hyperlink forwards to a page containing additional book details for the selected item (see below).
  • The total number of website pages displays in the footer (Page 1 of 50).

In case the Owner would like additional details above those displayed on the top-level pages, we will save the sub-page href for each Book.

💡 Note: This series of articles uses the Google Chrome browser.


HTTP Response Codes

When you attempt to connect from your Python code to any URL, an HTTP Response Code returns, indicating the connection status.

This code can be any one of the following:

  • 100–199: Informational responses
  • 200–299: Successful responses
  • 300–399: Redirection messages
  • 400–499: Client error responses
  • 500–599: Server error responses

💡 Note: For a detailed list of HTTP Status Codes, see the HTTP response status codes reference on MDN.
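The ranges above can be captured in a small helper function. The sketch below is purely illustrative (status_category is a hypothetical name, not part of any library used in this series):

```python
def status_category(code: int) -> str:
    """Map an HTTP status code to its broad response category."""
    if 100 <= code <= 199:
        return "Informational"
    if 200 <= code <= 299:
        return "Successful"
    if 300 <= code <= 399:
        return "Redirection"
    if 400 <= code <= 499:
        return "Client error"
    if 500 <= code <= 599:
        return "Server error"
    return "Unknown"


print(status_category(200))  # Successful
print(status_category(404))  # Client error
```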


Connect to Website

Before any scraping can occur, we need to determine if we can successfully connect to this website. We do this using the requests library. If successful, an HTTP Status Code of 200 returns.

Let’s try running this code by performing the following steps:

  • Open an IDE terminal.
  • Create a new Python file (example: books.py).
  • Copy and paste the code below into this file.
  • Save and run this file.
web_url = "https://books.toscrape.com"
res = requests.get(web_url)

if res:
    print(f"{res}")
    res.close()
else:
    print(f"The following error occurred: {res}")
  • Line [1] assigns the Books to Scrape URL to the web_url variable.
  • Line [2] attempts to connect to this website using the requests.get() method. An HTTP Status Code returns and saves to the res variable.
  • Line [3] initiates an if statement. If the connection was successful, the res object evaluates to True (this is the case for any HTTP Status Code below 400, such as 200), and the code inside this statement executes.
    • Line [4] outputs the HTTP Status Code contained in the res variable to the terminal.
    • Line [5] closes the open connection.
  • Lines [6-7] execute if the connection failed (the res object evaluates to False).

Output

<Response [200]>

Great news! The connection to the Books to Scrape website works!

💡 Note: If successful, a connection is made from the Python code to the Books to Scrape website. Remember to close a connection when not in use.

💡 Note: You may want to remove Line [4] before continuing.
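As an aside, the if res: test works because a requests Response object is truthy for status codes below 400 and falsy otherwise. The sketch below constructs a bare Response by hand, something you would not normally do in scraping code, purely to demonstrate this behavior without a network connection:

```python
import requests

# Construct an empty Response object by hand (illustration only).
res = requests.models.Response()

res.status_code = 200
print(bool(res))  # True  -- success codes make the Response truthy

res.status_code = 404
print(bool(res))  # False -- client/server error codes make it falsy
```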


Retrieve Total Pages

Our goal in this section is to retrieve the total number of pages to scrape. This value is saved in our Python code for later use.

As indicated in the footer, this value is 50.

To locate the HTML code relating to this value, perform the following steps:

  • Navigate to the Books to Scrape website.
  • Scroll down to the footer area.
  • With your mouse, hover over the text Page 1 of 50.
  • Right-mouse click to display a pop-up menu.
  • Click to select Inspect. This option opens the HTML code window to the right of the browser window.

The HTML code relating to the chosen text highlights.

Upon review, we notice that the text (Page 1 of 50) is inside an <li> element/tag with the class current. In Beautiful Soup, we can reference this specific <li> using the keyword argument class_='current' (the trailing underscore avoids a clash with Python's reserved word class).

Below, we have added a few lines inside the if statement to retrieve and display this information Pythonically.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])
    print(total_pgs)
    res.close()
else:
    print(f"The following error occurred: {res}")
  • Line [1] initiates an if statement. If the connection was successful, the res object evaluates to True, and the code inside this statement executes.
    • Line [2] parses the HTML code returned from the home page of Books to Scrape into a BeautifulSoup object. This object saves to the soup variable.
    • Line [3] searches inside the HTML code in the soup variable for an element/tag (in this case an <li>) where class_='current'.
      If found, the following occurs:
      • The text of the <li class="current"> tag is retrieved. This tag contains the string Page 1 of 50.
      • All leading and trailing spaces are removed from the string using the strip() method.
      • The split() method splits the string on the space (' ') character. This results in the following list: ['Page', '1', 'of', '50']
      • The last element (index 3) is accessed with [3].
      • The output converts to an integer and saves to total_pgs.
    • Line [4] outputs the contents of total_pgs to the terminal.
    • Line [5] closes the open connection.

Output

50

💡 Note: You may want to remove Line [4] before continuing.

💡 Note: Each website places the total number of pages in a different location. You will need to determine how to retrieve this information on a per-website basis.
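The string-handling steps above (strip, split, index, convert) can be verified in isolation, without a network connection, by feeding them the text the <li> tag would return:

```python
# The text extracted from the <li class="current"> tag, padding included.
raw_text = "\n    Page 1 of 50\n"

parts = raw_text.strip().split(" ")  # ['Page', '1', 'of', '50']
total_pgs = int(parts[3])            # last element, converted to an integer

print(parts)
print(total_pgs)  # 50
```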


Summary

In this article, you learned how to:

  • Review the Books to Scrape website.
  • Understand HTTP Status Codes.
  • Connect to the Books to Scrape website using the requests library.
  • Locate and Retrieve Total Pages using a Web Browser and HTML code.
  • Close the open connection.

What’s Next

In Part 2 of this series, you will learn to configure a URL for scraping and set a time delay.