Scraping a Bookstore – Part 3

Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like to use this information to gain insight into his pricing structure.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focused on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape
  • Closing the Open Connection.

Part 2 focused on:

  • Configuring a page URL for scraping
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

Part 3 focuses on:

  • Locating Book details.
  • Writing code to retrieve this information for all Books.
  • Saving Book details to a List.

Part 4 focuses on:

  • Cleaning up the scraped code.
  • Saving the output to a CSV file.

Preparation

This article assumes you have completed the following from Part 1 and Part 2:

  • Installed the required libraries.
  • Successfully connected to the Books to Scrape website.
  • Retrieved the Total Number of pages to scrape.
  • Configured the page URL for scraping.
  • Set a time delay to pause between page scrapes.
  • Successfully looped through two (2) test pages.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer

Overview

Each Book on the top-level pages of the Books to Scrape website contains a:

  • Thumbnail image.
  • Book Title hyperlink.
  • Price.
  • In stock reference.
  • Add to basket Button.

This section will scrape two (2) of these top-level pages.


Locate Book Details

Navigating through the site shows that the HTML layout for each Book is identical across all pages.

To view the HTML code associated with each Book, perform the following steps:

  • Open a browser and navigate to the Books to Scrape website.
  • With the mouse, hover over any thumbnail.
  • Right-mouse click to display a pop-up menu.
  • Click to select the Inspect menu item. This option opens the HTML code window to the right of the browser window.

Upon reviewing the HTML code, we notice that the highlighted <img> tag is wrapped inside <article class="product_pod"></article> tags.

Let’s confirm this by using our mouse to hover over the <article class="product_pod"> tag in the HTML code.

If correct, the corresponding Book on the left side of the browser window is highlighted.

Great! We can work with this!
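
If you prefer to confirm this from Python rather than the browser, a minimal sketch such as the one below does the same job. It fetches the first page and prints the HTML for the first <article class="product_pod"> tag (it assumes the requests and beautifulsoup4 libraries installed in Part 1):

import requests
from bs4 import BeautifulSoup

# Fetch the first top-level page of the Books to Scrape website.
res = requests.get("https://books.toscrape.com")
soup = BeautifulSoup(res.text, 'html.parser')

# Print the HTML for the first Book only.
first_book = soup.find('article', class_='product_pod')
print(first_book.prettify())
res.close()

The printed block should contain the thumbnail <img>, the Book Title hyperlink, the star rating, the price, and the In stock reference described in the Overview above.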


Let’s move back to an IDE and write some Python Code!

💡 Note: The code below has been brought forward from Part 2. The lines in yellow are new or modified.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:                
                b_href  = article.find('a')['href']
                b_src   = article.find('img')['src']
                b_title = article.find('img')['alt']
                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break  # stop scraping if a page fails to load
    res.close()
    res1.close()
else:
    print(f"The following error occured: {res}")
print(all_books)
  • Line [1] declares the list variable all_books.
  • Line [2] locates all <article> tags on the current web page. This output saves to all_articles.
  • Line [3] initiates a for loop to traverse through each <article></article> tag on the current page.
    • Line [4] retrieves and saves the href value to the b_href variable.
    • Line [5] retrieves and saves the image source to the b_src variable.
    • Line [6] retrieves and saves the title to the b_title variable.
    • Line [7] retrieves and saves the rating to the b_rtg variable (see the rating sketch after this list).
    • Line [8] retrieves and saves the price to the b_price variable.
    • Line [9] appends this information to the all_books list created earlier.
  • Line [10] outputs the contents of all_books to the terminal.
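
The rating on Line [7] deserves a closer look: Books to Scrape stores the star rating as the second CSS class of the <p class="star-rating ..."> tag rather than as text. The standalone sketch below shows what that lookup returns for the first Book on page 1:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://books.toscrape.com/catalogue/page-1.html")
soup = BeautifulSoup(res.text, 'html.parser')

# The rating is the second CSS class on the <p class="star-rating ..."> tag.
rating_tag = soup.find('article').find('p', class_='star-rating')
print(rating_tag.attrs.get('class'))     # e.g. ['star-rating', 'Three']
print(rating_tag.attrs.get('class')[1])  # e.g. 'Three'
res.close()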

Output (Snippet)

The contents of all_books[] should now contain 40 rows.

[['catalogue/a-light-in-the-attic_1000/index.html', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'A Light in the Attic', 'Three', '£51.77'], ['catalogue/tipping-the-velvet_999/index.html', 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'Tipping the Velvet', 'One', '£53.74'], .....]

💡 Note: You may want to remove Line [10] before continuing.
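
If you still want quick confirmation before continuing, replacing Line [10] with a simple length check keeps the terminal output short:

print(len(all_books))  # expected: 40 (2 pages x 20 Books per page)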


Summary

In this article, you learned how to:

  • Locate Book details.
  • Write code to retrieve this information.
  • Save Book details to a List.

What’s Next

In Part 4 of this series, we will clean up the code and save the results to a CSV file.