Scraping a Bookstore – Part 4

Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like to use this information to gain insight into his competitor's pricing structure.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focused on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape
  • Closing the Open Connection.

Part 2 focused on:

  • Configuring a page URL for scraping
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

Part 3 focused on:

  • Locating Book details.
  • Writing code to retrieve this information for all Books.
  • Saving Book details to a List.

Part 4 focuses on:

  • Cleaning up the scraped code.
  • Saving the output to a CSV file.

💡 Note: This article assumes you have completed the steps in Part 1, Part 2, and Part 3.


Preparation

This article assumes you have completed the following from Part 1, Part 2, and Part 3:

  • Installed the required libraries.
  • Successfully connected to the Books to Scrape website.
  • Retrieved the Total Number of pages to scrape.
  • Configured the page URL for scraping.
  • Set a time delay to pause between page scrapes.
  • Scraped and saved Book Details to a List.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import writer

Overview

The Python code from the bottom section of Part 3 has been brought forward. In this section, we will be cleaning up the output before saving it to a CSV file.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:                
                b_href  = article.find('a')['href']
                b_src   = article.find('img')['src']
                b_title = article.find('img')['alt']
                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
print(all_books)

The Sub-Page HREF

The first item we scrape is the sub-page href for each Book (see above). This page contains additional details the Owner may want. However, this is not covered here.

💡 Note: The Finxter Challenge is to write additional code to scrape each sub-page.

To get you started, let’s modify the b_href variable. Currently, it displays a partial URL.

b_href  = article.find('a')['href']

Output (snippet)

catalogue/a-light-in-the-attic_1000/index.html
catalogue/tipping-the-velvet_999/index.html
catalogue/soumission_998/index.html
...

To successfully scrape the sub-pages, we will need a complete URL, not a partial one.

Let’s fix this.

b_href = f"{web_url}/{article.find('a')['href']}"

The above string is formatted using multiple variables to construct a usable URL.

Now if we run the above code, the output should be as shown below.

Output (snippet)

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html

💡 Note: To confirm this code is correct, copy one of the URLs above and open it in your browser.
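As a side note, the standard library's urllib.parse.urljoin offers a more robust way to combine a base URL with a relative href than manual string formatting; it normalizes the slashes for you. A minimal sketch (not part of the original code):

```python
from urllib.parse import urljoin

web_url = "https://books.toscrape.com"
partial = "catalogue/a-light-in-the-attic_1000/index.html"

# urljoin resolves the relative href against the base URL,
# handling missing or duplicate slashes automatically.
b_href = urljoin(web_url + "/", partial)
print(b_href)
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```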


Save the Thumbnail

On the top-level pages, each Book has a thumbnail. This section shows you how to save these thumbnails.

Create a folder thumbs in the current working directory before running the code below.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:                
                b_href = f"{web_url}/{article.find('a')['href']}"

                b_src = f"{web_url}/{article.find('img')['src']}"
                x = b_src.rfind('/')
                urllib.request.urlretrieve(b_src, f'thumbs/{b_src[x+1:]}')

                b_title = article.find('img')['alt']
                
                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
print(all_books)
  • Line [1] scrapes and formats the link to the thumbnail.
  • Line [2] finds the last occurrence of the '/' character and returns its position.
  • Line [3] retrieves the image from its original location and saves it to the thumbs folder in the current working directory.

💡 Note: If you do not want to save the thumbnails, remove Lines [2-3]. For this example, these lines will be removed.
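If you prefer to stay with the requests library already imported above, the download can also be done without urllib.request. The helpers below are our own sketch (the names thumb_filename and download_thumb are not from the original code):

```python
import os
import requests

def thumb_filename(b_src):
    # Return the file name portion of the image URL (the text after the last '/').
    return b_src.rsplit('/', 1)[-1]

def download_thumb(b_src, folder='thumbs'):
    """Download one thumbnail into `folder`; return the saved path, or None on failure."""
    os.makedirs(folder, exist_ok=True)   # create thumbs/ if it does not exist
    res = requests.get(b_src)
    if res.status_code != 200:
        return None
    path = os.path.join(folder, thumb_filename(b_src))
    with open(path, 'wb') as f:          # image data is binary
        f.write(res.content)
    return path
```

Unlike urlretrieve, this version lets you check the HTTP status before writing the file.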


Modify the Title

We notice that in some instances additional text is appended to the Book Title (see output below).

b_title = article.find('img')['alt']

Output (snippet)

...
The Coming Woman: A Novel Based on the Life of the Infamous
Starving Hearts (Triangular Trade Trilogy, #1)
...

Let’s add some code that removes any additional text after the ':' and '(' characters.

For this section, a new function is created and inserted into the code.

def remove_char(string, ch):
    found = string.find(ch)
    if found > 0: return string[0:found]
    return string

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:                
                b_href = f"{web_url}/{article.find('a')['href']}"
                b_src = f"{web_url}/{article.find('img')['src']}"

                b_title = article.find('img')['alt']
                b_title = remove_char(b_title, '(')
                b_title = remove_char(b_title, ':')

                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
  • Line [1] defines a function that accepts two (2) arguments (a string and a single character).
    • Line [2] searches the string for the stated character. If found, its position is returned.
    • Line [3] carves out a sub-string using slicing and returns the new string.
    • Line [4] returns the original string if no match is found.
  • Line [5] scrapes the Book Title and saves it to the b_title variable.
  • Lines [6-7] call the remove_char() function twice: once for each character.

💡 Note: The variable b_src contains the original location of the thumbnail. Depending on your requirements, you may want to modify this.
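As a quick sanity check, you can run remove_char() on the two sample titles by itself. The snippet below repeats the function so it runs on its own; note that trimming at '(' leaves a trailing space, which you could drop with .rstrip() if it matters:

```python
def remove_char(string, ch):
    found = string.find(ch)
    if found > 0: return string[0:found]
    return string

print(remove_char('The Coming Woman: A Novel Based on the Life of the Infamous', ':'))
# The Coming Woman
print(remove_char('Starving Hearts (Triangular Trade Trilogy, #1)', '('))
# 'Starving Hearts ' <- trailing space; add .rstrip() to drop it
print(remove_char('Soumission', ':'))
# Soumission (no match, so the original string is returned)
```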


Modify the Price

As mentioned in Part 1, all Book prices display in £ (in this instance, UK pound).

b_price = article.find('p', class_='price_color').text

Output (snippet)

£52.29

Let’s keep the same pricing but switch the £ currency character to the $ character.

Replace the b_price line above with this line and re-run the code.

b_price = article.find('p', class_='price_color').text.replace('£', '$')

If you review the output, you will see that all occurrences of the £ have now been replaced by a $.

Output (snippet)

$52.29

💡 Note: Depending on your requirements, you may want to remove the £ entirely and convert the data type to a float.
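For example, if you later want to sum or average the prices, a numeric type is easier to work with. A small sketch of that conversion (float rather than int, to keep the pence):

```python
raw = '£52.29'

# Drop the currency symbol, then convert the remaining text to a float.
b_price = float(raw.replace('£', ''))
print(b_price)
# 52.29
```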


Save to a CSV

Now that all the data has been cleaned up, let’s save it to a CSV file.

with open('books.csv', 'w', encoding='UTF8', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(['Sub-Pg', 'Thumb', 'Title', 'Rating', 'Price'])
    
    for c in all_books:
        csv_writer.writerow(c)
  • Line [1] opens a CSV file in write (w) mode using the appropriate encoding and newline character.
    • Line [2] creates a csv_writer object.
    • Line [3] writes the Header Row to the CSV file.
    • Line [4] initiates a for loop that iterates through each row in all_books.
      • Line [5] writes the elements of each row to columns in the CSV file.

Let’s open the CSV file to see what we have.

We have 41 rows! Two (2) pages containing 20 books/page plus the header row.
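Note that pandas was imported at the top of the article but has not been used yet; it offers a quick way to read the CSV back and confirm the contents. A sketch, where load_books() is our own helper name:

```python
import pandas as pd

def load_books(path='books.csv'):
    """Read the scraped CSV back into a DataFrame; the header row becomes the column names."""
    return pd.read_csv(path)

# After running the scraper:
#   df = load_books()
#   df.shape   # (40, 5) for the two-page test run -- pandas does not count the header row
#   df.head()  # first five books
```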


Complete Code

Now that all the testing is complete, you are ready to scrape all 50 pages of Books to Scrape!

The While Loop in the code below is modified to accommodate the scraping of the entire site!

Run the code below to complete the project.

def remove_char(string, ch):
    found = string.find(ch)
    if found > 0: return string[0:found]
    return string

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:                
                b_href = f"{web_url}/{article.find('a')['href']}"
                b_src = f"{web_url}/{article.find('img')['src']}"

                b_title = article.find('img')['alt']
                b_title = remove_char(b_title, '(')
                b_title = remove_char(b_title, ':')

                b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text.replace('£', '$')
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
            break
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")

with open('books.csv', 'w', encoding='UTF8', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(['Sub-Pg', 'Thumb', 'Title', 'Rating', 'Price'])
    
    for c in all_books:
        csv_writer.writerow(c)

The books.csv should now contain a total of 1,001 rows: 1,000 book details and a header row!

Congratulations! Onward and Upward!