Story: This series of articles assumes you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the `requests` library.
- Retrieving Total Pages to Scrape.
- Closing the Open Connection.
Part 2 focused on:
- Configuring a page URL for scraping
- Setting a delay: `time.sleep()` to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
Part 3 focused on:
- Locating Book details.
- Writing code to retrieve this information for all Books.
- Saving Book details to a List.
Part 4 focuses on:
- Cleaning up the scraped code.
- Saving the output to a CSV file.
💡 Note: This article assumes you have completed the steps in Part 1, Part 2, and Part 3.
Preparation
This article assumes you have completed the following from Part 1, Part 2, and Part 3:
- Installed the required libraries.
- Successfully connected to the Books to Scrape website.
- Retrieved the Total Number of pages to scrape.
- Configured the page URL for scraping.
- Set a time delay to pause between page scrapes.
- Scraped and saved Book Details to a List.
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer
```
Overview
The Python code from the bottom section of Part 3 has been brought forward. In this section, we will be cleaning up the output before saving it to a CSV file.
```python
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:
                b_href = article.find('a')['href']
                b_src = article.find('img')['src']
                b_title = article.find('img')['alt']
                b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")

print(all_books)
```
The Sub-Page HREF
The first item we scrape is the sub-page `href` for each Book (see above). This page contains additional details the Owner may want; however, they are not covered in this article.
💡 Note: The Finxter Challenge is to write additional code to scrape each sub-page.
To get you started, let’s modify the `b_href` variable. Currently, it displays a partial URL.
b_href = article.find('a')['href']
Output (snippet)
catalogue/a-light-in-the-attic_1000/index.html
To successfully scrape the sub-pages, we will need a complete URL, not a partial one.
Let’s fix this.
b_href = f"{web_url}/{article.find('a')['href']}"
The above f-string combines the base `web_url` with the relative href to construct a usable URL.
Now if we run the above code, the output should be as shown below.
Output (snippet)
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
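As an aside, the manual f-string works here because `web_url` has no trailing slash and the scraped href is relative. A slightly more robust alternative (not used in this series) is `urllib.parse.urljoin()`, which handles the slashes for you. A minimal sketch:

```python
from urllib.parse import urljoin

web_url = "https://books.toscrape.com"
relative_href = "catalogue/a-light-in-the-attic_1000/index.html"

# urljoin() inserts or removes the slash between the two parts as needed.
# The trailing slash on the base ensures the relative path is appended
# to the site root rather than replacing the last path segment.
full_url = urljoin(f"{web_url}/", relative_href)
print(full_url)
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```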
💡 Note: To confirm this code is correct, navigate to one of the output URLs in your browser and verify that it resolves.
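If you want to attempt the Finxter Challenge mentioned above, the sketch below fetches one sub-page using the full URL we just built and pulls out its product description. It assumes the sub-page marks the description with a `div` whose id is `product_description`, followed by a `<p>` tag; inspect the page in your browser to confirm the structure before relying on it.

```python
import requests
from bs4 import BeautifulSoup

def scrape_sub_page(book_url):
    """Fetch one Book sub-page and return its product description (or None)."""
    res = requests.get(book_url)
    if not res:
        print(f"The following error occurred: {res}")
        return None
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    # Assumption: the description header has id='product_description' and the
    # description text sits in the next <p> sibling. Verify in your browser.
    header = soup.find('div', id='product_description')
    if header:
        return header.find_next_sibling('p').text
    return None

print(scrape_sub_page(
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'))
```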
Save the Thumbnail
On the top-level pages, each Book has a thumbnail. This section shows you how to save these thumbnails.
Create a folder named `thumbs` in the current working directory before running the code below.
```python
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:
                b_href = f"{web_url}/{article.find('a')['href']}"
                b_src = f"{web_url}/{article.find('img')['src']}"           # Line [1]
                x = b_src.rfind('/')                                        # Line [2]
                urllib.request.urlretrieve(b_src, f'thumbs/{b_src[x+1:]}')  # Line [3]
                b_title = article.find('img')['alt']
                b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")

print(all_books)
```
- Line [1] scrapes and formats the link to the thumbnail.
- Line [2] finds the last occurrence of the `'/'` character and returns its index.
- Line [3] retrieves the image from the original location and saves it to the `thumbs` folder in the current working directory.
Output (snippet)
💡 Note: If you do not want to save the thumbnails, remove Lines [2-3]. For this example, these lines will be removed.
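If you do keep the thumbnail download, the filename extraction and download step can be tested on its own. The sketch below is illustrative: the thumbnail URL was taken from an earlier scrape of page 1, so substitute a `b_src` value from your own run if it has changed; the sketch also creates the `thumbs` folder if it does not already exist.

```python
import os
import urllib.request

# Illustrative thumbnail URL; replace with a b_src value from your own scrape.
thumb_url = "https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# Make sure the target folder exists before downloading.
os.makedirs('thumbs', exist_ok=True)

# rfind('/') returns the index of the last slash; slicing from x+1
# keeps only the filename portion of the URL.
x = thumb_url.rfind('/')
file_name = thumb_url[x + 1:]

# Download the image and save it under thumbs/ using the original filename.
urllib.request.urlretrieve(thumb_url, f'thumbs/{file_name}')
print(f"Saved thumbs/{file_name}")
```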
Modify the Title
We notice that in some instances additional text is appended to the Book Title (see output below).
b_title = article.find('img')['alt']
Output (snippet)
...
Let’s add some code that removes the `':'` or `'('` character and everything after it.
For this section, a new function is created and inserted into the code.
```python
def remove_char(string, ch):                   # Line [1]
    found = string.find(ch)                    # Line [2]
    if found > 0:
        return string[0:found]                 # Line [3]
    return string                              # Line [4]


web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:
                b_href = f"{web_url}/{article.find('a')['href']}"
                b_src = f"{web_url}/{article.find('img')['src']}"
                b_title = article.find('img')['alt']       # Line [5]
                b_title = remove_char(b_title, '(')        # Line [6]
                b_title = remove_char(b_title, ':')        # Line [7]
                b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")
```
- Line [1] defines a function and passes two (2) arguments to it (a string and a single character).
- Line [2] searches the string for the stated character. If found, its location (index) is returned.
- Line [3] if found, a sub-string is carved out using slicing and the new string is returned.
- Line [4] returns the original string if no match is found.
- Line [5] scrapes the Book Title and saves it to the `b_title` variable.
- Lines [6-7] call the `remove_char()` function twice, once for each character (a standalone usage example is shown below).
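To see `remove_char()` in isolation, here is a quick usage example on two made-up titles (the strings are purely illustrative). Note that the function keeps everything before the character, so a trailing space may remain, which you could remove with `strip()` if needed.

```python
def remove_char(string, ch):
    # Index of the first occurrence of ch; -1 if it is not present.
    found = string.find(ch)
    if found > 0:
        # Keep everything before the character.
        return string[0:found]
    return string

print(remove_char("It's Only the Himalayas (And Other Stories)", '('))
# "It's Only the Himalayas "  <-- note the trailing space
print(remove_char("Libertarianism for Beginners: A Primer", ':'))
# "Libertarianism for Beginners"
```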
💡 Note: The variable `b_src` still contains the original (remote) location of the thumbnail. Depending on your requirements, you may want to modify this; one option is sketched below.
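For example, if you saved the thumbnails earlier, you might prefer to record the local path rather than the remote URL. The helper below is a hypothetical sketch that mirrors the `rfind()` slicing used above; the URL passed to it is illustrative.

```python
def local_thumb_path(remote_src, folder='thumbs'):
    """Return the local path a thumbnail was saved to, given its remote URL."""
    x = remote_src.rfind('/')
    return f"{folder}/{remote_src[x + 1:]}"

# Illustrative remote URL; substitute a real b_src value from your scrape.
print(local_thumb_path("https://books.toscrape.com/media/cache/ab/cd/example-thumb.jpg"))
# thumbs/example-thumb.jpg
```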
Modify the Price
As mentioned in Part 1, all Book prices display in £ (in this instance, UK pound).
b_price = article.find('p', class_='price_color').text
Output (snippet)
£52.29
Let’s keep the same pricing but switch the £ currency character to the $ character.
Replace the `b_price` line above with this line and re-run the code.
b_price = article.find('p', class_='price_color').text.replace('£', '$')
If you review the output, you will see that all occurrences of the £ have now been replaced by a $.
Output (snippet)
$52.29
💡 Note: Depending on your requirements, you may want to remove the £ entirely and convert the data type to a float; a brief sketch follows.
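A minimal sketch of that approach: strip the currency symbol and convert the remaining text to a float so the prices can be used in calculations later.

```python
raw_price = '£52.29'  # example value as scraped from the site

# Remove the currency symbol and convert the text to a numeric type.
b_price = float(raw_price.replace('£', ''))
print(b_price)  # 52.29
```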
Save to a CSV
Now that all the data has been cleaned up, let’s save it to a CSV file.
```python
with open('books.csv', 'w', encoding='UTF8', newline='') as csv_file:        # Line [1]
    csv_writer = writer(csv_file)                                            # Line [2]
    csv_writer.writerow(['Sub-Pg', 'Thumb', 'Title', 'Rating', 'Price'])     # Line [3]

    for c in all_books:                                                      # Line [4]
        csv_writer.writerow(c)                                               # Line [5]
```
- Line [1] opens a CSV file in write (w) mode using the appropriate encoding and newline character.
- Line [2] creates a `csv_writer` object.
- Line [3] writes the Header Row to the CSV file.
- Line [4] initiates a `for` loop that iterates over each row in `all_books`.
- Line [5] writes the elements of each row to the columns of a CSV row (a one-call alternative is sketched below).
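As a side note, the `for` loop in Lines [4-5] can be collapsed into a single call to `writerows()`, which writes every element of `all_books` as its own row. A brief sketch with illustrative data:

```python
from csv import writer

# Illustrative data; in the project this list is built by the scraping loop.
all_books = [
    ['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
     'https://books.toscrape.com/media/cache/ab/cd/example-thumb.jpg',
     'A Light in the Attic', 'Three', '$51.77'],
]

with open('books.csv', 'w', encoding='UTF8', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(['Sub-Pg', 'Thumb', 'Title', 'Rating', 'Price'])
    csv_writer.writerows(all_books)  # one call instead of a for loop
```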
Let’s open the CSV file to see what we have.
We have 41 rows! Two (2) pages containing 20 books/page plus the header row.
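Since pandas was imported at the top of the script, one quick way to inspect the result is to read the CSV back into a DataFrame. This assumes the `books.csv` from the two-page test run is in the current working directory.

```python
import pandas as pd

# The header row becomes the column names, so the DataFrame itself
# should hold 40 data rows (2 pages x 20 books) across 5 columns.
df = pd.read_csv('books.csv')
print(df.shape)  # (40, 5)
print(df.head())
```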
Complete Code
Now that all the testing is complete, you are ready to scrape all 50 pages of Books to Scrape!
The while loop in the code below is modified to use `total_pgs` so that all pages of the site are scraped!
Run the code below to complete the project.
```python
def remove_char(string, ch):
    found = string.find(ch)
    if found > 0:
        return string[0:found]
    return string


web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []

if res:
    soup = BeautifulSoup(res.text, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        res1 = requests.get(pg_url)

        if res1:
            soup = BeautifulSoup(res1.text, "html.parser")
            print(f"Scraping: {pg_url}")

            all_articles = soup.find_all('article')
            for article in all_articles:
                b_href = f"{web_url}/{article.find('a')['href']}"
                b_src = f"{web_url}/{article.find('img')['src']}"
                b_title = article.find('img')['alt']
                b_title = remove_char(b_title, '(')
                b_title = remove_char(b_title, ':')
                b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
                b_price = article.find('p', class_='price_color').text.replace('£', '$')
                all_books.append([b_href, b_src, b_title, b_rtg, b_price])
            cur_page += 1
            time.sleep(2)
        else:
            print(f"The following error occurred: {res1}")
    res.close()
    res1.close()
else:
    print(f"The following error occurred: {res}")

with open('books.csv', 'w', encoding='UTF8', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(['Sub-Pg', 'Thumb', 'Title', 'Rating', 'Price'])

    for c in all_books:
        csv_writer.writerow(c)
```
The `books.csv` file should now contain a total of 1,001 rows: 1,000 book details and a header row!
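The `reader` imported at the top of the script can be used for a quick sanity check on the row count.

```python
from csv import reader

# Count every row in the finished file, including the header row.
with open('books.csv', 'r', encoding='UTF8', newline='') as csv_file:
    row_count = sum(1 for _ in reader(csv_file))

print(row_count)  # expected: 1001 (1,000 books + 1 header row)
```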
Congratulations! Onward and Upward!