Story: This series of articles assumes you are a contractor hired by the NHL (National Hockey League) to produce a CSV file based on Team Stats from 1990-2011.
The data for this series is located on a live website in HTML table format.
Note: Before continuing, we recommend you have at least a basic knowledge of HTML and CSS.
Part 1 focused on:
- Describing HTML Tables.
- Reviewing the NHL website.
- Understanding HTTP Status Codes.
- Connecting to the NHL website using the requests library.
- Viewing the HTML code.
- Closing the Open Connection.
Part 2 focuses on:
- Retrieving Total Number of Pages
- Configuring the Page URL
- Creating a While Loop to Navigate Pages
Part 3 focuses on:
- Looping through the NHL web pages.
- Scraping the data from each page.
- Exporting the data to a CSV file.
Preparation
This article assumes you have installed the following libraries from Part 1:
- The Pandas library.
- The Requests library.
- The Beautiful Soup library.
The code snippets in this article rely on the following imports:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
Total Pages Overview
There are two (2) ways to retrieve this information:
- Run Python code to send the HTML code to the terminal window and locate the information needed by scrolling through the HTML code.
- Display the HTML code in the current browser window and use the Inspect tool to locate the required information.
Note: The remainder of these articles use Google Chrome to find the required information (Option 2).
Retrieve Total Pages
Our goal in this section is to retrieve the total pages to scrape. This value will be saved in our Python code to use later.
As indicated on the pagination bar, this value is 24.
To locate the HTML code related to this value, perform the following steps:
- Navigate to the NHL website.
- Scroll down to the pagination bar.
- With your mouse, hover over hyperlink 24.
- Right-click to display a pop-up menu.
- Click to select Inspect. This option opens the HTML code window to the right of the browser window.
The HTML code relating to the selected hyperlink now contains a highlight.
Upon reviewing the HTML code, we can see that the highlighted line is the second-last <li> element/tag in the HTML code. This is confirmed by the </ul> tag, which closes the open <ul> (unordered list) tag.
Good to know! Now let’s reference that in our Python code.
web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    # The second-last <li> on the page holds the final page number.
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())
    print(total_pgs)
    res.close()
else:
    print(f'The following error occurred: {res}')
The key code lines are described below.
- Line [1], the total_pgs assignment, does the following:
  - Uses List Comprehension to collect the text of every <li> tag in soup, the parsed version of res.content. This content contains the HTML code of the NHL’s home page.
  - Uses negative indexing ([-2]) to retrieve the second-last <li> element on the web page (24), then strips the surrounding whitespace and converts the text to an integer.
- Line [2], the print() statement, outputs the contents of total_pgs to the terminal.
- Line [3], res.close(), closes the open connection.
Note: You may want to remove Line [2] (the print() statement) before continuing.
Output
24
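The [-2] index works only while the "next page" link is the last <li> on the page. As a minimal sketch of a more defensive alternative, the helper below picks the largest numeric label instead. The function name last_page_number and the sample texts are hypothetical and used for illustration only; the input is assumed to be shaped like the list produced by [li.text for li in soup.find_all('li')].

```python
def last_page_number(li_texts):
    """Return the largest purely numeric label found in a list of <li> texts."""
    numbers = [int(t.strip()) for t in li_texts if t.strip().isdecimal()]
    if not numbers:
        raise ValueError('No numeric page labels found.')
    return max(numbers)


# Hypothetical <li> texts, shaped like the output of
# [li.text for li in soup.find_all('li')] on a paginated page.
sample_texts = ['\n Sandbox \n', '\n 1 \n', '\n 2 \n', '\n 24 \n', '\n » \n']
print(last_page_number(sample_texts))
```

With this approach, the result does not depend on where the page links sit relative to other <li> elements, as long as the page numbers are the only purely numeric list items.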
Configure Page URL
The next step is to determine how to properly navigate from page to page while performing the scrape operation.
When you first navigate to the NHL site, the URL in the address bar is the following:
https://www.scrapethissite.com/pages/forms/
Let’s see what happens when we click hyperlink [1] in the pagination bar.
The page reloads, and the URL in the address bar changes to the following:
https://www.scrapethissite.com/pages/forms/?page_num=1
Notice the page number appends to the original URL (?page_num=1).
Note: Click other hyperlinks in the pagination bar to confirm this.
We can use this configuration to loop through all pages to scrape!
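The URL pattern above can be sketched with the standard library's urllib.parse.urlencode, which builds the query string for us. The helper name build_page_url is hypothetical; only the base URL and the page_num parameter come from the address bar shown above.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page_num):
    """Append the page_num query string to the base URL."""
    return f'{base_url}?{urlencode({"page_num": page_num})}'


print(build_page_url('https://www.scrapethissite.com/pages/forms/', 3))
```

For a single integer parameter, a plain f-string gives the same result. urlencode mainly helps if more parameters or special characters are added later.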
Creating a While Loop
The code below incorporates a While Loop to navigate through all pages (URLs) of the NHL website.
web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
cur_page = 1
if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())
    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'
        print(pg_url)
        cur_page += 1
    res.close()
else:
    print(f'The following error occurred: {res}')
- Line [1] assigns the NHL’s website URL to the web_url variable.
- Line [2] attempts to connect to the NHL’s website using the requests.get() method. A response object containing the HTTP Status Code returns and saves to the res variable.
- Line [3] creates a new variable cur_page to keep track of the page we are currently on. This variable is initially set to a value of one (1).
- Line [4] initiates an if statement. If the variable res indicates success (a status code below 400, such as 200), the code inside this statement executes.
  - Line [5] parses the HTML content of the current web page (home page).
  - Line [6] uses List Comprehension and negative indexing to retrieve the total pages to scrape. This value saves to total_pgs.
  - Line [7] initiates a While Loop which repeats until cur_page exceeds total_pgs.
    - Line [8] creates a new variable pg_url by combining the variable web_url with the cur_page variable.
    - Line [9] outputs the value of pg_url to the terminal for each loop.
    - Line [10] increases the value of cur_page by one (1).
  - Line [11] closes the open connection.
- Lines [12-13] execute if the value of res indicates an error (a status code of 400 or higher).
Output (snippet)
https://scrapethissite.com/pages/forms/?page_num=1
...
Note: You may want to remove Line [9] before continuing.
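Since the total number of pages is known before the loop starts, the same navigation can also be sketched with range() instead of a manual counter. This is an alternative to the While Loop above, not a replacement for it; the function name page_urls is hypothetical, and the value 24 is the page count found on the pagination bar.

```python
def page_urls(base_url, total_pages):
    """Return one URL per page, numbered 1 through total_pages."""
    return [f'{base_url}?page_num={page}' for page in range(1, total_pages + 1)]


web_url = 'https://scrapethissite.com/pages/forms/'
urls = page_urls(web_url, 24)
print(urls[0])
print(urls[-1])
print(len(urls))
```

Building the full list up front makes it easy to check the first and last URLs before sending any requests.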
We’re almost there!
Summary
In this article, you learned how to:
- Use a Web Browser to locate and retrieve Total Pages.
- Configure the URL to loop through all pages of the NHL website.
What’s Next
In Part 3 of this series, you will learn to identify and parse the <table> tags. Finally, we will put this all together to complete our web scraping app.