How to Scrape HTML Tables – Part 2


Story: This series of articles assumes you are a contractor hired by the NHL (National Hockey League) to produce a CSV file based on Team Stats from 1990-2011.

The data for this series is located on a live website in HTML table format.

💡 Note: Before continuing, we recommend you have at least a basic knowledge of HTML and CSS.


Part 1 focused on:

  • Describing HTML Tables.
  • Reviewing the NHL website.
  • Understanding HTTP Status Codes.
  • Connecting to the NHL website using the requests library.
  • Viewing the HTML code.
  • Closing the Open Connection.

Part 2 focuses on:

  • Retrieving Total Number of Pages
  • Configuring the Page URL
  • Creating a While Loop to Navigate Pages

Part 3 focuses on:

  • Looping through the NHL web pages.
  • Scraping the data from each page.
  • Exporting the data to a CSV file.

Preparation

This article assumes you have installed the Pandas, Requests, and Beautiful Soup libraries from Part 1.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

Total Pages Overview

There are two (2) ways to locate the total number of pages in the HTML code:

  1. Run Python code to send the HTML code to the terminal window and locate the information needed by scrolling through the HTML code.
  2. Display the HTML code in the current browser window and use the Inspect tool to locate the required information.

💡 Note: The remainder of this series uses Google Chrome to find the required information (Option 2).


Retrieve Total Pages

Our goal in this section is to retrieve the total pages to scrape. This value will be saved in our Python code to use later.

As indicated on the pagination bar, this value is 24.

To locate the HTML code related to this value, perform the following steps:

  • Navigate to the NHL website.
  • Scroll down to the pagination bar.
  • With your mouse, hover over hyperlink 24.
  • Right-click to display a pop-up menu.
  • Click to select Inspect. This option opens the HTML code window to the right of the browser window.

The HTML code relating to the selected hyperlink now contains a highlight.

Upon reviewing the HTML code, we can see that the highlighted line is the second-to-last <li> element/tag in the HTML code. This is confirmed by the </ul> tag that follows shortly after, which closes the open <ul> (unordered list) tag.

Good to know! Now let’s reference that in our Python code.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())
    print(total_pgs)
    res.close()
else:
    print(f'The following error occurred: {res}')

The last three lines inside the if statement are described below.

  • Line [1] (total_pgs) does the following:
    • Uses List Comprehension to loop through all <li> tags found in soup and collect the text of each. The soup object holds the parsed HTML code of the NHL page.
    • Uses negative indexing ([-2]) to retrieve the second-to-last <li> element on the web page (24).
    • Uses strip() to remove any trailing and leading spaces from the string.
    • Uses int() to convert the string to an integer.
    • Saves the above value to total_pgs.
  • Line [2] outputs the contents of total_pgs to the terminal.
  • Line [3] closes the open connection.

💡 Note: You may want to remove Line [2] before continuing.

Output

24
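Taking the second-to-last <li> on the whole page works here, but it would return the wrong element if another list appeared after the pagination bar. A more defensive approach could look like the sketch below. It assumes the pagination bar is a <ul> with the class pagination (an assumption, not confirmed in this article), and it runs against a made-up HTML sample rather than the live site. The get_total_pages() helper is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics a pagination bar; the real markup may differ.
SAMPLE_HTML = """
<ul class="nav"><li>Home</li><li>About</li></ul>
<ul class="pagination">
  <li><a href="/pages/forms/?page_num=1"> 1 </a></li>
  <li><a href="/pages/forms/?page_num=2"> 2 </a></li>
  <li><a href="/pages/forms/?page_num=24"> 24 </a></li>
  <li><a href="/pages/forms/?page_num=2" aria-label="Next">»</a></li>
</ul>
"""


def get_total_pages(html):
    """Return the largest page number found in the pagination list."""
    soup = BeautifulSoup(html, 'html.parser')
    pagination = soup.find('ul', class_='pagination')  # assumed class name
    if pagination is None:
        raise ValueError('No pagination list found')

    # Keep only the labels that are plain numbers (skips arrows such as ».)
    labels = [li.get_text(strip=True) for li in pagination.find_all('li')]
    numbers = [int(label) for label in labels if label.isdigit()]
    if not numbers:
        raise ValueError('No page numbers found')
    return max(numbers)


total_pgs = get_total_pages(SAMPLE_HTML)
print(total_pgs)
```

Scoping the search to one list and taking the largest number means the result does not depend on where the "next" arrow sits or on other lists elsewhere on the page.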

Configure Page URL

The next step is to determine how to properly navigate from page to page while performing the scrape operation.

When you first navigate to the NHL site, the URL in the address bar is the following:

https://www.scrapethissite.com/pages/forms/

Let’s see what happens when we click hyperlink [1] in the pagination bar.

The page reloads, and the URL in the address bar changes to the following:

https://www.scrapethissite.com/pages/forms/?page_num=1

Notice the page number is appended to the original URL as a query string (?page_num=1).

💡 Note: Click other hyperlinks in the pagination bar to confirm this.

We can use this configuration to loop through all pages to scrape!
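As a minimal sketch of this URL pattern, the query string can be built with urlencode() from Python's standard library instead of by hand. The build_page_url() helper and the BASE_URL constant are hypothetical names introduced for illustration only.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.scrapethissite.com/pages/forms/'


def build_page_url(page_num, base_url=BASE_URL):
    """Append the page_num query parameter to the base URL."""
    query = urlencode({'page_num': page_num})  # e.g. 'page_num=1'
    return f'{base_url}?{query}'


print(build_page_url(1))
print(build_page_url(24))
```

For a single numeric parameter, plain string formatting gives the same result. urlencode() becomes more useful when a value contains spaces or other characters that need escaping.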


Creating a While Loop

The code below incorporates a While Loop to navigate through all pages (URLs) of the NHL website.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())

    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'
        print(pg_url)
        cur_page += 1
    res.close()
else:
    print(f'The following error occurred: {res}')

  • Line [1] assigns the NHL’s website URL to the web_url variable.
  • Line [2] attempts to connect to the NHL’s website using the requests.get() method. The returned Response object, which includes the HTTP Status Code, saves to the res variable.
  • Line [3] creates a new variable cur_page to keep track of the page we are currently on. This variable is initially set to a value of one (1).
  • Line [4] initiates an if statement. The variable res evaluates to True when the Status Code is below 400 (for example, 200 for success), in which case the code inside this statement executes.
    • Line [5] parses the HTML content of the current web page (home page).
    • Line [6] uses List Comprehension and negative indexing to retrieve the total pages to scrape. This value saves to total_pgs.
    • Line [7] initiates a While Loop which repeats as long as cur_page is less than or equal to total_pgs.
      • Line [8] creates a new variable pg_url by combining the variable web_url with the cur_page variable.
      • Line [9] outputs the value of the pg_url to the terminal for each loop.
      • Line [10] increases the value of cur_page by one (1).
    • Line [11] closes the open connection.
  • Lines [12-13] execute if the request failed (a Status Code of 400 or higher).

Output (snippet)

https://scrapethissite.com/pages/forms/?page_num=1
https://scrapethissite.com/pages/forms/?page_num=2
https://scrapethissite.com/pages/forms/?page_num=3

...
https://scrapethissite.com/pages/forms/?page_num=24

💡 Note: You may want to remove Line [9] before continuing.
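Because the number of pages is known before the loop starts, the same navigation could also be written with a for loop over range(), which removes the manual counter. The sketch below is an alternative, not the approach used in this series. The iter_page_urls() helper and its delay parameter are hypothetical, and the pause between pages is an assumed courtesy to the server rather than a requirement stated in this article.

```python
import time


def iter_page_urls(web_url, total_pgs, delay=0.0):
    """Yield one URL per page, pausing `delay` seconds between pages."""
    for cur_page in range(1, total_pgs + 1):
        yield f'{web_url}?page_num={cur_page}'
        # Pause only between pages, not after the last one.
        if delay and cur_page < total_pgs:
            time.sleep(delay)


urls = list(iter_page_urls('https://scrapethissite.com/pages/forms/', 24))
print(urls[0])
print(urls[-1])
print(len(urls))
```

When real requests are added inside the loop, passing a small delay (for example, delay=1) spaces out the calls.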

We’re almost there!


Summary

In this article, you learned how to:

  • Use a Web Browser to locate and retrieve Total Pages.
  • Configure the URL to loop through all pages of the NHL website.

What’s Next

In Part 3 of this series, you will learn to identify and parse the <table> tags. Finally, we will put this all together to complete our web scraping app.