How to Scrape HTML Tables – Part 3

Story: This series of articles assumes you are a contractor hired by the NHL (National Hockey League) to produce a CSV file based on Team Stats from 1990-2011.

The data for this series is located on a live website in HTML table format.

πŸ’‘ Note: Before continuing, we recommend you possess, at a minimum, a basic knowledge of HTML and CSS.


Part 1 focused on:

  • Describing HTML Tables.
  • Reviewing the NHL website.
  • Understanding HTTP Status Codes.
  • Connecting to the NHL website using the requests library.
  • Viewing the HTML code.
  • Closing the Open Connection.

Part 2 focused on:

  • Retrieving Total Number of Pages
  • Configuring the Page URL
  • Creating a While Loop to Navigate Pages

Part 3 focuses on:

  • Looping through the NHL web pages.
  • Scraping the data from each page.
  • Exporting the data to a CSV file.

This article assumes you have installed the following libraries from Part 1:

  • The Pandas library.
  • The Requests library.
  • The Beautiful Soup library.

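If any of these are missing, they can be installed from the terminal using their standard PyPI package names:

pip install pandas
pip install requests
pip install beautifulsoup4
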
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd              # DataFrame creation and CSV export
import requests                  # HTTP requests to the website
from bs4 import BeautifulSoup    # HTML parsing
import time                      # polite delay between page requests

Overview

This article builds on the Python file (hockey.py) created in Part 1 and updated in Part 2 (see below).

If you require clarification on the code lines below, click here to navigate to Part 2 of this series.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())
   
    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'

        cur_page += 1
    res.close()
else:
    print(f'The following error occurred: {res}')
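
As a reminder from Part 1, a requests.Response object evaluates as truthy when the request succeeded (status codes below 400), so the if res: test above is a compact status check. A minimal equivalent using the explicit ok property:

res = requests.get('https://scrapethissite.com/pages/forms/')
if res.ok:                        # True for HTTP status codes below 400
    print(res.status_code)        # e.g., 200 (OK)
else:
    print(f'Request failed with status {res.status_code}')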

Retrieve Table Data

The final piece of information we need to retrieve is the data wrapped inside the HTML tables on the NHL website.

Let’s start by performing the following steps:

  • Navigate to the home page of the NHL website.
  • With the mouse, hover over the top part of the table (Team Name).
  • Right-mouse click to display a pop-up menu.
  • Click to select Inspect. This option opens the HTML code window to the right of the browser window.

With the HTML code in view (on the right), hover over the <table> tag. This highlights the corresponding table on the left of the browser window.

<table class="table">

The <table> tag includes a reference to a class (<table class="table">). In HTML, the class attribute identifies one or more elements so they can be styled or selected. We will reference this class in our Python code.

Now we need to write some Python code to access and loop through each element/tag of the table data.

πŸ’‘ Note: Click here for a detailed explanation of the HTML class.
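
As a quick illustration, the minimal sketch below (assuming soup is a BeautifulSoup object parsed from the page HTML, as in the snippets above) shows two equivalent ways to locate the table by its class:

# Both lines locate the first <table> element whose class attribute is 'table'.
table = soup.find('table', class_='table')    # keyword-argument style
table = soup.select_one('table.table')        # CSS-selector style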

The below code puts together everything you will need to scrape the NHL site.

The highlighted code lines are described below.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
all_recs = []
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())

    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'
        print(f'Scraping: {pg_url}')

        res = requests.get(pg_url)
        soup = BeautifulSoup(res.content, 'html.parser')

        table = soup.find('table', class_='table')
        for row in table.find_all('tr')[1:]:
            cells = row.find_all('td')

            rec = []
            for c in cells:
                rec.append(c.text.strip())
            all_recs.append(rec)
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f'The following error occurred: {res}')
  • Line [1] declares all_recs to capture all rows scraped from the NHL site.
  • Line [2] initiates a While Loop that runs while cur_page is less than or equal to total_pgs.
    • Line [3] configures the URL based on the web_url and cur_page variables.
    • Line [4] outputs the page currently being scraped to the terminal.
    • Line [5] sends a request for the current page. Without this line, the first page would be scraped on every pass of the loop.
    • Line [6] parses the current page's response into a new soup object.
    • Line [7] identifies and retrieves the <table> data for the page. In Python, HTML classes are referenced as class_='someclass' because class is a reserved keyword.
    • Line [8] initiates a for loop. This loop starts at the second (2nd) row, omitting the header row.
      • Line [9] locates all <td> tags inside the row.
      • Line [10] declares an empty list, rec.
      • Lines [11-13] loop through each cell (column <td>) in the row and append the data to the rec list. When the data has been added for the current row, rec is appended to all_recs.
    • Line [14] adds one (1) to the value of cur_page.
    • Line [15] pauses the script for two (2) seconds before the next page is requested.
    • The loop repeats until cur_page exceeds total_pgs.
  • Line [16] closes the open connection.
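
As a side note on design, Pandas can also parse HTML tables directly. The sketch below is an alternative approach (not the one used in this series) that gathers the same pages with pandas.read_html(); it assumes an HTML parser such as lxml is installed and reuses the web_url and total_pgs values from above:

frames = []
for pg in range(1, total_pgs + 1):
    # read_html() returns a list of DataFrames, one per <table> found on the page.
    frames.append(pd.read_html(f'{web_url}?page_num={pg}')[0])
    time.sleep(2)    # polite delay between requests
df_alt = pd.concat(frames, ignore_index=True)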

Export to CSV

Let’s see how we have done. With 24 pages containing 25 records per page, we should have a total of 600 rows (24 Γ— 25), or 601 if we include the header row.
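
Before exporting, a quick sanity check (a minimal sketch, run after the scraping loop above has finished) confirms the count:

print(f'Rows scraped: {len(all_recs)}')    # expected output: Rows scraped: 600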

Append the following code to the end of hockey.py and re-run to create a CSV file.

πŸ’‘ Note: This CSV file saves to the current working directory.

hdr_row = ['Team', 'Year', 'Wins', 'Losses', 'OTL', 'Win%', 'GF', 'GA', '+/-']
df = pd.DataFrame(all_recs, columns=hdr_row)
df.to_csv('teams.csv', index=False)
  • Line [1] creates a Header Row (hdr_row) as a list. This list contains the name of each column for the CSV file.
  • Line [2] creates a DataFrame based on the contents of all_recs[]. The hdr_row created above are the CSV headings.
  • Line [3] uses to_csv() to create a CSV file and save it to the current working directory.
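
To confirm the export, you can read the file back with Pandas (a minimal check, assuming teams.csv was saved to the current working directory):

df_check = pd.read_csv('teams.csv')
print(len(df_check))      # 600 data rows (the header row is not counted)
print(df_check.head(3))   # preview the first three records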

Let’s open up this CSV file and see what we have.

Wonderful! As expected, 600 rows of data plus a header row for a total of 601 rows!