How to Scrape HTML Tables – Part 1

Story: This series of articles assumes you are a contractor hired by the NHL (National Hockey League) to produce a CSV file based on Team Stats from 1990-2011.

The data for this series is located on a live website in HTML table format.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS.


Part 1 focuses on:

  • Describing HTML Tables.
  • Reviewing the NHL website.
  • Understanding HTTP Status Codes.
  • Connecting to the NHL website using the requests library.
  • Viewing the HTML code.
  • Closing the Open Connection.

Part 2 focuses on:

  • Retrieving Total Number of Pages
  • Configuring the Page URL
  • Creating a While Loop to Navigate Pages

Part 3 focuses on:

  • Looping through the NHL web pages.
  • Scraping the data from each page.
  • Exporting the data to a CSV file.

Preparation

Before any data manipulation can occur, three (3) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The Requests library provides access to HTTP requests in Python.
  • The Beautiful Soup library enables data extraction from HTML and XML files.

To install these libraries, open a terminal in your IDE and run the commands below. In this example the command prompt is a dollar sign ($). Your terminal prompt may be different, and the prompt itself is not part of the command.

💡 Note: The time library is built-in and does not require installation.
It provides time.sleep(), which Part 3 uses to set a delay between page scrapes.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install requests

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install beautifulsoup4

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.
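If you would rather confirm the installations from Python, the check below is one possible sketch. It uses only the standard library's importlib.metadata module (available in Python 3.8 and later). The helper name installed_version is an assumption made for this illustration and is not part of any of the three libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# These are the pip package names used in the install commands above.
for name in ('pandas', 'requests', 'beautifulsoup4'):
    print(f'{name}: {installed_version(name) or "NOT INSTALLED"}')
```

Any package reported as NOT INSTALLED needs its pip install command run again.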


Feel free to view the PyCharm installation guides for the required libraries.


Add the following Required Starter Code to the top of each code snippet. These imports allow the code in this article to run error-free.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

What are HTML Tables?

HTML tables offer Web Designers/Developers a way to arrange data into rows and columns. HTML tables are similar to Excel spreadsheets.

HTML tables are made up of:

  • a table structure (<table></table>)
  • header cells (<th></th>), which together make up the heading row
  • any number of rows (<tr></tr>)
  • any number of data cells (<td></td>) per row, which form the columns

In HTML, tables are set up similar to the code below.

<table>
<tr>
   <th>col 1</th>
   <th>col 2</th>
</tr>
<tr>
    <td>data 1</td>
    <td>data 2</td>
</tr>
</table>

A table of this kind is located on the NHL website we will be scraping.

💡 Note: For additional information on HTML tables, click here.
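To see how the tags map to headings and rows, here is a minimal sketch that parses the two-column sample table from this section with Beautiful Soup. The table is stored in a string (with each header cell closed by </th>), so no website connection is needed. The variable names html, headers, and rows are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

# The sample table from this section, stored as a string.
html = """
<table>
<tr>
   <th>col 1</th>
   <th>col 2</th>
</tr>
<tr>
    <td>data 1</td>
    <td>data 2</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# <th> cells hold the column headings.
headers = [th.get_text(strip=True) for th in soup.find_all('th')]

# Each <tr> that contains <td> cells is one row of data.
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip the heading row, which has no <td> cells
        rows.append(cells)

print(headers)  # ['col 1', 'col 2']
print(rows)     # [['data 1', 'data 2']]
```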


Website Review

Let's navigate to the NHL website and review the format.

At first glance, you will notice:

  • the web page displays the NHL stats inside a formatted structure (an HTML table).
  • a pagination area at the bottom depicting:
    • page hyperlinks from 1-24.
    • a next page hyperlink (>>).
  • a Per Page (dropdown box) displaying 25 records per page (by default).

💡 Note: This series of articles uses the Google Chrome browser.


HTTP Response Codes

When you attempt to connect from your Python code to any URL, an HTTP Response Code returns, indicating the connection status.

This code can be any one of the following:

100–199    Informational responses
200–299    Successful responses
300–399    Redirection messages
400–499    Client error responses
500–599    Server error responses

💡 Note: To view a detailed list of HTTP Status Codes, click here.
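The five ranges above are determined by the first digit of the code. As a small sketch, the function below maps a code to its range, and the standard library's http.HTTPStatus supplies the standard phrase for each code. The function name status_class is an assumption made for this example.

```python
from http import HTTPStatus


def status_class(code):
    """Return the category of an HTTP status code, based on its first digit."""
    categories = {
        1: 'Informational response',
        2: 'Successful response',
        3: 'Redirection message',
        4: 'Client error response',
        5: 'Server error response',
    }
    return categories.get(code // 100, 'Unknown')


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. 'Not Found'.
    print(code, HTTPStatus(code).phrase, '->', status_class(code))
```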


Connect to NHL Website

Before any scraping can occur, we need to determine if we can successfully connect to this website. We do this using the requests library. If successful, an HTTP Status Code of 200 returns.

Let's try running this code by performing the following steps:

  • Open an IDE terminal.
  • Create a new Python file (example: hockey.py).
  • Copy and paste the code below into this file.
  • Save and run this file.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
print(res)

  • Line [1] assigns the NHL's website URL to the web_url variable.
  • Line [2] attempts to connect to the NHL's website using the requests.get() method. A Response object, which carries the HTTP Status Code, returns and saves to the res variable.
  • Line [3] outputs the contents of the res variable to the terminal.

Output:

<Response [200]>

Great news! The connection to the NHL website works!

💡 Note: You may want to remove Line [3] before continuing.
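Printing res shows the status code inside angle brackets, but the number itself is also available as res.status_code, and res.ok reports whether the request succeeded. The sketch below illustrates these attributes without touching the network. It builds bare Response objects by hand, which is done here for illustration only, and the helper name make_response is an assumption.

```python
import requests


def make_response(status_code):
    """Build a bare Response by hand, for illustration only (no network call)."""
    res = requests.Response()
    res.status_code = status_code
    return res


for code in (200, 301, 404, 500):
    res = make_response(code)
    # res.ok and bool(res) are False for client (4xx) and server (5xx) errors.
    print(res, res.status_code, res.ok, bool(res))
```

Because bool(res) follows res.ok, a Response can be used directly in an if statement to check for success.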


HTML Code Overview

The next step is to view the HTML code. This step enables us to locate specific HTML elements/tags we need to scrape the data.

There are two (2) ways to perform this task:

  1. Run Python code to send the HTML code to the terminal window and locate the required information by scrolling through the HTML code.
  2. Display the HTML code in the current browser window and use the Inspect tool to locate the required information.

View HTML Code in Terminal

To view the HTML code in a terminal window, navigate to an IDE, and run the following code:

💡 Note: Remember to add in the Required Starter Code.

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    print(soup.prettify())
else:
    print(f'The following error occurred: {res}')

  • Line [1] initiates an if statement. If the request succeeded (res evaluates to True for a success code such as 200), the code inside this statement executes.
    • Line [2] parses the HTML code of the web page URL (web_url) created earlier and saves it to the soup variable.
    • Line [3] outputs the prettified version of the HTML code to the terminal.
  • Lines [4-5] execute if the request failed (res evaluates to False for client and server error codes, 400 and above).

💡 Note: You may want to remove Line [3] before continuing.

Output:

After running the above code, the visible area of the HTML code in the terminal is the bottom portion denoted by the </html> tag.

💡 Note: Scroll up to peruse the entire HTML code.
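If scrolling through the full output is inconvenient, one option is to print only the first few lines of the prettified HTML. The sketch below assumes a hypothetical helper named preview and uses a tiny stand-in page so it runs without a connection. With the code from this section, res.content would be passed in instead.

```python
from bs4 import BeautifulSoup


def preview(html, max_lines=15):
    """Return the first max_lines lines of the prettified HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = soup.prettify().splitlines()
    return '\n'.join(lines[:max_lines])


# A tiny stand-in page; with the code above you would pass res.content instead.
sample = '<html><body><table><tr><td>data 1</td></tr></table></body></html>'
print(preview(sample, max_lines=5))
```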


View HTML Code in Browser

To view the HTML code in a browser, perform the following steps:

  • Open a browser and navigate to the NHL website.
  • Right-click any blank area of the page to display a pop-up menu.
  • Click to select the Inspect menu item.

The HTML code displays on the right-hand side of the browser window.

In this instance, the top portion of the HTML code is visible, as denoted by the <!DOCTYPE HTML> declaration.

Part 2 delves deeper into accessing specific elements/tags now that you are familiar with how to view HTML code.

💡 Note: If you are familiar with HTML and CSS, option one (1) may best suit your needs.


Close the Connection

In the code above, a connection to the NHL website was established and opened. Once the HTML code has been retrieved, this connection needs to be closed.

An additional line of code, res.close(), is added to do this.

web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    res.close()
else:
    print(f'The following error occurred: {res}')

💡 Note: If successful, a connection is made from the Python code to the NHL website. Remember to close this connection when not in use.
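As an alternative to calling res.close() by hand, a Response can be used in a with statement, which closes it automatically when the block ends. The function below is a sketch of that approach, not the method this series uses. The name fetch_soup, the get parameter, and the 10-second timeout are assumptions made for this example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Return the parsed page, or None if the request fails.

    The get parameter is a hypothetical hook so the function can be tried
    without a network connection; by default it is requests.get.
    """
    try:
        # Leaving the with block closes the response, like res.close().
        with get(url, timeout=timeout) as res:
            if res:
                return BeautifulSoup(res.content, 'html.parser')
            print(f'The following error occurred: {res}')
    except requests.RequestException as err:
        # Raised for problems such as a malformed URL, a timeout, or no network.
        print(f'The request could not be completed: {err}')
    return None


# A URL without a scheme fails before any connection is attempted.
print(fetch_soup('scrapethissite.com/pages/forms/'))
```

Calling fetch_soup(web_url) with the full URL from this article would return the soup object on success and leave no connection open.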


Summary

In this article, you learned how to:

  • Review the NHL website.
  • Understand HTTP Status Codes.
  • Connect to the NHL website using the requests library.
  • View HTML code in an IDE.
  • View HTML code in a Web Browser.
  • Close the open connection.

What's Next

In Part 2 of this series, you will learn to identify elements/tags inside HTML code to create a web scraping app.