Story: This series of articles assumes you are a contractor hired by the NHL (National Hockey League) to produce a CSV file based on Team Stats from 1990-2011.
The data for this series is located on a live website in HTML table format.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS.
Part 1 focused on:
- Describing HTML Tables.
- Reviewing the NHL website.
- Understanding HTTP Status Codes.
- Connecting to the NHL website using the `requests` library.
- Viewing the HTML code.
- Closing the Open Connection.
Part 2 focused on:
- Retrieving the Total Number of Pages.
- Configuring the Page URL.
- Creating a While Loop to Navigate Pages.
Part 3 focuses on:
- Looping through the NHL web pages.
- Scraping the data from each page.
- Exporting the data to a CSV file.
This article assumes you have installed the following libraries from Part 1:
- The Pandas library.
- The Requests library.
- The Beautiful Soup library.
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
```
Overview
This article builds on the Python file (`hockey.py`) created in Part 1 and updated in Part 2 (see below).
If you require clarification on the code lines below, click here to navigate to Part 2 of this series.
```python
web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())

    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'
        cur_page += 1

    res.close()
else:
    print(f'The following error occurred: {res}')
```
Retrieve Table Data
The final piece of information we need to retrieve is the data wrapped inside the HTML tables on the NHL website.
Let’s start by performing the following steps:
- Navigate to the home page of the NHL website.
- With the mouse, hover over the top part of the table (`Team Name`).
- Right-click to display a pop-up menu.
- Click to select `Inspect`. This option opens the HTML code window to the right of the browser window.
With the HTML code in view (on the right), hover over the `<table>` tag. This highlights the corresponding table on the left.

`<table class="table">`

The `<table>` tag includes a reference to a class (`<table class="table">`). In HTML, the class attribute identifies an element so it can be styled or selected. We will reference this class in our Python code.
Now we need to write some Python code to access and loop through each element/tag of the table data.
💡 Note: Click here for a detailed explanation of the HTML class.
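To see how this class reference translates into Beautiful Soup syntax before tackling the full script, here is a minimal, self-contained sketch (the one-row HTML string is hypothetical, not the NHL site's actual markup):

```python
from bs4 import BeautifulSoup

# A minimal, made-up HTML snippet containing a table with a class attribute.
html = '<table class="table"><tr><td>Boston Bruins</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# Beautiful Soup matches the HTML class attribute with the class_ keyword;
# the trailing underscore avoids a clash with Python's reserved word "class".
table = soup.find('table', class_='table')
print(table.td.text)  # Boston Bruins
```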
The code below puts together everything you will need to scrape the NHL site.
The key code lines are described below.
```python
web_url = 'https://scrapethissite.com/pages/forms/'
res = requests.get(web_url)
all_recs = []
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int([li.text for li in soup.find_all('li')][-2].strip())

    while cur_page <= total_pgs:
        pg_url = f'{web_url}?page_num={cur_page}'
        print(f'Scraping: {pg_url}')

        # Fetch and parse the current page; without this request,
        # every iteration would re-scrape the first page.
        pg_res = requests.get(pg_url)
        soup = BeautifulSoup(pg_res.content, 'html.parser')

        table = soup.find('table', class_='table')
        for tr in table.find_all('tr')[1:]:
            cells = tr.find_all('td')
            row = []
            for c in cells:
                row.append(c.text.strip())
            all_recs.append(row)

        pg_res.close()
        cur_page += 1
        time.sleep(2)

    res.close()
else:
    print(f'The following error occurred: {res}')
```
- Line [1] declares `all_recs` to capture all rows scraped from the NHL site.
- Line [2] initiates a While Loop that repeats until `cur_page` exceeds `total_pgs`.
- Line [3] configures the URL based on the `web_url` and `cur_page` variables.
- Line [4] outputs the page currently being scraped to the terminal.
- Lines [5-6] request the current page and parse its HTML. Without this request, every iteration would re-scrape the first page.
- Line [7] identifies and retrieves the `<table>` data for the page. In Python, HTML classes are referenced as `class_='someclass'`.
- Line [8] initiates a `for` loop over the table rows. This loop starts at the second (2nd) row, omitting the header row.
- Line [9] locates all `<td>` tags inside the current row.
- Line [10] declares an empty list, `row[]`.
- Lines [11-13] loop through each cell (column `<td>`) and append the data to the `row[]` list. When the data has been added for the current row, `row[]` is appended to `all_recs[]`.
- Line [14] closes the connection for the current page.
- Line [15] adds one (1) to the value of `cur_page`.
- Line [16] delays the execution of the script for two (2) seconds.
- The loop repeats until `cur_page` exceeds `total_pgs`.
- Line [17] closes the open connection.
Export to CSV
Let’s see how we have done. With 24 pages containing 25 records per page, we should have a total of 600 rows, or 601 if we include the header row.
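Before exporting, an optional sanity check (assuming the scraping loop above has completed) confirms the count:

```python
# With 24 pages of 25 records each, this should print 600.
print(len(all_recs))
```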
Append the following code to the end of `hockey.py` and re-run to create a CSV file.
💡 Note: This CSV file saves to the current working directory.
```python
hdr_row = ['Team', 'Year', 'Wins', 'Losses', 'OTL', 'Win', 'GF', 'GA', '+/-']
df = pd.DataFrame(all_recs, columns=hdr_row)
df.to_csv('teams.csv', index=False)
```
- Line [1] creates a Header Row (`hdr_row`) as a list. This list contains the name of each column for the CSV file.
- Line [2] creates a DataFrame based on the contents of `all_recs[]`. The `hdr_row` created above supplies the CSV headings.
- Line [3] uses `to_csv()` to create a CSV file and save it to the current working directory.
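As an optional verification, you can read the file back in with pandas and check its shape:

```python
# shape should be (600, 9): 600 data rows across 9 columns
# (the header row is not counted as a data row).
df_check = pd.read_csv('teams.csv')
print(df_check.shape)
```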
Let’s open up this CSV file and see what we have.
Wonderful! As expected, 600 rows of data plus a header row for a total of 601 rows!