In this blog series, powerful Python libraries are leveraged to help uncover some hidden statistical truths in basketball. The first step in any data-driven approach is to identify and collect the data needed.
Luckily for us, Basketball-Reference.com hosts pages of basketball data that can be easily scraped. The process in this walkthrough can be applied to any number of their pages, but in this case, we plan on scraping the seasonal statistics of multiple rookie classes.
Project Overview
The Objectives:
- Identify the Data Source
- Download the Page
- Identify Important Page Elements
- Pre-Clean and Extract
- Archive
The Tools:
- Requests
- Beautiful Soup
- Pandas
Though we will inevitably be working with many specialized libraries throughout this project, the above packages will suffice for now.
Identifying the Data Source
Basketball-Reference.com hosts hundreds of curated pages on basketball statistics that range from seasonal averages of typical box score categories like points, rebounds, and shooting percentages, all the way down to the play-by-play action of each game played in the last 20 or so years. One can easily lose their way in this statistical tsunami if there isn’t a clear goal set on what exactly to look for.
The goal here in this post is simple: get rookie data that will help in assessing a young player’s true value and potential.
The following link is one such page. It lists all the relevant statistics of rookies in a particular season.
👉 Link: https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html
In order to accumulate enough data to make solid statistical inferences on players, one year of data won’t cut it. There need to be dozens of years’ worth of data collected to help filter through the noise and come to a conclusion on a player’s future potential.
If an action can be repeated manually, it's a great candidate for automation. In this case, the number in the URL above corresponds to the year of that rookie class. Armed with that knowledge, let's start putting together our first lines of code.
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

years = list(range(1990, 2017))
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"
```
In creating the two variables referenced above, our thought process is as follows.
- The appropriate packages are imported
- url_base stores the pre-formatted string of the target URL
- The years list specifies the desired range of years, 1990 through 2016 (note that range(1990, 2017) stops just short of 2017)
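As a quick check that the URL template and the year range fit together as intended (not part of the original script, just a sanity check), you can print a formatted URL before downloading anything:

```python
# confirm the placeholder is swapped for a given year
print(url_base.format(1990))
# https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html

# confirm the year range covers 1990 through 2016
print(years[0], years[-1])  # 1990 2016
```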
Downloading the Page Data
In scraping web pages, it's imperative to remove as much overhead as possible. Since the site serves all of its information directly in the page's HTML, each page can easily be downloaded and stored locally in its entirety.
```python
# iterate through each year and download the page into an HTML file
for year in years:
    url = url_base.format(year)
    data = requests.get(url)

    # the page is saved as HTML and placed in the Rookies folder
    with open("notebooks/Rookies/{}.html".format(year), "w+") as f:
        f.write(data.text)
```
The for loop iterates through the list variable years. The curly braces in the url_base string allow format() to substitute in the currently iterated year. On the first iteration, the url value is 'https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html'; on the second, the subsequent year is referenced instead (https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html).
The data variable holds the response returned by requests.get() for the currently iterated url, which is how the newly formatted URL string is used to retrieve the page in question. The subsequent with open() block opens a file in read/write mode (w+), writes the page text from the response (data.text), and stores the newly created HTML file locally.
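One optional tweak, my own suggestion rather than part of the original snippet, is to check the response's status code before writing, so a blocked or missing page doesn't overwrite a good local copy with an error page:

```python
# variation on the download loop that only writes successful responses
for year in years:
    url = url_base.format(year)
    data = requests.get(url)

    # 200 means the server returned the page successfully
    if data.status_code == 200:
        with open("notebooks/Rookies/{}.html".format(year), "w+") as f:
            f.write(data.text)
    else:
        print("Skipping {}: got status code {}".format(year, data.status_code))
```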
Why download the page and store it locally?
To avoid a common growing pain in site scraping, we store these pages as local HTML files.
See, when you visit a page, the server hosting it has to honor your request and send the appropriate data back to your browser. But one specific client asking for the same information over and over puts undue strain on the server.
The server admin is well within their rights to block these persistent requests for the sake of being able to optimally provide this service to others online.
By downloading these HTML files on your local machine, you avoid two things:
- Having to wait longer than usual to collect the same data
- Being blocked from visiting the page, halting data collection altogether
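If you ever do need to re-run the download, a little courtesy goes a long way. The sketch below is my own variation on the loop above, not part of the original walkthrough: it skips years that are already on disk and pauses between requests (the one-second delay is an arbitrary choice).

```python
import os
import time

for year in years:
    file_path = "notebooks/Rookies/{}.html".format(year)

    # skip years that have already been downloaded
    if os.path.exists(file_path):
        continue

    data = requests.get(url_base.format(year))
    with open(file_path, "w+") as f:
        f.write(data.text)

    # pause between requests to keep the load on the server light
    time.sleep(1)
```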
Identifying Important Page Elements
To scrape data elements of these recently downloaded pages using Python, there needs to be a means to understand what properties these HTML elements have. In order to identify these properties, we need to inspect the page itself.
How to Inspect
We’ll need to dive deeper into the inner workings of this document, but I promise I won’t make this an exercise on learning HTML.
If you know how to inspect HTML objects, feel free to jump ahead. Otherwise, please follow along on how to inspect page elements.
Option 1: Developer Tools
- Click on the three vertical dots on Chrome’s top menu bar
- Choose “More tools”
- Select Developer tools.
Option 2: Menu Select
- Right-click on the web page
- Choose “Inspect” to access the Developer tools panel
Inspecting the Page
Since all of these pages are stored locally, we can either open them from the file system in our browser of choice, or continue building our code with the following snippet.
with open("notebooks/Rookies/2000.html") as f: page = f.read()
Below is the loaded page with Developer Tools docked to the right. Notice how hovering the mouse cursor over the HTML line containing the ID rookies highlights the table element on the page?
All the desired data of this page is housed in that table element. Before hastily sucking up all of this data as is, now is the best time to consider whether everything on this table is worth collecting.
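If you would rather confirm this from Python than by hovering in Developer Tools, a quick check on the page string loaded above (a small sketch, using Beautiful Soup the same way the later code does) might look like this:

```python
from bs4 import BeautifulSoup

# "page" is the string read from notebooks/Rookies/2000.html above
soup = BeautifulSoup(page, "html.parser")

# the rookie statistics table carries the unique ID "rookies"
table = soup.find(id="rookies")

if table is not None:
    print("Found the rookies table with {} rows".format(len(table.find_all("tr"))))
else:
    print("No element with id='rookies' on this page")
```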
Pre-Clean
Pre-cleaning might not be a frequent word in your vocabulary, but if you see yourself scraping data regularly, it should be. If you want to avoid the frustration of hours of wasted progress on a data collection project, it's best to first separate the wheat from the chaff.
For instance, take note of the three elements boxed in red.
One row serves as the “main” table header. The other two rows are duplicate instances of the same artifacts found at the top. This pattern repeats every 20th row.
Upon further inspection, all of these rows share the same tr (table row) HTML tag. What distinguishes them from the other rows are their class names:
- Main header row: class = over_header
- Repeat header rows: class = over_header thead
- Statistics category row: class = thead
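These class names were read straight off the Developer Tools panel, but you can also verify them from code. The snippet below is an exploratory aside rather than part of the final script; it tallies the class attribute of every table row on the 2000 page loaded earlier:

```python
from collections import Counter
from bs4 import BeautifulSoup

# "page" is the string read from notebooks/Rookies/2000.html above
soup = BeautifulSoup(page, "html.parser")

# tally the class attribute of every <tr> on the page
row_classes = Counter(" ".join(tr.get("class", [])) for tr in soup.find_all("tr"))
print(row_classes)
# rows counted under the empty string are the ordinary player rows we want to keep
```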
```python
# array to house list of dataframes
dfs = []

# unnecessary table rows to be removed
classes = ["over_header", "over_header thead", "thead"]
```
- dfs will be used later on to house several dataframes
- The classes list will hold the class names of the unwanted table row elements
Since these elements provide no statistical value, rather than simply "skipping over" them in our parse, they should be omitted entirely; that is to say, permanently removed from any future consideration.
The decompose method serves to remove unwanted elements from a page. Per the official Beautiful Soup documentation:
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
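To see what decompose does in isolation before running it against the real pages, here is a minimal, self-contained example built around a made-up two-row table:

```python
from bs4 import BeautifulSoup

# a made-up table: one header row we want gone, one data row we want to keep
html = '<table><tr class="thead"><th>Player</th></tr><tr><td>Sample Player</td></tr></table>'
toy = BeautifulSoup(html, "html.parser")

# destroy every <tr> whose class matches "thead"
for tr in toy.find_all("tr", {"class": "thead"}):
    tr.decompose()

print(toy)
# <table><tr><td>Sample Player</td></tr></table>
```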
Below is a snippet of code where the decompose method is applied to every unwanted row using nested for loops.
```python
# for loop to iterate through the years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table
    # headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()
```
- The first for loop iterates through the values of our years list object
- The with block reads each locally stored HTML file into the page variable
- A parser is created by instantiating the BeautifulSoup class and passing in both the page string and html.parser
- The second for loop iterates through the values in the classes list
- The third for loop uses Beautiful Soup's find_all method to identify elements that have both tr tags and class names matching those in classes
- tr.decompose() omits each of the identified table row elements from the page entirely
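A quick way to convince yourself the clean-up worked, an optional check rather than a required step, is to confirm that no rows with those class names survive for a given year's soup:

```python
# run after the cleaning loop: no rows with the unwanted class names should remain
leftover = sum(len(soup.find_all("tr", {"class": i})) for i in classes)
print("Header rows remaining:", leftover)  # expect 0
```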
Let’s look to build on this by extracting the data we do want.
Extracting the Data
We can finally start working on the part of the code that actually extracts data from the table.
Remember that the table with all of the relevant data has the unique HTML ID rookies. The following additions to our code will parse the data out of this table.
```python
# the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    ### Start Scraping Block ###

    # identifies, scrapes, and loads rookie tables into one dataframe
    rookie_table = soup.find(id="rookies")
    rookies = pd.read_html(str(rookie_table))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)
```
For what follows ### Start Scraping Block ###:
- The rookie_table variable identifies this table, and only this table, on the page
- Since the Pandas package can read HTML tables, the rookie table is loaded into Pandas with the read_html method, passing rookie_table in as a string
- Tacking [0] onto the end pulls the single dataframe out of the list that read_html returns
- A "Year" column is added to the rookies dataframe
- dfs.append(rookies) collects the table of every rookie year, in the order iterated, into a list of dataframes
- The Pandas concat method combines that list of dataframes into one single dataframe: all_rookies
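Before archiving, a couple of quick sanity checks on all_rookies can save you from saving a half-scraped dataset. The exact column names depend on what pd.read_html pulls from the site, so only the Year column added above is assumed here:

```python
# overall size of the combined dataframe
print(all_rookies.shape)

# every season from 1990 through 2016 should be represented
print(all_rookies["Year"].nunique())  # expect 27

# peek at the first few rows
print(all_rookies.head())
```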
Archiving
Our final step involves taking all of this useful, clean information and archiving it in an easily readable CSV format. Tacking this line onto the end of our code (outside of any loops!) will come in handy whenever we decide to come back and reference the collected data.
```python
# dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
```
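When you return to this data in a later post, you can reload it straight from the archive instead of re-scraping. Since to_csv above also writes the dataframe's index as the first column, passing index_col=0 keeps it from showing up as an extra unnamed column:

```python
import pandas as pd

# reload the archived dataframe in a later session
all_rookies = pd.read_csv("archive/NBA_Rookies_1990-2016.csv", index_col=0)
```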
Final Product
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

# loop iterates through years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # second for loop clears unnecessary table headers
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    # identifies, scrapes, and loads rookie tables into one dataframe
    table_rookies = soup.find(id="rookies")
    rookies = pd.read_html(str(table_rookies))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)

# dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
```
Closing
Again, the process followed in this walkthrough will apply to nearly every other page on Basketball-Reference.com.
There are five simple steps worth taking in each instance.
- Identify the Page URL
- Download the Page
- Identify the Elements
- Pre-Clean and Extract
- Archive
Following these five steps will help guarantee a quick and successful scraping experience.
Next up in this series will be actually using this data to gain insight into future player potential. So be on the lookout for future installments!
We’ll share them here: