Project Motivation
My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch.
Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that is easily sortable/filterable by scraping information from relevant websites.
There are a few websites that we use to determine whether a movie is age-appropriate, but one of our favorites is Kids-In-Mind, so I decided to start there. Kids-In-Mind provides a ranking from 0 (none) to 10 (extreme) for a movie’s sex, violence, and foul language content. I set out to pull all of these ratings and condense them into a single Excel sheet that I could sort and filter however I like.
What You Will Learn
This article is written for someone familiar with Python, but who is a beginner at web scraping. This HTML cheat sheet may be a helpful resource for quickly looking up different HTML tags.
In this article, you will learn how I:
- Came up with a plan for scraping data from Kids-In-Mind
- Examined the HTML for the relevant web pages
- Used BeautifulSoup to parse the HTML for movie rating information
- Handled variations in how pages were organized
- Used pandas to write the resulting data to a CSV file
In the rest of the article, I will abbreviate BeautifulSoup as bs4.
You can download the full script here https://github.com/finxter/WebScrapeKidsMovies. I also attach the full script to the end of this page, so keep reading! 👇
Planning the Scraping Approach
First things first, how should I get started? When I visited the Kids-In-Mind home page, I noticed that they have a link to an “A-Z Index.” Jackpot! I realized I could visit each “letter” page and either follow links or pull information to get the data I needed.
I was pleasantly surprised again when I visited the “A” page. The title, MPAA rating, year, and content ratings were all contained right there on the page! I decided to pull the HTML from each “letter” page and then parse that HTML to scrape information for each movie.
Clicking on the links to the “A” and “B” pages took me to the following URLs:

https://kids-in-mind.com/a.htm
https://kids-in-mind.com/b.htm
As you can see, simply exchanging the “a” for the “b” allowed me to navigate to each “letter” page on the site. This is how I decided to iterate through pages to pull information for all the movies on the site.
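This URL pattern is easy to capture in code. Here is a small sketch (using the same f-string pattern that appears in the script later) that builds all 26 “letter” page URLs:

```python
import string

# One URL per "letter" page, a.htm through z.htm
urls = [f"https://kids-in-mind.com/{letter}.htm" for letter in string.ascii_lowercase]
print(urls[0])  # https://kids-in-mind.com/a.htm
print(urls[1])  # https://kids-in-mind.com/b.htm
```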
To proceed, I still needed to figure out how each page was structured. I right-clicked on the first movie (Abandon) and selected the “Inspect” option (I’m using Google Chrome).
You can see that:
- The list of movies is contained within a `<div>` tag with the attribute `class = "et_pb_text_inner"` (1),
- the link and movie titles are each contained within an `<a>` tag (2), and
- the year and ratings are contained in the text trailing each `<a>` tag (3).
💡 Note: Since I’m new to HTML, I initially thought the text with rating information was associated with each `<a>` tag. Upon closer inspection using BeautifulSoup, I found out that the text was actually associated with the `<div>` tag. You’ll see that in the code, further down.
In addition to the number of ratings for each content category, I also wanted to pull more detailed information about the sex content.
Since my kids are so young, sometimes even movies with low sex ratings can be inappropriate for them. For example, the movie might be aimed at a 10-year-old even though it is rated G with a sex rating of 1.
To get this content, I needed to follow each movie link to that movie’s page. I clicked on the “The Adventures of Rocky and Bullwinkle” link and used the “Inspect” tool to check out the HTML defining the movie’s “Sex/Nudity” section.
You can see:
- There is a `<span>` tag (2) nested inside of a `<p>` tag (1),
- the `<span>` tag contains the paragraph heading, “Sex/Nudity” (3),
- and the text (4) trails the `<span>` tag.
Now that I had visited a few relevant pages from the site and inspected the underlying HTML, I was able to define a general approach:
Scrape movie titles and ratings:
- Loop through each “letter” page and pull the HTML
- Use BeautifulSoup to find all `<div>` tags with `class = "et_pb_text_inner"`
- Determine which `<div>` tag contains the list of movies
- Get the text from the `<div>` tag and parse it for movie names and information
- Loop through each nested `<a>` tag and get the URL leading to each movie page (the value of the `href` attribute)
Scrape sexual content description:
- Follow the `href` attribute contained in each `<a>` tag (it holds the link to that movie’s page)
- Use BeautifulSoup to find all `<p>` tags
- Loop through the `<p>` tags until I find one that contains the text “SEX/NUDITY”
- Extract the text
Organize data and save it to a file:
- Build a dictionary containing keys for each piece of information (title, year, rating, etc.)
- Convert the dictionary to a pandas data frame
- Write the data frame to a CSV file
Scraping the Movie Titles and Ratings
The `import` statements needed for the code shown in this section are:

```python
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
```
I decided to call the main function `scrape_kim_ratings()`, and I gave it an input of all of the letter pages I wanted to scrape.
Next, I initialized the dictionary containing all the movie information, which would be converted to a pandas data frame.
The dictionary keys become the data frame column titles after conversion:
```python
def scrape_kim_ratings(letters):
    movie_dict = {"title": [], "year": [], "mpaa": [],
                  "KIM sex": [], "KIM violence": [],
                  "KIM language": [], "KIM sex content": []}
```
Next, I defined a for loop to loop through each letter page and pull the HTML from each page using the `requests.get()` method. Once I had the HTML, I used BeautifulSoup to find all `<div>` tags with the attribute `class = "et_pb_text_inner"`:
```python
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
```
As it turns out, the letter pages contained multiple tags matching these criteria, so I had to figure out which tag contained the list of movies.
You’ll see that I looped through each of the `div` tags (`for entry in div:`), used the bs4 `getText()` method to pull the entry’s text, and looked to see if the text contained “Movie Reviews by Title.” The next tag contained the list of movies – I had figured this out by inspecting the HTML of a few of the letter pages. In the code below, `idx` is the index of the tag containing the list of movies:
```python
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
```
Next, I used the bs4 `getText()` method to get a string of all the text from the `<div>` tag with the list of movies. The object stored in `div[idx]` is an instance of the `bs4.element.Tag` class, which means we can think of it as a `<div>` tag that can be parsed and manipulated with bs4 functions and methods.
You can use Python’s `type()` function to determine this. I used `type()` heavily while I was figuring out how the bs4 functions worked and what their outputs were.
All the movies were separated by newline characters, so I used the `split()` method to get a list containing a different movie in each entry:
```python
            # All movies on the page, separated by \n
            # (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
```
To be honest, at first I didn’t know that all the movies were stored as text within the `<div>` tag. I thought I was going to have to pull the text from each `<a>` tag within the `<div>` tag.
However, using the PyCharm debugger to play around with `div[idx]`, I discovered that pulling the text from the `<div>` tag provided me with the movie information.
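Both points can be demonstrated on a made-up HTML fragment shaped like the list on a letter page (this snippet is not from the real site): the object bs4 returns is a `bs4.element.Tag`, and its text includes the rating information that trails each `<a>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one entry on a letter page
html = '<div class="et_pb_text_inner"><a href="/a/abandon.htm">Abandon</a> [2002] [PG-13] - 4.4.4</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.findAll("div", class_="et_pb_text_inner")

print(type(div[0]))      # <class 'bs4.element.Tag'>
print(div[0].getText())  # Abandon [2002] [PG-13] - 4.4.4
```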
Next, I needed to get the links that would take me to each movie page. I used the `findAll()` method to get all `<a>` tags and then used the `urljoin()` function to join the URL of the current “letter” web page (like https://kids-in-mind.com/a.htm) with the relative link to the movie page (like /a/abandon.htm). An example result is https://kids-in-mind.com/a/abandon.htm. I used a list comprehension to put them all in a list, `links`:
```python
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
```
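In isolation, `urljoin()` resolves the site’s root-relative links against whichever letter page they appeared on:

```python
from urllib.parse import urljoin

# A root-relative href resolves against the scheme and host of the base URL
full_url = urljoin("https://kids-in-mind.com/a.htm", "/a/abandon.htm")
print(full_url)  # https://kids-in-mind.com/a/abandon.htm
```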
Now I had all of the movie rating information for a given letter page and all the links to the movie pages. The next steps were to:
- Parse each string in `movies` for each rating and other pieces of information
- Follow each link in `links` and parse the sexual content
To make it easier to loop through both lists at once, I used the `zip()` function:
```python
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
```
Next, I looped through each `movie` and each `link`. First, I parsed the string in `movie` for the year, MPAA rating, Kids-In-Mind ratings, and the movie title using a function that I defined called `parse_movie()`:
```python
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
```
This function took a bit of trial and error to write.
At first, I thought all of the strings were formatted like `"Abandon [2002] [PG-13] – 4.4.4"`. However, after running the code once, I saw that some of the strings were formatted like `"Abandon [Foreign Name] [2002] [PG-13] – 4.4.4"`, with an additional set of brackets containing the film’s name in a different language.
I had to add the code block at the very beginning of the function to skip over this set of brackets.
You can see that the two main string methods I used were `find()` (to find the brackets) and `split()` (to isolate the Kids-In-Mind ratings). The last tricky bit that gave me trouble was that sometimes the Kids-In-Mind ratings were separated by a plain hyphen and other times by an en dash:
```python
def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used a dash, sometimes an en dash
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title
```
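To sanity-check the parsing logic, here is the same function applied to a made-up entry in the simple format (no foreign-name brackets, hyphen-separated ratings):

```python
def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is in the first set of brackets, the MPAA rating in the second
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # ratings trail a dash; try the en dash first, then a plain hyphen
    ratings_split = movie.split("–")
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title

# A made-up entry in the simple format
year, mpaa, ratings, title = parse_movie("Abandon [2002] [PG-13] - 4.4.4")
print(year, mpaa, ratings, title.strip())  # 2002 PG-13 [4, 4, 4] Abandon
```

Note that splitting `"PG-13"` on the hyphen is harmless because only the last element of the split, which holds the ratings, is used.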
Scraping the Sexual Content Description
The additional import statements needed for the code in this section are:
```python
import bs4.element
import time
import random
```
After parsing `movie`, it was time to follow the link to the movie’s page and pull a more detailed description of the sexual content using the function `scrape_kim_sexcontent()`.
Since this was going to require making many GET requests to the Kids-In-Mind website, I also added a variable time delay between each request using the `time.sleep()` function. I did this for two reasons:
- It’s good practice to add some sort of delay between requests so that you do not overload the website’s server.
- Adding a bit of random variation to the time delays can trick the web server into thinking your web scraping script is a human, making it less likely to reject subsequent requests.
Code:

```python
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
```
Scraping the detailed descriptions proved a bit trickier than getting the Kids-In-Mind ratings. As I mentioned above, I planned to use the bs4 method `findAll()` to get all of the `<p>` tags and find the one that contained the sexual content.
Below is the first iteration of my `scrape_kim_sexcontent()` function:
```python
def scrape_kim_sexcontent(url):
    # Request html from page and find all p tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    p_set = soup.findAll("p")
    for entry in p_set:
        if 'SEX/NUDITY' in entry.text:
            sex_content = entry.text
            break
    return sex_content
```
However, I quickly realized that some of the movie pages were organized differently. The screenshot below shows a resulting CSV file. You can see that the script pulled a paragraph from the right side of the web page instead of the sexual content paragraph.
It turns out that some of the movie pages, like the one for Abominable, had the title and the text “SEX/NUDITY” in an `<h2>` tag preceding the `<p>` tag that contained the detailed description.
To handle this variation, I added some code. The final version of `scrape_kim_sexcontent()` is below. First, I looked for all of the `<h2>` tags. Then I looped through them until I found one with an `id` attribute equal to “sex”. I used the `bs4.element.Tag` attribute `attrs` to access each tag’s attributes as a dictionary.
If you take another look at the Abominable page HTML, you can see that the `<p>` tag containing the sexual content details is at the same level as the preceding `<h2>` tag rather than being nested within it.
This means that the `<p>` tag is a sibling of the `<h2>` tag, not its child. Thus, I was able to access it using the `bs4.element.Tag` attribute `next_siblings`, which yields the siblings that follow the `<h2>` tag.
Finally, I used the `bs4.element.Tag` attribute `text` to get the paragraph I wanted:
```python
def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content
```
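To see the sibling lookup in isolation, here is a minimal sketch against a made-up HTML snippet shaped like the Abominable page (the snippet itself is not from the real site):

```python
import bs4.element
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the pages that use <h2> section headers
html = '<h2 id="sex">SEX/NUDITY 3</h2><p>Some detailed description.</p>'
soup = BeautifulSoup(html, "html.parser")

sex_content = ""
for entry in soup.findAll("h2"):
    if entry.attrs.get("id") == "sex":
        # the <p> tag is a sibling of the <h2>, not a child
        for sibling in entry.next_siblings:
            if type(sibling) == bs4.element.Tag:
                sex_content = sibling.text
                break

print(sex_content)  # Some detailed description.
```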
Organize Data and Save it to a File
The additional import statements needed for the code in this section are:
```python
import pandas as pd
import string
```
Finally, it was time to organize the scraped data and save it to a CSV file.
I decided to use the pandas library since its `to_csv()` data frame method makes it super easy to save data to a CSV file.
First, after parsing the information for each movie, I saved each piece of data in a dictionary. After each “letter” page was completed, I converted the growing dictionary to a pandas data frame using the `pd.DataFrame()` constructor and then saved the resulting data frame to a CSV file.
I decided to write to the CSV file after each “letter” page was completed to make sure that I would have data saved if the web scraping script was interrupted for some reason:
```python
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies
```
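The dictionary-to-CSV step can be sketched in isolation with made-up data; the keys match the `movie_dict` keys used in the script:

```python
import pandas as pd

# One made-up row in the same shape as movie_dict
movie_dict = {"title": ["Abandon"], "year": [2002], "mpaa": ["PG-13"],
              "KIM sex": [4], "KIM violence": [4], "KIM language": [4],
              "KIM sex content": ["(description)"]}
df_movies = pd.DataFrame(movie_dict)
df_movies.to_csv("Movies.csv")
print(df_movies.shape)  # (1, 7)
```

Each dictionary key becomes a column, and each list holds that column’s values, so appending to the lists grows the table one movie at a time.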
Lastly, I called the main function `scrape_kim_ratings()` and provided a string of all the lowercase ASCII letters:
```python
df_movies = scrape_kim_ratings(string.ascii_lowercase)
```
Conclusion
So, there you have it! Here is a link to the GitHub page with the full script https://github.com/finxter/WebScrapeKidsMovies. I’ll also attach it at the end of this article.
In the future, I think I will add functions to the script that pull information from other websites and add it to the current database, as well as a function that checks the websites for any new movies/ratings and adds them.
I hope this will inspire you to write your own web scraping script!
💡 Recommended: Basketball Statistics – Page Scraping Using Python and BeautifulSoup
The Script
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import bs4.element
import string
import time
from urllib.parse import urljoin
import random


def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content


def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used a dash, sometimes an en dash
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title


def scrape_kim_ratings(letters):
    movie_dict = {"title": [], "year": [], "mpaa": [],
                  "KIM sex": [], "KIM violence": [],
                  "KIM language": [], "KIM sex content": []}
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
            # All movies on the page, separated by \n
            # (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies


df_movies = scrape_kim_ratings(string.ascii_lowercase)
```