Project Motivation
My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch.
Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that is easily sortable/filterable by scraping information from relevant websites.

There are a few websites that we use to determine whether a movie is age-appropriate, but one of our favorites is Kids-In-Mind, so I decided to start there. Kids-In-Mind provides a ranking from 0 (none) to 10 (extreme) for a movie’s sex, violence, and foul language content. I set out to pull all of these ratings and condense them into a single Excel sheet that I could sort and filter however I like.
What You Will Learn
This article is written for someone familiar with Python, but who is a beginner at web scraping. This HTML cheat sheet may be a helpful resource for quickly looking up different HTML tags.

In this article, you will learn how I:
- Came up with a plan for scraping data from Kids-In-Mind
- Examined the HTML for the relevant web pages
- Used BeautifulSoup to parse the HTML for movie rating information
- Handled variations in how pages were organized
- Used pandas to write the resulting data to a CSV file
In the rest of the article, I will abbreviate BeautifulSoup as bs4.
You can download the full script here: https://github.com/finxter/WebScrapeKidsMovies. I've also attached the full script at the end of this article, so keep reading! 👇
Planning the Scraping Approach
First things first: how should I get started? When I visited the Kids-In-Mind home page, I noticed that they have a link to an “A-Z Index.” Jackpot! I realized I could visit each “letter” page and either follow links or pull information to get the data I needed.

I was pleasantly surprised again when I visited the “A” page. The title, MPAA rating, year, and content ratings were all contained right there on the page! I decided to pull the HTML from each “letter” page and then parse that HTML to scrape information for each movie.
Clicking on the links to the “A” and “B” pages took me to the following URLs:

https://kids-in-mind.com/a.htm
https://kids-in-mind.com/b.htm

As you can see, simply exchanging the “a” for a “b” allowed me to navigate to each “letter” page on the site. This is how I decided to iterate through pages to pull information for all the movies on the site.
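Since every “letter” page follows this pattern, generating the full list of URLs is a one-liner. Here's a quick sketch using string.ascii_lowercase (which the final script also relies on):

import string

# Each "letter" page lives at https://kids-in-mind.com/<letter>.htm
letter_urls = [f"https://kids-in-mind.com/{letter}.htm"
               for letter in string.ascii_lowercase]
print(letter_urls[:2])
# ['https://kids-in-mind.com/a.htm', 'https://kids-in-mind.com/b.htm']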

To proceed, I still needed to figure out how each page was structured. I right-clicked on the first movie (Abandon) and selected the “Inspect” option (I’m using Google Chrome).
You can see that:
- The list of movies is contained within a <div> tag with the attribute class = "et_pb_text_inner" (1),
- the links and movie titles are each contained within an <a> tag (2), and
- the year and ratings are contained within text trailing each <a> tag (3).
💡 Note: Since I’m new to HTML, I initially thought the text with rating information was associated with each <a> tag. Upon closer inspection using BeautifulSoup, I found out that the text was actually associated with the <div> tag. You’ll see that in the code, further down.
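Here's a toy example (simplified markup, not the site's exact HTML) that illustrates the difference: the trailing text belongs to the <div>, not the <a>:

from bs4 import BeautifulSoup

# Simplified stand-in for one letter-page entry (not the site's exact HTML)
html = '<div class="et_pb_text_inner"><a href="/a/abandon.htm">Abandon</a> [2002] [PG-13] - 4.4.4</div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.a.getText())    # Abandon  (only the link text)
print(soup.div.getText())  # Abandon [2002] [PG-13] - 4.4.4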

In addition to the number of ratings for each content category, I also wanted to pull more detailed information about the sex content.
Since my kids are so young, sometimes even movies with low sex ratings can be inappropriate for them. For example, the movie might be aimed at a 10-year-old even though it is rated G with a sex rating of 1.
To get this content, I needed to follow each movie link to that movie’s page. I clicked on the “The Adventures of Rocky and Bullwinkle” link and used the “Inspect” tool to check out the HTML defining the movie’s “Sex/Nudity” section.
You can see:
- There is a <span> tag (2) nested inside of a <p> tag (1),
- the <span> tag contains the paragraph heading, “Sex/Nudity” (3), and
- the text (4) trails the <span> tag.

Now that I had visited a few relevant pages from the site and inspected the underlying HTML, I was able to define a general approach:
Scrape movie titles and ratings:
- Loop through each “letter” page and pull the HTML
- Use BeautifulSoup to find all <div> tags with class = "et_pb_text_inner"
- Determine which <div> tag contains the list of movies
- Get the text from the <div> tag and parse it for movie names and information
- Loop through each nested <a> tag and get the URL leading to each movie page (the value of the href attribute)
Scrape sexual content description:
- Follow the href attribute contained in each <a> tag (it contains the link to that movie’s page)
- Use BeautifulSoup to find all <p> tags
- Loop through the <p> tags until I find one that contains the text “SEX/NUDITY”
- Extract the text
Organize data and save it to a file:
- Build a dictionary containing keys for each piece of information (title, year, rating, etc.)
- Convert the dictionary to a pandas data frame
- Write the data frame to a CSV file
Scraping the Movie Titles and Ratings

The import statements needed for the code shown in this section are:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
I decided to call the main function scrape_kim_ratings(), and I gave it an input of all of the letter pages I wanted to scrape.
Next, I initialized the dictionary containing all the movie information, which would be converted to a pandas data frame.
The dictionary keys become the data frame column titles after conversion:
def scrape_kim_ratings(letters):
    movie_dict = {"title": [],
                  "year": [],
                  "mpaa": [],
                  "KIM sex": [],
                  "KIM violence": [],
                  "KIM language": [],
                  "KIM sex content": []}
Next, I defined a for loop to loop through each letter page and pull the HTML from each page using the requests.get() method. Once I had the HTML, I used BeautifulSoup to find all <div> tags with an attribute class = "et_pb_text_inner":
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
As it turns out, the letter pages contained multiple tags matching these criteria, so I had to figure out which tag contained the list of movies.
You’ll see that I looped through each of the div tags (for entry in div:), used the bs4 getText() method to pull the entry’s text, and looked to see if the text contained “Movie Reviews by Title.”
The next tag contained the list of movies – I had figured this out by inspecting the HTML of a few of the letter pages. In the code below, idx is the index of the tag containing the list of movies:
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
Next, I used the bs4 getText() method to get a string of all the text from the <div> tag with the list of movies. The object stored in div[idx] is an instance of the bs4.element.Tag class, which means we can think of it as a <div> tag that can be parsed and manipulated with bs4 functions and methods.
You can use Python’s type() function to determine this. I used the type() function heavily while I was figuring out how the bs4 functions worked and what their outputs were.
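For example, a couple of type() calls like these (assuming the soup object from the request above) make the structure clear:

# Assumes `soup` from the letter-page request above
div = soup.findAll("div", class_="et_pb_text_inner")
print(type(div))     # <class 'bs4.element.ResultSet'>
print(type(div[0]))  # <class 'bs4.element.Tag'>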
All the movies were separated by newline characters, so I used the split() method to get a list containing a different movie in each entry:
            # All movies on the page, separated by \n
            # (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
To be honest, at first, I didn’t know that all the movies were stored as text within the <div> tag. I thought I was going to have to pull the text from each <a> tag within the <div> tag.
However, using the PyCharm debugger to play around with div[idx], I discovered that pulling the text from the <div> tag provided me with the movie information.
Next, I needed to get the links that would take me to each movie page. I used the findAll() method to get all <a> tags and then used the urljoin() function to join the URL of the current “letter” web page (like https://kids-in-mind.com/a.htm) with the relative link to the movie page (like /a/abandon.htm).
An example result is https://kids-in-mind.com/a/abandon.htm. I used a list comprehension to put them all in a list, links:
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
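In isolation, urljoin() behaves like this:

from urllib.parse import urljoin

# Relative links on the letter pages resolve against the page URL
print(urljoin("https://kids-in-mind.com/a.htm", "/a/abandon.htm"))
# https://kids-in-mind.com/a/abandon.htm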
Now I had all of the movie rating information for a given letter page and all the links to the movie pages. The next steps were to:
- Parse each string in movies for each rating and other pieces of information
- Follow each link in links and parse the sexual content
To make it easier to loop through both lists at once, I used the zip() function:
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
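If zip() is new to you, it simply pairs up corresponding entries of the two lists. For example, with hypothetical values:

# Hypothetical values just to show the pairing
movies = ["Abandon [2002] [PG-13] - 4.4.4"]
links = ["https://kids-in-mind.com/a/abandon.htm"]
print(list(zip(movies, links)))
# [('Abandon [2002] [PG-13] - 4.4.4', 'https://kids-in-mind.com/a/abandon.htm')]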
Next, I looped through each movie and each link. First, I parsed the string in movie for the year, MPAA rating, Kids In Mind ratings, and the movie title using a function that I defined called parse_movie():
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
This function took a bit of trial and error to write.
At first, I thought all of the strings were formatted like "Abandon [2002] [PG-13] – 4.4.4".
However, after running the code once, I saw that some of the strings were formatted like this, "Abandon [Foreign Name] [2002] [PG-13] – 4.4.4", with an additional set of brackets containing the film’s name in a different language.
I had to add the code block at the very beginning of the function to skip over this set of brackets.
You can see that the two main functions I used were the string methods find() (to find the brackets) and split() (to isolate the Kids In Mind ratings).
The last tricky bit that gave me trouble was that sometimes the Kids In Mind ratings were separated by an en dash and other times by a regular hyphen:
def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used an en dash, sometimes a regular hyphen
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title
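To see parse_movie() in action, here is a quick check with a sample entry in the format described above (note the trailing space left in the title, since the slice runs up to the first bracket):

# Sample entry in the format used on the letter pages
print(parse_movie("Abandon [2002] [PG-13] – 4.4.4"))
# (2002, 'PG-13', [4, 4, 4], 'Abandon ')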
Scraping the Sexual Content Description
The additional import statements needed for the code in this section are:
import bs4.element
import random
After parsing movie, it was time to follow the link to the movie’s page and pull a more detailed description of sexual content using the function scrape_kim_sexcontent().
Since this was going to require making many “get” requests to the Kids In Mind website, I also added a variable time delay in between each request using the time.sleep() function. I did this for two reasons:
- It’s good practice to add some sort of delay between requests so that you do not overload the website’s server.
- Adding a bit of random variation to the time delays can trick the web server into thinking your web scraping script is a human, making it less likely to reject subsequent requests.
Code:
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
Scraping the detailed descriptions proved a bit trickier than getting the Kids In Mind ratings. As I mentioned above, I planned to use the bs4 object method findAll() to get all of the <p> tags and find the one that contained sexual content.
Below is the first iteration of my scrape_kim_sexcontent() function:
def scrape_kim_sexcontent(url):
    # Request html from page and find all p tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    p_set = soup.findAll("p")
    for entry in p_set:
        if 'SEX/NUDITY' in entry.text:
            sex_content = entry.text
            break
    return sex_content
However, I quickly realized that some of the movie pages were organized differently. The screenshot below shows a resulting CSV file. You can see that the script pulled a paragraph from the right side of the web page instead of the sexual content paragraph.


It turns out that some of the movie pages, like the one for Abominable, had the title and text “SEX/NUDITY” in an <h2> tag preceding the <p> tag that contained the detailed description.

To handle this variation, I added some code. The final version of scrape_kim_sexcontent() is below. First, I looked for all of the <h2> tags. Then I looped through them until I found one with an id attribute equal to “sex”. I used the bs4.element.Tag attribute attrs to access each tag’s attributes as a dictionary.
If you take another look at the Abominable page HTML, you can see that the <p> tag containing the sexual content details is at the same level as the preceding <h2> tag rather than being nested within it.
This means that the <p> tag is a sibling of the <h2> tag, not its child. Thus, I was able to access it using the bs4.element.Tag attribute next_siblings, which yields the siblings that follow the <h2> tag.
Finally, I used the bs4.element.Tag attribute text to get the paragraph I wanted:
def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph (the first sibling that is a tag)
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
            break
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content
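To make the sibling logic concrete, here is a toy example (simplified markup, not the real page) showing how next_siblings reaches the paragraph that follows the header:

import bs4.element
from bs4 import BeautifulSoup

# Simplified stand-in for the Abominable-style page layout
html = '<div><h2 id="sex">SEX/NUDITY 3</h2><p>Detailed description here...</p></div>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", id="sex")
for sibling in h2.next_siblings:
    if type(sibling) == bs4.element.Tag:  # skip any whitespace/text nodes
        print(sibling.text)               # Detailed description here...
        break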
Organizing the Data and Saving It to a File

The additional import statements needed for the code in this section are:
import pandas as pd
import string
Finally, it was time to organize the scraped data and save it to a CSV file.
I decided to use the pandas library since its to_csv data frame method makes it super easy to save data to a CSV file.
First, after parsing the information for each movie, I saved each piece of data in a dictionary. After each “letter” page was completed, I converted the growing dictionary to a pandas data frame using the pd.DataFrame() method and then saved the resulting data frame to a CSV file.
I decided to write to the CSV file after each “letter” page was completed to make sure that I would have data saved if the web scraping script was interrupted for some reason:
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies
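One small option worth knowing: by default, to_csv() also writes the data frame's integer index as the first column. If you don't want that in Movies.csv, you can pass index=False. A minimal sketch with toy data:

import pandas as pd

# Toy data standing in for the scraped dictionary
movie_dict = {"title": ["Abandon"], "year": [2002], "mpaa": ["PG-13"]}
df_movies = pd.DataFrame(movie_dict)
df_movies.to_csv("Movies.csv", index=False)  # index=False drops the row-number column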
Lastly, I called the main function scrape_kim_ratings() and provided a list of all the lowercase ASCII letters:
df_movies = scrape_kim_ratings(string.ascii_lowercase)
Conclusion

So, there you have it! Here is a link to the GitHub page with the full script: https://github.com/finxter/WebScrapeKidsMovies. I'll also attach it at the end of this article.
In the future, I think I will add functions to the script that will pull information from other websites and add them to the current database. I think I will also add a function that checks the websites for any new movies/ratings and adds them to the current database.
I hope this will inspire you to write your own web scraping script!
💡 Recommended: Basketball Statistics – Page Scraping Using Python and BeautifulSoup
The Script
import pandas as pd
import requests
from bs4 import BeautifulSoup
import bs4.element
import string
import time
from urllib.parse import urljoin
import random

def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph (the first sibling that is a tag)
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
            break
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content

def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used an en dash, sometimes a regular hyphen
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title

def scrape_kim_ratings(letters):
    movie_dict = {"title": [],
                  "year": [],
                  "mpaa": [],
                  "KIM sex": [],
                  "KIM violence": [],
                  "KIM language": [],
                  "KIM sex content": []}
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
            # All movies on the page, separated by \n (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies
df_movies = scrape_kim_ratings(string.ascii_lowercase)
