Project Motivation
My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch.
Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that is easily sortable/filterable by scraping information from relevant websites.
There are a few websites that we use to determine whether a movie is age-appropriate, but one of our favorites is Kids-In-Mind, so I decided to start there. Kids-In-Mind provides a ranking from 0 (none) to 10 (extreme) for a movie’s sex, violence, and foul language content. I set out to pull all of these ratings and condense them into a single Excel sheet that I could sort and filter however I like.
What You Will Learn
This article is written for someone familiar with Python, but who is a beginner at web scraping. This HTML cheat sheet may be a helpful resource for quickly looking up different HTML tags.
In this article, you will learn how I:
- Came up with a plan for scraping data from Kids-In-Mind
- Examined the HTML for the relevant web pages
- Used BeautifulSoup to parse the HTML for movie rating information
- Handled variations in how pages were organized
- Used pandas to write the resulting data to a CSV file
In the rest of the article, I will abbreviate BeautifulSoup as bs4.
You can download the full script here https://github.com/finxter/WebScrapeKidsMovies. I also attach the full script to the end of this page, so keep reading! 👇
Planning the Scraping Approach
First things first, how should I get started? When I visited the Kids-In-Mind home page, I noticed that they have a link to an “A-Z Index.” Jackpot! I realized I could visit each “letter” page and either follow links or pull information to get the data I needed.
I was pleasantly surprised again when I visited the “A” page. The title, MPAA rating, year, and content ratings were all contained right there on the page! I decided to pull the HTML from each “letter” page and then parse that HTML to scrape information for each movie.
Clicking on the links to the “A” and “B” pages took me to the following URLs:

https://kids-in-mind.com/a.htm
https://kids-in-mind.com/b.htm
As you can see, simply exchanging the “a” for the “b” allowed me to navigate to each “letter” page on the site. This is how I decided to iterate through pages to pull information for all the movies on the site.
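This URL pattern is easy to capture in code. Here is a small sketch (using the same f-string pattern that appears in the script later) that builds all 26 “letter” page URLs:

```python
import string

# One URL per "letter" page, a.htm through z.htm
urls = [f"https://kids-in-mind.com/{letter}.htm" for letter in string.ascii_lowercase]
print(urls[0])  # https://kids-in-mind.com/a.htm
print(urls[1])  # https://kids-in-mind.com/b.htm
```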
To proceed, I still needed to figure out how each page was structured. I right-clicked on the first movie (Abandon) and selected the “Inspect” option (I’m using Google Chrome).
You can see that:
- The list of movies is contained within a `<div>` tag with the attribute `class = "et_pb_text_inner"` (1),
- the link and movie titles are each contained within an `<a>` tag (2), and
- the year and ratings are contained in the text trailing each `<a>` tag (3).
💡 Note: Since I’m new to HTML, I initially thought the text with rating information was associated with each `<a>` tag. Upon closer inspection using BeautifulSoup, I found out that the text was actually associated with the `<div>` tag. You’ll see that in the code, further down.
In addition to the number of ratings for each content category, I also wanted to pull more detailed information about the sex content.
Since my kids are so young, sometimes even movies with low sex ratings can be inappropriate for them. For example, the movie might be aimed at a 10-year-old even though it is rated G with a sex rating of 1.
To get this content, I needed to follow each movie link to that movie’s page. I clicked on the “The Adventures of Rocky and Bullwinkle” link and used the “Inspect” tool to check out the HTML defining the movie’s “Sex/Nudity” section.
You can see:
- There is a `<span>` tag (2) nested inside of a `<p>` tag (1),
- the `<span>` tag contains the paragraph heading, “Sex/Nudity” (3),
- and the text (4) trails the `<span>` tag.
Now that I had visited a few relevant pages from the site and inspected the underlying HTML, I was able to define a general approach:
Scrape movie titles and ratings:
- Loop through each “letter” page and pull the HTML
- Use BeautifulSoup to find all `<div>` tags with `class = "et_pb_text_inner"`
- Determine which `<div>` tag contains the list of movies
- Get the text from the `<div>` tag and parse it for movie names and information
- Loop through each nested `<a>` tag and get the URL leading to each movie page (the value of the `href` attribute)
Scrape sexual content description:
- Follow the `href` attribute contained in each `<a>` tag (it holds the link to that movie’s page)
- Use BeautifulSoup to find all `<p>` tags
- Loop through the `<p>` tags until I find one that contains the text “SEX/NUDITY”
- Extract the text
Organize data and save it to a file:
- Build a dictionary containing keys for each piece of information (title, year, rating, etc.)
- Convert the dictionary to a pandas data frame
- Write the data frame to a CSV file
Scraping the Movie Titles and Ratings
The `import` statements needed for the code shown in this section are:

```python
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
```
I decided to call the main function `scrape_kim_ratings()`, and I gave it an input of all of the letter pages I wanted to scrape.
Next, I initialized the dictionary containing all the movie information, which would be converted to a pandas data frame.
The dictionary keys become the data frame column titles after conversion:
```python
def scrape_kim_ratings(letters):
    movie_dict = {"title": [], "year": [], "mpaa": [],
                  "KIM sex": [], "KIM violence": [],
                  "KIM language": [], "KIM sex content": []}
```
Next, I defined a for loop to loop through each letter page and pull the HTML from each page using the `requests.get()` method. Once I had the HTML, I used BeautifulSoup to find all `<div>` tags with the attribute `class = "et_pb_text_inner"`:
```python
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
```
As it turns out, the letter pages contained multiple tags matching these criteria, so I had to figure out which tag contained the list of movies.
You’ll see that I looped through each of the `div` tags (`for entry in div:`), used the bs4 `getText()` method to pull the entry’s text, and looked to see if the text contained “Movie Reviews by Title.” The next tag contained the list of movies – I had figured this out by inspecting the HTML of a few of the letter pages. In the code below, `idx` is the index of the tag containing the list of movies:
```python
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
```
Next, I used the bs4 `getText()` method to get a string of all the text from the `<div>` tag with the list of movies. The object stored in `div[idx]` is an instance of the `bs4.element.Tag` class, which means we can think of it as a `<div>` tag that can be parsed and manipulated with bs4 functions and methods.
You can use Python’s `type()` function to determine this. I used `type()` heavily while I was figuring out how the bs4 functions worked and what their outputs were.
All the movies were separated by newline characters, so I used the `split()` method to get a list containing a different movie in each entry:
```python
            # All movies on the page, separated by \n
            # (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
```
To be honest, at first I didn’t know that all the movies were stored as text within the `<div>` tag. I thought I was going to have to pull the text from each `<a>` tag within the `<div>` tag.
However, using the PyCharm debugger to play around with `div[idx]`, I discovered that pulling the text from the `<div>` tag provided me with the movie information.
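Both points can be demonstrated on a made-up HTML fragment shaped like the list on a letter page (this snippet is not from the real site): the object bs4 returns is a `bs4.element.Tag`, and its text includes the rating information that trails each `<a>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one entry on a letter page
html = '<div class="et_pb_text_inner"><a href="/a/abandon.htm">Abandon</a> [2002] [PG-13] - 4.4.4</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.findAll("div", class_="et_pb_text_inner")

print(type(div[0]))      # <class 'bs4.element.Tag'>
print(div[0].getText())  # Abandon [2002] [PG-13] - 4.4.4
```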
Next, I needed to get the links that would take me to each movie page. I used the `findAll()` method to get all `<a>` tags and then used the `urljoin()` function to join the URL of the current “letter” web page (like https://kids-in-mind.com/a.htm) with the relative link to the movie page (like /a/abandon.htm). An example result is https://kids-in-mind.com/a/abandon.htm. I used a list comprehension to put them all in a list, `links`:
```python
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
```
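In isolation, `urljoin()` resolves the site’s root-relative links against whichever letter page they appeared on:

```python
from urllib.parse import urljoin

# A root-relative href resolves against the scheme and host of the base URL
full_url = urljoin("https://kids-in-mind.com/a.htm", "/a/abandon.htm")
print(full_url)  # https://kids-in-mind.com/a/abandon.htm
```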
Now I had all of the movie rating information for a given letter page and all the links to the movie pages. The next steps were to:
- Parse each string in `movies` for each rating and other pieces of information
- Follow each link in `links` and parse the sexual content
To make it easier to loop through both lists at once, I used the `zip()` function:
```python
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
```
Next, I looped through each `movie` and each `link`. First, I parsed the string in `movie` for the year, MPAA rating, Kids-In-Mind ratings, and the movie title using a function that I defined called `parse_movie()`:
```python
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
```
This function took a bit of trial and error to write.
At first, I thought all of the strings were formatted like `"Abandon [2002] [PG-13] – 4.4.4"`. However, after running the code once, I saw that some of the strings were formatted like `"Abandon [Foreign Name] [2002] [PG-13] – 4.4.4"`, with an additional set of brackets containing the film’s name in a different language.
I had to add the code block at the very beginning of the function to skip over this set of brackets.
You can see that the two main string methods I used were `find()` (to find the brackets) and `split()` (to isolate the Kids-In-Mind ratings). The last tricky bit that gave me trouble was that sometimes the Kids-In-Mind ratings were separated by a plain hyphen and other times by an en dash:
```python
def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used a dash, sometimes an en dash
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title
```
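To sanity-check the parsing logic, here is the same function applied to a made-up entry in the simple format (no foreign-name brackets, hyphen-separated ratings):

```python
def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is in the first set of brackets, the MPAA rating in the second
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # ratings trail a dash; try the en dash first, then a plain hyphen
    ratings_split = movie.split("–")
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title

# A made-up entry in the simple format
year, mpaa, ratings, title = parse_movie("Abandon [2002] [PG-13] - 4.4.4")
print(year, mpaa, ratings, title.strip())  # 2002 PG-13 [4, 4, 4] Abandon
```

Note that splitting `"PG-13"` on the hyphen is harmless because only the last element of the split, which holds the ratings, is used.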
Scraping the Sexual Content Description
The additional import statements needed for the code in this section are:
```python
import bs4.element
import time
import random
```
After parsing `movie`, it was time to follow the link to the movie’s page and pull a more detailed description of the sexual content using the function `scrape_kim_sexcontent()`.
Since this was going to require making many GET requests to the Kids-In-Mind website, I also added a variable time delay between each request using the `time.sleep()` function. I did this for two reasons:
- It’s good practice to add some sort of delay between requests so that you do not overload the website’s server.
- Adding a bit of random variation to the time delays can trick the web server into thinking your web scraping script is a human, making it less likely to reject subsequent requests.
Code:

```python
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
```
Scraping the detailed descriptions proved a bit trickier than getting the Kids-In-Mind ratings. As I mentioned above, I planned to use the bs4 method `findAll()` to get all of the `<p>` tags and find the one that contained the sexual content.
Below is the first iteration of my `scrape_kim_sexcontent()` function:
```python
def scrape_kim_sexcontent(url):
    # Request html from page and find all p tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    p_set = soup.findAll("p")
    for entry in p_set:
        if 'SEX/NUDITY' in entry.text:
            sex_content = entry.text
            break
    return sex_content
```
However, I quickly realized that some of the movie pages were organized differently. The screenshot below shows a resulting CSV file. You can see that the script pulled a paragraph from the right side of the web page instead of the sexual content paragraph.
It turns out that some of the movie pages, like the one for Abominable, had the title and the text “SEX/NUDITY” in an `<h2>` tag preceding the `<p>` tag that contained the detailed description.
To handle this variation, I added some code. The final version of `scrape_kim_sexcontent()` is below. First, I looked for all of the `<h2>` tags. Then I looped through them until I found one with an `id` attribute equal to “sex”. I used the `bs4.element.Tag` attribute `attrs` to access each tag’s attributes as a dictionary.
If you take another look at the Abominable page HTML, you can see that the `<p>` tag containing the sexual content details is at the same level as the preceding `<h2>` tag rather than being nested within it.
This means that the `<p>` tag is a sibling of the `<h2>` tag, not its child. Thus, I was able to access it using the `bs4.element.Tag` attribute `next_siblings`, which yields the siblings that follow the `<h2>` tag.
Finally, I used the `bs4.element.Tag` attribute `text` to get the paragraph I wanted:
```python
def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content
```
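To see the sibling lookup in isolation, here is a minimal sketch against a made-up HTML snippet shaped like the Abominable page (the snippet itself is not from the real site):

```python
import bs4.element
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the pages that use <h2> section headers
html = '<h2 id="sex">SEX/NUDITY 3</h2><p>Some detailed description.</p>'
soup = BeautifulSoup(html, "html.parser")

sex_content = ""
for entry in soup.findAll("h2"):
    if entry.attrs.get("id") == "sex":
        # the <p> tag is a sibling of the <h2>, not a child
        for sibling in entry.next_siblings:
            if type(sibling) == bs4.element.Tag:
                sex_content = sibling.text
                break

print(sex_content)  # Some detailed description.
```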
Organize Data and Save it to a File
The additional import statements needed for the code in this section are:
```python
import pandas as pd
import string
```
Finally, it was time to organize the scraped data and save it to a CSV file.
I decided to use the pandas library since its `to_csv()` data frame method makes it super easy to save data to a CSV file.
First, after parsing the information for each movie, I saved each piece of data in a dictionary. After each “letter” page was completed, I converted the growing dictionary to a pandas data frame using the `pd.DataFrame()` constructor and then saved the resulting data frame to a CSV file.
I decided to write to the CSV file after each “letter” page was completed to make sure that I would have data saved if the web scraping script was interrupted for some reason:
```python
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies
```
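The dictionary-to-CSV step can be sketched in isolation with made-up data; the keys match the `movie_dict` keys used in the script:

```python
import pandas as pd

# One made-up row in the same shape as movie_dict
movie_dict = {"title": ["Abandon"], "year": [2002], "mpaa": ["PG-13"],
              "KIM sex": [4], "KIM violence": [4], "KIM language": [4],
              "KIM sex content": ["(description)"]}
df_movies = pd.DataFrame(movie_dict)
df_movies.to_csv("Movies.csv")
print(df_movies.shape)  # (1, 7)
```

Each dictionary key becomes a column, and each list holds that column’s values, so appending to the lists grows the table one movie at a time.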
Lastly, I called the main function `scrape_kim_ratings()` and provided a string of all the lowercase ASCII letters:
```python
df_movies = scrape_kim_ratings(string.ascii_lowercase)
```
Conclusion
So, there you have it! Here is a link to the GitHub page with the full script https://github.com/finxter/WebScrapeKidsMovies. I’ll also attach it at the end of this article.
In the future, I think I will add functions to the script that pull information from other websites and add it to the current database, as well as a function that checks the websites for any new movies/ratings and adds them.
I hope this will inspire you to write your own web scraping script!
💡 Recommended: Basketball Statistics – Page Scraping Using Python and BeautifulSoup
The Script
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import bs4.element
import string
import time
from urllib.parse import urljoin
import random


def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")
    # Initialize
    sex_content = ""
    # Check the <h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings
    # Grab the next paragraph
    for sibling in sibling_iter:
        if type(sibling) == bs4.element.Tag:
            sex_content = sibling.text
    # Sometimes header <h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")
        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break
    return sex_content


def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0
    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)
    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)
    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]
    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used a dash, sometimes an en dash
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")
    ratings = [int(x) for x in ratings_split[-1].split(".")]
    title = movie[0:year_idx1]
    return year, mpaa, ratings, title


def scrape_kim_ratings(letters):
    movie_dict = {"title": [], "year": [], "mpaa": [],
                  "KIM sex": [], "KIM violence": [],
                  "KIM language": [], "KIM sex content": []}
    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)
        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")
            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1
            # All movies on the page, separated by \n
            # (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")
            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]
            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))
            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")
                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start
                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)
                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)
            res.close()
            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")
            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)
        else:
            print(f"Error: {res}")
    return df_movies


df_movies = scrape_kim_ratings(string.ascii_lowercase)
```