How to Scrape the Details of 250 Top Rated Movies in Python

This article shows how to scrape imdb.com/chart/top/, a page that lists the 250 top-rated movies on IMDb.

Screenshot of IMDB Top Movie Charts

This article is solely for educational purposes.

👉 Recommended Tutorial: Web Scraping – Is It Legal?

The tool used to extract the data is Scrapy, and the commands below assume a UNIX-like operating system.

Virtual Environment Set Up

It is better to set up the project inside a virtual environment. There are different ways to create one; here we use Python's built-in venv module.

$ python3 -m venv 'name_of_enviro'

Once the virtual environment is created, activate it as the project environment with the following command.

$ source 'name_of_enviro'/bin/activate

After activating the virtual environment, the prompt shows the virtual environment name as follows (here I am using `venv` as the name of the virtual environment).

(venv)$

Scrapy Installation

This command will install Scrapy in the virtual environment:

(venv)$ pip install scrapy

Creating a Project in Scrapy

Before we start extracting data, we need to set up a new Scrapy project, i.e., a directory that will hold all the Scrapy code and configuration.

(venv)$ scrapy startproject top250Movies

The above command creates a `top250Movies` directory with the following files and directories.

top250Movies
├── scrapy.cfg  # deploy configuration file
└── top250Movies  # project's Python module
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders   # the directory where the spiders live
        └── __init__.py

After starting a new project, always move to the project directory.

Our project directory is named `top250Movies`, so we move into that directory and start writing our code by creating a Python file inside the `spiders` directory.

Scrapy can only run the project from within the project directory; otherwise, it will raise an error.

(venv)$ cd top250Movies
(venv)[top250Movies]$

Start Coding

Let's create a Python file inside the `spiders` directory.

(venv)[top250Movies]$ touch top250Movies/spiders/firstSpider.py

We have created our project file; now we need to import the library and build a spider. Spiders are where we define the custom behavior for crawling and parsing pages for a particular site (or a group of sites).

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'movies'
    start_urls = ['https://www.imdb.com/chart/top/']

    def parse(self, response):
        pass

The above code shows the basic structure of a spider.

The scraping goes through the following cycles:

  1. Start by generating the initial Requests to crawl the first URL and specify a callback function to be called with the response downloaded from those requests (a minimal sketch of this step follows the list).
  2. In the callback function, we parse the response/URL and return item objects.
  3. In callback functions, we parse the page contents, typically using Selectors (CSS Selector / Xpath Selector).
  4. Finally, the items returned from the spider will persist in a database or be written to a file using Feed exports (JSON, CSV, etc.)
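
As a minimal sketch of steps 1 and 2, the `start_urls` shortcut used in our spider is roughly equivalent to overriding `start_requests()` and yielding the initial Request with an explicit callback; the URL and names below simply mirror our spider:

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'movies'

    def start_requests(self):
        # Roughly what Scrapy does for each entry in start_urls:
        # build a Request and point it at the parse() callback.
        yield scrapy.Request('https://www.imdb.com/chart/top/',
                             callback=self.parse)

    def parse(self, response):
        # Steps 2 and 3: parse the downloaded response and return items here.
        pass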

Here we need to pay attention to how the variables and functions are named:

  • `movies` is a string that defines the name of this spider. Scrapy uses the spider name to locate the spider, so it must be unique within a project.
  • `start_urls` contains a list of URLs on which the spider will begin to crawl.
  • `parse(response)` is the default callback used by Scrapy to process downloaded responses when the requests do not specify a callback. The `parse` method processes the response and returns the extracted data.

Before proceeding, let's look at the Scrapy shell, which is very useful for identifying the items we need to scrape and testing selectors interactively to make sure we get the expected results.

To start the Scrapy shell, use the following command:

(venv)[top250Movies]$ scrapy shell
>>> fetch('https://www.imdb.com/chart/top/')

The fetch command, followed by a URL, downloads the given URL using the Scrapy downloader and writes the contents to the response object.

The response is an object that represents an HTTP response, which is usually downloaded and fed to the Spiders for processing.

Useful response attributes (a short shell session illustrating them follows this list):

  • url (string) – the URL of the response.
  • status (integer) – the HTTP status code. If it is 200, we are good to go.
  • headers (dictionary) – the response headers. The dictionary values can be strings for single-valued headers and lists for multi-valued headers.
  • body (bytes) – the response body. To access the decoded text as a string, we use response.text.
  • view(response) – opens the response URL in a browser (this is a shell helper rather than a response attribute). Spiders sometimes see pages differently from regular users, so this can be used to confirm that what the spider sees matches what we expect.
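
As an illustration, here is a minimal shell session showing how these attributes can be inspected; the status code and the truncated text are typical values rather than exact output:

>>> response.url
'https://www.imdb.com/chart/top/'
>>> response.status
200
>>> response.headers.get('Content-Type')
>>> response.text[:50]      # decoded body as a string
>>> view(response)          # open the downloaded page in a browser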

Selecting Element Attributes

There are different ways to select page elements and extract their content. Here we use simple CSS selector syntax:

>>> response.css("td.titleColumn a::text").get()
'The Shawshank Redemption'

Inspecting the IMDb page for the movie name shows the following HTML, which the CSS selector above targets:

<td class="titleColumn">
    <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
</td>

To get the complete list of movies, instead of the get() method we use the getall() method, which returns all the movie names as a list, as shown below.
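
For example, getall() on the same selector should return something like the following (abbreviated here; the exact titles and their order depend on the current state of the chart):

>>> movieNames = response.css("td.titleColumn a::text").getall()
>>> len(movieNames)
250
>>> movieNames[:2]
['The Shawshank Redemption', 'The Godfather']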

Similarly, we can use the following CSS selector to get the movie release years, which sit inside the <span> tags:

>>> response.css("td.titleColumn span::text").getall()

In each selector, we appended `::text` to the <a> and <span> tags, which extracts the text content inside each tag.
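
To see why `::text` matters: without it, the selector returns the whole element serialized as markup rather than just its text (output abridged here):

>>> response.css("td.titleColumn a").get()
'<a href="/title/tt0111161/" title="Frank Darabont (dir.), ...">The Shawshank Redemption</a>'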

Now we are ready to write the code in our spider to crawl the website.

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'movies'
    start_urls = ['https://www.imdb.com/chart/top/']

    def parse(self, response):
        # Each movie row lives in a <td class="titleColumn"> cell.
        movieContent = response.css("td.titleColumn")
        for item in movieContent:
            movieName = item.css("a::text").get()
            movieYear = item.css("span::text").get()
            movieDict = {'Movie Name': movieName,
                         'Release Year': movieYear}
            yield movieDict

Here we used an additional variable, `movieContent`, which stores the titleColumn cells (containing both the <a> and <span> tags) as a list of selectors.

We can now iterate over all the titleColumn elements and collect each movie's details into a Python dictionary. A Scrapy spider generates dictionaries of data extracted from the page, which is why we use Python's yield keyword in the callback, as shown in the code above.

To execute the spider without any error, we have to reach the project's top-level directory and run:

(venv)[~/top250Movies] $ scrapy crawl movies

This command executes the spider named 'movies' (the name we set in the spider class), which sends requests to the URL and logs the scraped items to the console.
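
A shortened, illustrative excerpt of that log output (timestamps and the exact order of items will differ on your run):

2022-08-10 11:30:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/chart/top/>
{'Movie Name': 'The Shawshank Redemption', 'Release Year': '(1994)'}
2022-08-10 11:30:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/chart/top/>
{'Movie Name': 'The Godfather', 'Release Year': '(1972)'}
...
2022-08-10 11:30:16 [scrapy.core.engine] INFO: Spider closed (finished)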

We have extracted the 250 top movie names with their release years and logged them to the screen.

Now we will see how to store this output into a file.

To store the output data in a file, we use the -o option followed by a filename (JSON/CSV).

(venv)[~/top250Movies] $ scrapy crawl movies -o movieList.csv

It will create a CSV file named `movieList.csv` in our project directory.
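
The same option works for JSON output; for example, the following command would write the items to `movieList.json` instead:

(venv)[~/top250Movies] $ scrapy crawl movies -o movieList.json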

Summary

This article showed how to install Scrapy in a virtual environment.

We learned how to start a project in Scrapy and the basic structure of a Scrapy project folder.

We learned about the Scrapy shell and the commands for getting details from a URL.

Afterward, we learned how to write the spider for scraping the IMDb website using CSS selectors.

Last but not least, we learned how to generate the output and store the data in a JSON/CSV file.

👉 Recommended Tutorial: Python Developer – Income and Opportunity