A Simple Recommendation System Using Pandas corrwith() Method

What is a Recommendation System?

If you use Netflix or Amazon, you have already seen the results of recommendation systems: movie or item recommendations that fit your taste or needs. At its core, a recommendation system is a statistical algorithm that computes similarities based on previous choices or features and recommends to users which movie to watch or what else they might want to buy.

How Does a Recommendation System Work?

Assume that persons A and B both like movie M1, and person A also likes movie M2. We can then conclude that person B will probably like movie M2 as well. That is very little data, and the prediction is rather imprecise; a real-world application would need much more data to make good recommendations. Still, it illustrates the idea: recommendation algorithms based on this concept are called collaborative filtering.
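
The A/B example can be sketched in a few lines of plain Python. The likes data and the helper recommend_for() are made up for illustration; a real system would weight many users by how similar they are instead of using a simple overlap test.

```python
# Hypothetical "likes" data: which movies each person liked.
likes = {
    "A": {"M1", "M2"},
    "B": {"M1"},
    "C": {"M3"},
}

def recommend_for(user, likes):
    """Suggest movies liked by users who share at least one liked movie."""
    own = likes[user]
    suggestions = set()
    for other, movies in likes.items():
        if other != user and own & movies:
            # The other user has overlapping taste:
            # suggest what this user has not liked yet.
            suggestions |= movies - own
    return sorted(suggestions)

print(recommend_for("B", likes))  # ['M2']
```

Person B shares M1 with person A, so A's other favorite M2 is suggested. Person C shares nothing with anyone and gets no suggestion.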

Another popular way to recommend items is so-called content-based filtering. Content-based filtering computes recommendations based on similarities between the items themselves. In the case of movies, we could look at features such as genre or actors to compute similarity.
If a user liked a given movie, the probability is high that the user will also like similar movies. Thus, it makes sense to recommend movies with a high similarity to those the user liked.
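
As a rough sketch of content-based filtering, the snippet below compares movies by their genre tags using the Jaccard similarity (shared tags divided by all tags). The genre tags and the helper names jaccard() and most_similar() are assumptions chosen for illustration, not part of any library.

```python
# Hypothetical genre tags; real systems would use richer features.
genres = {
    "Spider Man": {"action", "adventure", "sci-fi"},
    "James Bond": {"action", "adventure", "thriller"},
    "Titanic": {"drama", "romance"},
}

def jaccard(a, b):
    """Share of tags two movies have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def most_similar(title, genres):
    """Rank all other movies by genre overlap with title, best first."""
    scores = {
        other: jaccard(genres[title], tags)
        for other, tags in genres.items()
        if other != title
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(most_similar("Spider Man", genres))
# [('James Bond', 0.5), ('Titanic', 0.0)]
```

A user who liked Spider Man would be pointed to James Bond first, because the two share half of their combined genre tags.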

Implementing a Recommendation System

If you want to understand the code below better, make sure to sign up for our free email course “Introduction to Pandas and Data Science” on our Email Academy. Throughout the course, we develop a recommendation system for movies. At its core, there is the method corrwith() from the Pandas library.

This is the final implementation of our recommendation system:

Download the MovieLens data set from: https://grouplens.org/datasets/movielens/latest/
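
A minimal sketch of what such a system could look like follows; it is not the course's implementation. It assumes the ratings are available as a long table with the columns userId, title and rating. In the MovieLens download, ratings.csv holds userId, movieId and rating, and the titles live in movies.csv, so the two files would first have to be merged on movieId. To keep the snippet self-contained, a tiny made-up table stands in for the real data, and similar_movies() is a hypothetical helper name.

```python
import pandas as pd

# Tiny stand-in for MovieLens ratings joined with titles (made-up numbers).
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "title": ["Spider Man", "James Bond", "Titanic"] * 4,
    "rating": [3.5, 1.0, 5.0, 1.0, 2.5, 4.5, 4.5, 5.0, 1.0, 5.0, 4.0, 2.0],
})

# Reshape to one row per user and one column per movie.
user_movie = ratings.pivot_table(index="userId", columns="title", values="rating")

def similar_movies(title, user_movie):
    """Correlate every movie's ratings with those of title, best match first."""
    scores = user_movie.corrwith(user_movie[title])
    return scores.drop(title).sort_values(ascending=False)

print(similar_movies("James Bond", user_movie))
```

With real data, most users have rated only a few movies, so the pivoted table is full of missing values. It would then be sensible to ignore movies with very few ratings before trusting the correlations.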

How to Use Pandas corrwith() Method?

The Pandas DataFrame offers the method corrwith(), which computes pairwise correlations between the rows or columns of a DataFrame and those of another DataFrame or a Series. With the parameter axis, you choose whether the correlations are computed for the columns or for the rows. Here is the signature with the parameters covered below; every parameter except other is optional and has a default value:

DataFrame.corrwith(other, axis=0, drop=False, method='pearson')

The arguments in detail:
1.) other: A Series or DataFrame with which to compute the correlation.
2.) axis: Pass 0 or ‘index’ (the default) to compute one correlation per column, 1 or ‘columns’ to compute one correlation per row.
3.) drop: Drop missing indices from result.
4.) method: The algorithm used to compute the correlation. You can either choose from: ‘pearson’, ‘kendall’ or ‘spearman’ or implement your own algorithm. So, either you pass one of the three strings or a callable.
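
To make the axis and method arguments more concrete, here is a small sketch with made-up numbers. The function np_pearson() is a hypothetical custom method; it only illustrates that a callable receives two one-dimensional arrays and has to return a single float.

```python
import numpy as np
import pandas as pd

# Made-up numbers: four rows, three columns.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, 3, 2, 4]})

# axis=0 (default): other is matched on the row labels 0..3,
# and the result holds one correlation per column.
by_column = df.corrwith(pd.Series([10, 20, 30, 40]))

# axis=1: other is matched on the column labels a, b, c,
# and the result holds one correlation per row.
by_row = df.corrwith(pd.Series({"a": 1, "b": 2, "c": 3}), axis=1)

def np_pearson(x, y):
    """Hypothetical custom method: takes two 1-D arrays, returns a float."""
    return float(np.corrcoef(x, y)[0, 1])

# Passing a callable instead of one of the three method names.
custom = df.corrwith(pd.Series([10, 20, 30, 40]), method=np_pearson)

print(by_column)
print(by_row)
print(custom)
```

Because np_pearson() computes the same Pearson coefficient as the default method, custom and by_column should agree up to rounding.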

Here is a practical example:

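import pandas as pd

# Ratings of four users (rows) for three movies (columns)
ratings = {
    'Spider Man': [3.5, 1.0, 4.5, 5.0],
    'James Bond': [1.0, 2.5, 5.0, 4.0],
    'Titanic': [5.0, 4.5, 1.0, 2.0]
}

# Ratings of the same four users for a new movie
new_movie_ratings = pd.Series([2.0, 2.5, 5.0, 3.5])
all_ratings = pd.DataFrame(ratings)

# Correlate each movie's ratings with the ratings of the new movie
print(all_ratings.corrwith(new_movie_ratings))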

From a given dictionary of lists (ratings) we create a DataFrame. This DataFrame has three columns and four rows. Each column contains the movie ratings of all four users.
The Series new_movie_ratings contains the ratings for a new movie of all four users.
Using the method corrwith() on the DataFrame we get the correlation between the new ratings and the old ones.
The output of the snippet above is:

Spider Man    0.566394
James Bond    0.953910
Titanic      -0.962312

As you can see, the new movie has the highest correlation with the James Bond movie. This means that a recommendation system working purely on ratings should recommend the James Bond movie to users who liked the new movie.
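
One possible way to turn these correlations into an actual recommendation is sketched below. It repeats the data from the example so that it runs on its own. Keeping only positively correlated movies is an assumption made for illustration, not a fixed rule.

```python
import pandas as pd

all_ratings = pd.DataFrame({
    "Spider Man": [3.5, 1.0, 4.5, 5.0],
    "James Bond": [1.0, 2.5, 5.0, 4.0],
    "Titanic": [5.0, 4.5, 1.0, 2.0],
})
new_movie_ratings = pd.Series([2.0, 2.5, 5.0, 3.5])

correlations = all_ratings.corrwith(new_movie_ratings)

# Keep only movies that correlate positively, strongest first.
ranked = correlations[correlations > 0].sort_values(ascending=False)

print(ranked.index.tolist())  # ['James Bond', 'Spider Man']
print(correlations.idxmax())  # James Bond
```

Titanic drops out because its ratings move in the opposite direction to those of the new movie.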
Yet, what exactly is correlation?

What is Correlation?

Correlation describes the statistical relationship between two variables, that is, how they move in relation to one another. It is given as a value between -1 and +1. However, correlation is not causation!

There are three types of correlation:

  • Positive correlation:
    A positive correlation is a value in the range 0.0 < c <= 1.0. It means that when the first variable moves up, the second one tends to move up as well. At 1.0 the two variables follow an exact increasing linear relationship; the closer the value is to 0.0, the weaker the relationship.
  • Negative correlation:
    A negative correlation is a value in the range -1.0 <= c < 0.0. It means that the two variables show opposite behaviour: when the first one moves up, the second one tends to move down.
  • Zero or no correlation:
    A correlation of zero means there is no linear relationship between the two variables. If the first variable moves up, the second one may move up, move down, or stay the same.
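
The three cases can be reproduced with a hand-written Pearson correlation, which is the default method of corrwith(). The function pearson() below is an illustrative sketch using the textbook formula (covariance divided by the product of the spreads), and the input numbers are made up.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))          # 1.0, perfect positive
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))          # -1.0, perfect negative
print(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0, although y = x**2
```

The last line shows why zero correlation only rules out a linear relationship: y depends entirely on x, yet the correlation is zero.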
