Content-Based Movie Recommendation System

Jun 30, 2021

Using Cosine Similarity

What are Recommendation Systems?

A recommender system, or a recommendation system, is “a subclass of information filtering system that seeks to predict the ‘rating’ or ‘preference’ a user would give to an item” (Wikipedia). These systems use machine learning algorithms to suggest relevant products to users, based either on their own past activity or on the activity of users similar to them. Amazon, Netflix and YouTube are a few examples of companies that use them.

Project Goal

This project aims to build a movie recommendation system for a streaming service that offers a wide variety of movies on its platform. I’m only covering the content-based system here.

Scenario

The streaming service would like to cater to new users on the platform by taking the movies they’ve enjoyed in the past and recommending similar ones.

The Data

The data used here was obtained from the latest MovieLens dataset. It contains 100,836 user ratings, 9,742 movies and 3,683 movie tags. It also contains a movie links file, but it won’t be used here.

Data Exploration

Import necessary packages

First, we’ll import Pandas to load our data into a Jupyter notebook.

Seaborn and Matplotlib are for data visualization.

CountVectorizer tokenizes a collection of documents, builds a vocabulary of known words from them and uses that vocabulary to encode each document as a vector of word counts.
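To make that concrete, here is a minimal sketch of CountVectorizer on a tiny made-up corpus. The three strings are hypothetical stand-ins for movie features, not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: each string stands in for one movie's text features.
docs = [
    "adventure animation children",
    "adventure children fantasy",
    "comedy romance",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()  # dense copy, only sensible for small examples

print(vocab)
print(dense)
```

Each column corresponds to one vocabulary word, and each cell holds how many times that word appears in the document.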

Cosine similarity measures how closely two vectors are related by taking the cosine of the angle between them. The value ranges from -1 to 1. A value of -1 means the vectors point in opposite directions, 0 means they are perpendicular and 1 means they point in the same direction. Because word counts are never negative, the similarity between two count vectors always falls between 0 and 1.
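As a quick illustration, the sketch below computes cosine similarity by hand as the dot product divided by the product of the vector lengths, then compares it with scikit-learn’s cosine_similarity. The vectors and the helper name `cosine` are made up for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical count vectors for three movies over a four-word vocabulary.
a = np.array([1, 1, 1, 0])
b = np.array([1, 0, 1, 1])
c = np.array([0, 0, 0, 1])

sim_ab = cosine(a, b)  # two shared words out of three each: 2 / (sqrt(3) * sqrt(3))
sim_ac = cosine(a, c)  # no shared words, so the vectors are perpendicular

# scikit-learn computes every pairwise similarity at once.
sim_matrix = cosine_similarity(np.vstack([a, b, c]))
```

Here `a` and `b` share two of their three words, so their similarity is 2/3, while `a` and `c` share none and score 0.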

Load the CSV files into Pandas dataframes.

We’ll check that there are no null values in each of the dataframes and make sure all our variables are in the right format. We’ll also take a look at the first five rows of each of them.
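A rough sketch of the loading and checking steps is shown below. To keep it runnable on its own, it reads a small in-memory sample instead of the real file; in the notebook the call would be something like pd.read_csv("ratings.csv"). The file name, the column names and the sample rows are all assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for the ratings file (hypothetical rows and column names).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,1000000001\n"
    "1,2,3.5,1000000002\n"
    "2,1,5.0,1000000003\n"
)
ratings = pd.read_csv(sample_csv)

null_counts = ratings.isnull().sum()  # number of missing values per column
column_types = ratings.dtypes         # confirms each column has a sensible type
shape = ratings.shape                 # (rows, columns)
first_rows = ratings.head()           # first five rows (fewer if the frame is shorter)

print(null_counts)
print(column_types)
print(first_rows)
```

The same checks can be repeated for the movies and tags dataframes.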

The movies dataframe contains the movie id, title and the genres. It has 9,742 rows and 3 columns and has no null values.

The ratings dataframe has 100,836 rows and 4 columns with no null values. The variables are in the right format. It contains info about the user id, movie id, rating and timestamp.

The tags dataframe has 3,683 rows and 4 columns containing info about the user id, movie id, tag and timestamp.

Next, we’ll drop the ‘timestamp’ column from both the ratings and tags dataframes.
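Dropping the column could look like the sketch below. The two small dataframes and their column names other than ‘timestamp’ are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings and tags dataframes.
ratings = pd.DataFrame(
    {"userId": [1, 2], "movieId": [1, 1], "rating": [4.0, 5.0], "timestamp": [1000000001, 1000000002]}
)
tags = pd.DataFrame(
    {"userId": [1], "movieId": [1], "tag": ["funny"], "timestamp": [1000000003]}
)

# drop returns a new dataframe, so assign the result back.
ratings = ratings.drop(columns=["timestamp"])
tags = tags.drop(columns=["timestamp"])
```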

Exploratory Data Analysis

Frequency Distribution of Rating Scores in the ratings dataset

In the plot below, we see that the rating scale ranges from 0.5 to 5.0 in increments of 0.5. The most common ratings are 3.0 and 4.0, with 5.0 coming in third. We also see that people were less likely to give low ratings, as shown by the small number of ratings between 0.5 and 2.5.
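The counts behind a plot like this can be computed with value_counts. The sketch below uses a handful of made-up ratings, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset.
ratings = pd.DataFrame({"rating": [4.0, 3.0, 4.0, 5.0, 3.0, 4.0, 0.5, 2.5]})

# Count how often each score occurs, ordered by score rather than by frequency.
rating_counts = ratings["rating"].value_counts().sort_index()
print(rating_counts)

# In a notebook, one way to draw the distribution is seaborn's countplot:
# sns.countplot(x="rating", data=ratings)
```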

To see the average ratings per genre, we are going to merge the movies and ratings datasets.

When we take a look, we see that each movie can have multiple genres, separated by the ‘|’ symbol. So in order to use the genres column, we’ll have to split it so that each genre gets its own row.
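One possible way to do the merge, the split and the per-genre average is sketched below with str.split, explode and groupby. The sample rows and the movieId key are assumptions, and this may not be exactly how the original notebook did it.

```python
import pandas as pd

# Hypothetical movies and ratings.
movies = pd.DataFrame(
    {"movieId": [1, 2], "title": ["Movie A", "Movie B"], "genres": ["Drama|War", "Comedy|Drama"]}
)
ratings = pd.DataFrame({"userId": [1, 2, 1], "movieId": [1, 1, 2], "rating": [5.0, 4.0, 3.0]})

# Attach the title and genres to every rating.
movie_ratings = ratings.merge(movies, on="movieId", how="left")

# Turn "Drama|War" into ["Drama", "War"], then give each genre its own row.
exploded = movie_ratings.assign(genre=movie_ratings["genres"].str.split("|")).explode("genre")

# Average rating per genre, highest first.
avg_by_genre = exploded.groupby("genre")["rating"].mean().sort_values(ascending=False)
print(avg_by_genre)
```

A rating for a movie with two genres counts once towards each genre’s average.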

The genres with the highest average ratings are Film-Noir, War and Documentary.

Now we’ll create a word cloud showing the most commonly mentioned genres in the dataset. Drama, Comedy, Thriller, Action and Romance are the top 5 most frequent genres.

Feature Engineering with the genres column and the tags column.

First, we’ll merge the movie_ratings dataframe with the tags dataframe to form the mt_ratings dataframe.

Then we’ll split each genres value in the mt_ratings dataframe, remove the ‘|’ symbol and convert the result to lowercase.
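A small sketch of this cleaning step, assuming the genres are joined back into one space-separated string so they can be vectorized later. The sample rows are made up.

```python
import pandas as pd

# Hypothetical slice of mt_ratings.
mt_ratings = pd.DataFrame(
    {"title": ["Movie A", "Movie B"], "genres": ["Adventure|Animation|Children", "Comedy|Romance"]}
)

# Split on '|', join the pieces with spaces, then lowercase everything.
mt_ratings["genres"] = mt_ratings["genres"].str.split("|").str.join(" ").str.lower()
print(mt_ratings["genres"].tolist())
```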

Next, we’ll form a ‘combined_features’ column by creating a bag of words that contains the words from the genres and tag columns for each movie row.
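The combined column can be built with simple string concatenation, as in the sketch below. Treating missing tags as empty strings and lowercasing the tags are assumptions on my part, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical slice of mt_ratings after the genres have been cleaned.
mt_ratings = pd.DataFrame(
    {
        "title": ["Movie A", "Movie A", "Movie B"],
        "genres": ["adventure children", "adventure children", "comedy romance"],
        "tag": ["Pixar", "fun", None],  # Movie B has no tag
    }
)

# Assumption: movies without a tag get an empty string so concatenation still works.
mt_ratings["tag"] = mt_ratings["tag"].fillna("").str.lower()

# Bag of words per row: the genre words followed by the tag words.
mt_ratings["combined_features"] = (mt_ratings["genres"] + " " + mt_ratings["tag"]).str.strip()
print(mt_ratings["combined_features"].tolist())
```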

Building the Content-Based Recommendation System.

In this section we are going to create a class that does two things. It uses CountVectorizer from sklearn.feature_extraction.text to transform the combined_features column in mt_ratings into a matrix of token counts. It then uses cosine_similarity from sklearn.metrics.pairwise to calculate the cosine similarity between the rows of that matrix.

We’ll save this class as a .py file and import it into our notebook as well.

When the class is instantiated, it takes in the name of a movie and the mt_ratings dataframe. It then returns a list of the fifteen movies with the highest cosine similarity to that movie, and these are recommended to the user.
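The class itself isn’t reproduced here, so what follows is only a minimal sketch of how such a class might look. The class name ContentRecommender, the recommend method, the top_n parameter and the step that joins all feature rows of a movie into one string are assumptions, not the original implementation.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ContentRecommender:
    """Hypothetical sketch of a content-based recommender."""

    def __init__(self, title, df, top_n=15):
        self.title = title
        self.top_n = top_n
        # Assumption: collapse to one row per movie by joining its feature strings.
        self.movies = (
            df.groupby("title", sort=False)["combined_features"].agg(" ".join).reset_index()
        )

    def recommend(self):
        matches = self.movies.index[self.movies["title"] == self.title]
        if len(matches) == 0:
            raise ValueError(f"{self.title!r} is not in the dataframe")
        idx = int(matches[0])

        # Matrix of token counts, one row per movie.
        counts = CountVectorizer().fit_transform(self.movies["combined_features"])
        # Similarity of the chosen movie to every movie (including itself).
        sims = cosine_similarity(counts[idx:idx + 1], counts).ravel()

        # Most similar first; a stable sort keeps the original order among ties.
        ranked = (-sims).argsort(kind="stable")
        ranked = [i for i in ranked if i != idx][: self.top_n]
        return self.movies.loc[ranked, "title"].tolist()


# Hypothetical data to exercise the class.
mt_ratings = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "combined_features": [
            "adventure children fantasy",
            "adventure children",
            "adventure comedy",
            "romance drama",
        ],
    }
)
recs = ContentRecommender("Movie A", mt_ratings, top_n=2).recommend()
print(recs)
```

Only the row for the requested movie is compared against the rest, which avoids building the full movie-by-movie similarity matrix when a single title is queried.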