How to Build a Movie Recommendation System

Article by Ramya Vidiyala | September 17, 2020

Have you ever wondered how YouTube recommends content, or how Facebook recommends you new friends? Perhaps you’ve noticed similar recommendations with LinkedIn connections, or how Amazon will recommend you similar products while you’re browsing. All of these recommendations are made possible by the implementation of recommender systems.

Recommender systems encompass a class of techniques and algorithms that can suggest “relevant” items to users. They predict future behavior based on past data through a multitude of techniques including matrix factorization.

In this article, I’ll look at why we need recommender systems and the different types in use online. Then, I’ll show you how to build your own movie recommendation system using an open-source dataset.


  • Why Do We Need Recommender Systems?
  • Types of Recommender Systems
    • Content-Based Movie Recommendation Systems
    • Collaborative Filtering Movie Recommendation Systems
  • The Dataset
  • Designing a Movie Recommendation System
  • Implementation
    • Step 1: Matrix Factorization-based Algorithm
    • Step 2: Creating Handcrafted Features
    • Step 3: Creating a Final Model for our Movie Recommendation System
  • Performance Metrics
  • Summary


Why Do We Need Recommender Systems?

We now live in what some call the “era of abundance”. For any given product, there are sometimes thousands of options to choose from. Think of the examples above: streaming videos, social networking, online shopping; the list goes on. Recommender systems help to personalize a platform and help the user find something they like.

The simplest way to do this is to recommend the most popular items. However, to really enhance the user experience through personalized recommendations, we need dedicated recommender systems.

From a business standpoint, the more relevant products a user finds on the platform, the higher their engagement. This often results in increased revenue for the platform itself. Various sources say that as much as 35-40% of tech giants’ revenue comes from recommendations alone.

Now that we understand the importance of recommender systems, let’s have a look at types of recommendation systems, then build our own with open-sourced data!


Types of Recommender Systems

Machine learning algorithms in recommender systems typically fit into two categories: content-based systems and collaborative filtering systems. Modern recommender systems combine both approaches.

Let’s have a look at how they work using movie recommendation systems as a base.


Content-Based Movie Recommendation Systems

Content-based methods are based on the similarity of movie attributes. Using this type of recommender system, if a user watches one movie, similar movies are recommended. For example, if a user watches a comedy movie starring Adam Sandler, the system will recommend them movies in the same genre, or starring the same actor, or both. With this in mind, the input for building a content-based recommender system is movie attributes.

Figure 1: Overview of Content-Based System
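To make the idea concrete, here is a minimal sketch of a content-based lookup. The movie titles and the four attribute columns are hypothetical, and cosine similarity is assumed as the similarity measure (the same measure this article uses later on).

```python
import numpy as np

# Hypothetical attribute vectors: [comedy, drama, action, stars_adam_sandler]
movie_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
movie_attributes = np.array([
    [1, 0, 0, 1],  # Movie A: comedy starring Adam Sandler
    [1, 0, 0, 0],  # Movie B: comedy
    [0, 1, 0, 1],  # Movie C: drama starring Adam Sandler
    [0, 0, 1, 0],  # Movie D: action
], dtype=float)


def most_similar_movies(attributes, watched_index, top_n=2):
    """Return indices of the top_n movies most similar to the watched one."""
    # Assumes every movie has at least one non-zero attribute
    norms = np.linalg.norm(attributes, axis=1)
    # Cosine similarity between the watched movie and every movie
    similarities = attributes @ attributes[watched_index] / (norms * norms[watched_index])
    similarities[watched_index] = -np.inf  # never recommend the watched movie itself
    return np.argsort(similarities)[::-1][:top_n]


recommended = most_similar_movies(movie_attributes, watched_index=0)
print([movie_titles[i] for i in recommended])  # ['Movie B', 'Movie C']
```

A user who watched Movie A is offered the other comedy first, then the other movie with the same actor; the unrelated action movie ranks last.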


Collaborative Filtering Movie Recommendation Systems

With collaborative filtering, the system is based on past interactions between users and movies. With this in mind, the input for a collaborative filtering system is made up of past data of user interactions with the movies they watch.

For example, if user A watches M1, M2, and M3, and user B watches M1, M3, M4, we recommend M1 and M3 to a similar user C. You can see how this looks in the figure below for clearer reference. 

Figure 2: An example of the collaborative filtering movie recommendation system


This data is stored in a matrix called the user-movie interactions matrix, where the rows are the users and the columns are the movies.
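As a small illustration, the watch history of users A and B from the example above can be turned into such a matrix. This is only a sketch with binary "watched" flags; the real dataset used below stores ratings instead.

```python
import numpy as np

# Watch history from the example above
watched = {
    "A": ["M1", "M2", "M3"],
    "B": ["M1", "M3", "M4"],
}

users = sorted(watched)
movies = sorted({m for titles in watched.values() for m in titles})

# Rows are users, columns are movies; 1 means the user watched the movie
interactions = np.zeros((len(users), len(movies)), dtype=int)
for row, user in enumerate(users):
    for movie in watched[user]:
        interactions[row, movies.index(movie)] = 1

print(movies)        # ['M1', 'M2', 'M3', 'M4']
print(interactions)  # [[1 1 1 0]
                     #  [1 0 1 1]]

# Movies that every user in the matrix has watched
watched_by_both = [m for m, column in zip(movies, interactions.T) if column.all()]
print(watched_by_both)  # ['M1', 'M3']
```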

Now, let’s implement our own movie recommendation system using the concepts discussed above.


The Dataset

For our own system, we’ll use the open-source MovieLens dataset from GroupLens. This dataset contains 100K data points of various movies and users.

We will use three columns from the data:

  • userId
  • movieId
  • rating

You can see a snapshot of the data in figure 3, below:

Figure 3: Snapshot of the data.


Designing our Movie Recommendation System

To obtain recommendations for our users, we will predict their ratings for movies they haven’t watched yet. Movies are then indexed and suggested to users based on these predicted ratings.

To do this, we will use past records of movies and user ratings to predict their future ratings. At this point, it’s worth mentioning that in the real world, we will likely encounter new users or movies without a history. Such situations are called cold start problems.

Let’s take a brief look at how cold start problems can be addressed.


Cold Start Problems

Cold start problems can be handled by recommendations based on meta-information, such as:

  • For new users, we can use their location, age, gender, browser, and user device to predict recommendations.
  • For new movies, we can use genre, cast, and crew to recommend it to target users.



Implementation

For our recommender system, we’ll use both of the techniques mentioned above: content-based and collaborative filtering. To find the similarity between movies for our content-based method, we’ll use a cosine similarity function. For our collaborative filtering method, we’ll use a matrix factorization technique.

The first step towards this is creating a matrix factorization-based model. We’ll use the output of this model and a few handcrafted features to provide inputs to the final model. The basic process will look like this:

  • Step 1: Build a matrix factorization-based model
  • Step 2: Create handcrafted features
  • Step 3: Implement the final model

We’ll look at these steps in greater detail below.


Step 1: Matrix Factorization-based Algorithm

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. This family of methods became widely known during the Netflix prize challenge due to how effective it was.

Matrix factorization algorithms work by decomposing the user-movie interaction matrix into the product of two lower-dimensional rectangular matrices, say U and M. The decomposition is chosen so that the product closely approximates the known values of the user-movie interaction matrix. Here, U represents the user matrix, M represents the movie matrix, n is the number of users, and m is the number of movies. If the interaction matrix has shape n × m, then U has shape n × k and M has shape k × m, where k is the (much smaller) number of latent factors.

Each row of the user matrix represents a user and each column of movie matrix represents a movie.


Once we obtain the U and M matrices, which are fitted on the non-empty cells of the user-movie interaction matrix, we take the product of U and M to predict the values of the empty cells, that is, the ratings a user has not given yet.
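The following NumPy sketch illustrates the idea on a toy matrix. It is not how the Surprise library implements SVD; the function name `factorize`, the plain stochastic gradient descent loop, the hyperparameters, and the toy ratings are all assumptions made for illustration, and zeros stand for empty cells.

```python
import numpy as np


def factorize(ratings, n_factors=2, n_epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Fit U (n x k) and M (k x m) on the non-zero cells of `ratings` with SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    M = rng.normal(scale=0.1, size=(n_factors, n_movies))
    observed = np.argwhere(ratings > 0)  # (user, movie) pairs that have a rating
    for _ in range(n_epochs):
        rng.shuffle(observed)  # visit the observed cells in random order
        for u, m in observed:
            error = ratings[u, m] - U[u] @ M[:, m]
            u_old = U[u].copy()
            # Gradient step on the squared error with L2 regularization
            U[u] += lr * (error * M[:, m] - reg * U[u])
            M[:, m] += lr * (error * u_old - reg * M[:, m])
    return U, M


# Toy user-movie matrix: 5 users, 4 movies, 0 = not rated
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

U, M = factorize(ratings)
predicted = U @ M  # dense matrix: includes predictions for the empty cells
mask = ratings > 0
train_rmse = np.sqrt(np.mean((ratings[mask] - predicted[mask]) ** 2))
print(np.round(predicted, 2))
print(round(float(train_rmse), 3))
```

The cells that were zero in `ratings` now hold predicted ratings, which is exactly what a recommender needs in order to rank unseen movies.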

To implement matrix factorization, we use Surprise, a simple Python library for building and testing recommender systems. The data frame is converted into a trainset, the data format that the Surprise library accepts.

import numpy as np
from surprise import SVD, Reader, Dataset

# specify how to read the data frame (ratings range from 1 to 5)
reader = Reader(rating_scale=(1, 5))

# create the train data from the data frame
train_data_mf = Dataset.load_from_df(train_data[['userId', 'movieId', 'rating']], reader)

# build the trainset from the train data; this is the dataset format used by Surprise
trainset = train_data_mf.build_full_trainset()

# initialize the SVD model and fit it on the trainset
svd = SVD(n_factors=100, biased=True, random_state=15, verbose=True)
svd.fit(trainset)

Now the model is ready. We’ll store these predictions to pass to the final model as an additional feature. This will help us incorporate collaborative filtering into our system.

#getting predictions of train set
train_preds = svd.test(trainset.build_testset())
train_pred_mf = np.array([pred.est for pred in train_preds])

Note that we have to repeat the above steps for the test data as well.


Step 2: Creating Handcrafted Features

Let’s convert the data in the data frame format into a user-movie interaction matrix. Matrices used in this type of problem are generally sparse because there’s a high chance users may only rate a few movies.

We’ll store the data in the compressed sparse row (CSR) format, one of several sparse matrix formats. Its advantages are as follows:

  • efficient arithmetic operations: CSR + CSR, CSR * CSR, etc.
  • efficient row slicing
  • fast matrix-vector products

scipy.sparse.csr_matrix efficiently builds a sparse matrix from the rating, userId, and movieId columns of the data frame.

from scipy import sparse

# Creating a sparse matrix: rows are userIds, columns are movieIds, values are ratings
train_sparse_matrix = sparse.csr_matrix((train_data.rating.values, (train_data.userId.values, train_data.movieId.values)))


‘train_sparse_matrix’ is the sparse matrix representation of the train_data data frame.
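A toy example shows what this constructor form produces. The ids and ratings below are made up; the point is that the ids are used directly as row and column indices, so the matrix has one row per possible userId up to the maximum, and ids that never occur simply yield empty rows or columns.

```python
import numpy as np
from scipy import sparse

# Hypothetical ratings given as parallel arrays of userId, movieId, and rating
user_ids = np.array([1, 1, 2, 4])
movie_ids = np.array([10, 12, 10, 11])
ratings = np.array([4.0, 3.0, 5.0, 2.0])

# Same constructor form as above: (data, (row_indices, column_indices))
toy_matrix = sparse.csr_matrix((ratings, (user_ids, movie_ids)))

print(toy_matrix.shape)            # (5, 13): max userId + 1 rows, max movieId + 1 columns
print(toy_matrix.count_nonzero())  # 4 stored ratings
print(toy_matrix[1, 12])           # 3.0, the rating user 1 gave movie 12
```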

We’ll create 3 sets of features using this sparse matrix:

  1. Features which represent global averages
  2. Features which represent the top five similar users
  3. Features which represent the top five similar movies

Let’s take a look at how to prepare each in more detail.


1. Features which represent the global averages

The three global averages we’ll employ are:

  1. The average ratings of all movies given by all users
  2. The average ratings of a particular movie given by all users
  3. The average ratings of all movies given by a particular user

train_averages = dict()
# get the global average of ratings in our train set.
train_global_average = train_sparse_matrix.sum()/train_sparse_matrix.count_nonzero()
train_averages['global'] = train_global_average

Output: {‘global’: 3.5199769425298757}


Next, let’s create a function which takes the sparse matrix as input and gives the average ratings of a movie given by all users, and the average rating of all movies given by a single user.

# get the user averages in dictionary (key: user_id/movie_id, value: avg rating)
def get_average_ratings(sparse_matrix, of_users):

    # average ratings of user/axes
    ax = 1 if of_users else 0  # 1 - User axes, 0 - Movie axes

    # ".A1" is for converting Column_Matrix to 1-D numpy array
    sum_of_ratings = sparse_matrix.sum(axis=ax).A1

    # Boolean matrix of ratings (whether a user rated that movie or not)
    is_rated = sparse_matrix != 0

    # number of ratings for each user OR movie
    no_of_ratings = is_rated.sum(axis=ax).A1

    # max_user and max_movie ids in sparse matrix
    u, m = sparse_matrix.shape

    # create a dictionary of users (or movies) and their average ratings,
    # skipping ids that have no ratings at all
    average_ratings = {i: sum_of_ratings[i] / no_of_ratings[i]
                       for i in range(u if of_users else m)
                       if no_of_ratings[i] != 0}

    # return that dictionary of average ratings
    return average_ratings

Average rating given by a user:
train_averages['user'] = get_average_ratings(train_sparse_matrix, of_users=True)

Average ratings given for a movie:
train_averages['movie'] = get_average_ratings(train_sparse_matrix, of_users=False)


2. Features which represent the top 5 similar users

In this set of features, we collect the ratings given to a particular movie by the top 5 users most similar to the target user. The similarity is calculated using cosine similarity between the users' rating vectors.

from sklearn.metrics.pairwise import cosine_similarity

# "user" and "movie" are the ids of the (user, movie) pair we are building features for
# compute the similarity of "user" to every user
user_sim = cosine_similarity(train_sparse_matrix[user], train_sparse_matrix).ravel()
# sort by similarity (descending) and ignore the user itself, which comes first
top_sim_users = user_sim.argsort()[::-1][1:]

# get the ratings of most similar users for this movie
top_ratings = train_sparse_matrix[top_sim_users, movie].toarray().ravel()

# keep the first 5 non-zero ratings; if there are fewer than 5,
# pad the list to length 5 with the movie's average rating
top_sim_users_ratings = list(top_ratings[top_ratings != 0][:5])
top_sim_users_ratings.extend([train_averages['movie'][movie]]*(5 - len(top_sim_users_ratings)))
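To see what this feature looks like on concrete numbers, here is a self-contained sketch on a tiny dense matrix. It computes cosine similarity directly with NumPy instead of scikit-learn, and the function name `top_similar_user_ratings` and the toy ratings are hypothetical. Unlike the snippet above, it filters the target user out explicitly rather than assuming it sorts first.

```python
import numpy as np


def top_similar_user_ratings(matrix, user, movie, movie_average, k=5):
    """Ratings of `movie` by the users most similar to `user`, padded to length k."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
    # Cosine similarity between the target user and every user
    user_sim = (matrix @ matrix[user]) / (norms * norms[user])
    # Sort users by similarity (descending) and drop the target user
    order = [u for u in np.argsort(user_sim)[::-1] if u != user]
    ratings = matrix[order, movie]
    top = list(ratings[ratings != 0][:k])
    # Pad with the movie's average rating when fewer than k similar users rated it
    top.extend([movie_average] * (k - len(top)))
    return top


# Toy matrix: 4 users x 3 movies, 0 = not rated
toy = np.array([
    [5, 4, 0],  # user 0 (target): has not rated movie 2
    [5, 4, 3],  # user 1: most similar to user 0
    [1, 0, 5],  # user 2: least similar
    [4, 5, 4],  # user 3: second most similar
], dtype=float)

movie_2_ratings = toy[:, 2][toy[:, 2] != 0]
features = top_similar_user_ratings(toy, user=0, movie=2, movie_average=movie_2_ratings.mean())
print(features)  # ratings 3, 4, 5 from users 1, 3, 2, then the movie average 4.0 twice
```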


3. Features which represent the top 5 similar movies

In this set of features, we collect the ratings that a particular user gave to the top 5 movies most similar to the target movie. This similarity is calculated using cosine similarity between the movies' rating vectors.

# compute the similar movies of the "movie"
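movie_sim = cosine_similarity(train_sparse_matrix[:,movie].T, train_sparse_matrix.T).ravel()
# sort by similarity (descending) and ignore the movie itself, which comes first
top_sim_movies = movie_sim.argsort()[::-1][1:]
# get the ratings of most similar movies rated by this user
top_ratings = train_sparse_matrix[user, top_sim_movies].toarray().ravel()
# keep the first 5 non-zero ratings; if there are fewer than 5,
# pad the list to length 5 with the user's average rating
top_sim_movies_ratings = list(top_ratings[top_ratings != 0][:5])
top_sim_movies_ratings.extend([train_averages['user'][user]]*(5 - len(top_sim_movies_ratings)))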


We append all these features for each movie-user pair and create a data frame. Figure 6 is a snapshot of our data frame.

Figure 6


Here’s a more detailed breakdown of its contents:

  • GAvg: Average of all ratings
  • Similar users rating of this movie:
    • sur1, sur2, sur3, sur4, sur5 (ratings from the top 5 similar users who rated that movie)
  • Similar movies rated by this user:
    • smr1, smr2, smr3, smr4, smr5 (ratings of the top 5 similar movies rated by this user)
  • UAvg: Average rating given by this user
  • MAvg: Average rating of this movie
  • rating: Rating of this movie by this user


Once we have these 13 features ready, we’ll add the Matrix Factorization output as the 14th feature. In Figure 7 you can see a snapshot of our data after adding the output from Step 1.

Figure 7


The last column, named mf_svd, is the additional column that contains the output of the model built in Step 1.


Step 3: Creating a final model for our movie recommendation system

To create our final model, let’s use XGBoost, an optimized distributed gradient boosting library.

import xgboost as xgb

# prepare train data
x_train = final_data.drop(['user', 'movie', 'rating'], axis=1)
y_train = final_data['rating']
# initialize XGBoost model (recent XGBoost versions take eval_metric here rather than in fit)
xgb_model = xgb.XGBRegressor(n_jobs=13, random_state=15, n_estimators=100, eval_metric='rmse')
# fit the model
xgb_model.fit(x_train, y_train)


Performance Metrics

Two common metrics for evaluating a rating-prediction recommender are Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). RMSE is based on the squared error, so it penalizes large mistakes more heavily, while MAPE is based on the absolute error relative to the actual rating. Lower values mean lower error rates, and thus better performance.

Both are easy to interpret. Let’s take a look at what each of them is:


Root Mean Squared Error (RMSE)

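RMSE is the square root of the average of squared errors and is given by the formula below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( r_i - \hat{r}_i \right)^2}$$

where r_i is the actual rating,
\hat{r}_i is the predicted rating and
N is the total number of predictions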


Mean Absolute Percentage Error (MAPE)

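MAPE measures the error in percentage terms. It is given by the formula below:

$$\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{r_i - \hat{r}_i}{r_i} \right|$$

where r_i is the actual rating,
\hat{r}_i is the predicted rating and
N is the total number of predictions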
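As a quick worked example, both metrics can be computed by hand on four made-up ratings. The helper names `rmse` and `mape` and the numbers are for illustration only.

```python
import numpy as np


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((actual - predicted) ** 2))


def mape(actual, predicted):
    """Mean absolute error relative to the actual value, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100


actual = [4.0, 2.0, 5.0, 1.0]
predicted = [3.0, 2.0, 4.0, 2.0]

# Errors are 1, 0, 1, -1, so the mean squared error is 3/4
print(rmse(actual, predicted))  # about 0.866
# Relative errors are 0.25, 0, 0.2, 1.0, so their mean is 0.3625
print(mape(actual, predicted))  # about 36.25
```

Note how the single miss on the 1-star movie dominates MAPE: an error of one star is a 100% error when the actual rating is 1, but only 20% when the actual rating is 5.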


# dictionary for storing test results
test_results = dict()
# from the trained model, get the predictions on the test set
y_test_pred = xgb_model.predict(x_test)
# get the rmse and mape of the test data
rmse_test = np.sqrt(np.mean((y_test.values - y_test_pred)**2))
mape_test = np.mean(np.abs((y_test.values - y_test_pred)/y_test.values)) * 100
# store the results in the test_results dictionary
test_results = {'rmse': rmse_test, 'mape': mape_test, 'predictions': y_test_pred}

Test data results


Our model achieved an RMSE of 0.67 and a MAPE of 19.86% on the unseen test data, which makes it a usable model. As a rule of thumb, an RMSE below 2 is considered good, and a MAPE below 25% is considered excellent. That said, the model could be further improved by adding features such as top picks by location or genre. We could also compare the efficacy of different models in real time through A/B testing.



Summary

In this article, we learned the importance of recommender systems, the types of recommender systems being implemented, and how to use matrix factorization to enhance a system. We then built a movie recommendation system that considers user-user similarity, movie-movie similarity, global averages, and matrix factorization. These concepts can be applied to any other user-item interaction system.

Thanks for reading! If you would like to experiment with this dataset yourself, you can download the data from GroupLens and see my code on GitHub.


The Author
Ramya Vidiyala

Ramya is a data nerd and a passionate writer who loves exploring and finding meaningful insights from data. She writes articles on her Medium blog about ML and data science where she shares her experiences to help readers understand concepts and solve problems. Reach out to her on Twitter (@ramya_vidiyala) to start a conversation!

