14 Best Movie Datasets for Machine Learning Projects

Article by Rei Morikawa | July 18, 2019

We at Lionbridge have compiled a list of 14 movie datasets. The datasets on this list are useful from a statistical learning perspective because you can use them to master basic machine learning concepts instead of relying on dry, esoteric data.

Many of the datasets on this list contain data points such as the cast and crew members, script, run time, and reviews. You could use these movie datasets for machine learning projects in natural language processing, sentiment analysis, and more.


Movie Datasets for Machine Learning

IMDB Film Reviews Dataset: This dataset contains 50,000 movie reviews, and is already split equally into training and test sets for your machine learning model. It also provides unannotated documents for unsupervised learning algorithms.
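As a minimal sketch of how a pre-split review dataset like this might be read from disk, the snippet below assumes one text file per review, stored under <root>/<split>/pos and <root>/<split>/neg. That directory layout and the names load_reviews and root are assumptions made for illustration; check the dataset's own documentation before relying on them.

```python
from pathlib import Path

# Assumed layout (not specified in this article): <root>/<split>/{pos,neg}/*.txt
LABELS = {"pos": 1, "neg": 0}


def load_reviews(root, split):
    """Return a list of (text, label) pairs for one split, e.g. 'train' or 'test'."""
    examples = []
    for folder, label in LABELS.items():
        directory = Path(root) / split / folder
        # sorted() keeps the file order deterministic across runs
        for path in sorted(directory.glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples
```

Calling load_reviews(root, "train") and load_reviews(root, "test") separately keeps the provided split intact, so the test reviews never leak into training.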

IMDB reviews: This is a dataset of 5,000 movie reviews for sentiment analysis tasks in CSV format.

OMDb API: The OMDb API is a web service to obtain movie information. It is a crowdsourced movie database that is kept up-to-date with the most current movies.

MovieLens 20M Dataset: This dataset includes 20 million ratings and 465,000 tag applications, applied to 27,000 movies by 138,000 users.
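A common first exercise with a ratings dataset is to average the ratings per movie. The sketch below assumes a CSV file with the columns userId, movieId, rating, and timestamp; that column layout and the function name mean_rating_per_movie are assumptions for illustration, and the sample rows are made up rather than taken from MovieLens.

```python
import csv
import io
from collections import defaultdict


def mean_rating_per_movie(csv_file):
    """Average the ratings for each movie ID in a ratings CSV (assumed columns)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        movie_id = int(row["movieId"])
        totals[movie_id] += float(row["rating"])
        counts[movie_id] += 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}


# Tiny made-up sample in the assumed column layout, not real MovieLens data
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,0\n"
    "2,10,3.0,0\n"
    "1,20,5.0,0\n"
)
print(mean_rating_per_movie(sample))
```

Reading row by row with the standard csv module avoids loading all 20 million ratings into memory at once; a DataFrame library would be a reasonable alternative if memory is not a concern.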

Cornell Film Review Data: Movie review documents labeled with their overall sentiment polarity (positive or negative) or subjective rating (e.g., “two and a half stars”), and sentences labeled with their subjectivity status (subjective or objective) or polarity.

Film Dataset from UCI: This dataset contains a list of over 10,000 movies, including many historical, minor, and cult films, with information on actors, cast, directors, producers, and studios.

Cornell Movie Dialogs Corpus: This corpus contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
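Dialog corpora are often distributed as delimited text files, one utterance per line. The sketch below assumes each line holds five fields (line ID, character ID, movie ID, character name, utterance) separated by " +++$+++ ". The delimiter, the field order, and the name parse_line are assumptions for illustration, and the sample line is made up; confirm the format against the corpus README.

```python
SEPARATOR = " +++$+++ "  # assumed field delimiter
FIELDS = ("line_id", "character_id", "movie_id", "character_name", "text")


def parse_line(raw):
    """Split one utterance line into a dict keyed by the assumed field names."""
    # Limiting the number of splits keeps the utterance intact even if the
    # delimiter happens to appear inside it.
    parts = raw.rstrip("\n").split(SEPARATOR, len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Made-up example line in the assumed format
example = "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.\n"
print(parse_line(example)["text"])
```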

Full MovieLens Dataset on Kaggle: Metadata for 45,000 movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

Linguistic Data of 32k Film Subtitles with IMDb Meta-Data: Meta-data for 32,000+ films. The meta-data are matched to word-count categories from subtitle files.

French National Cinema Center Datasets: Datasets related to French films, including box office data.

Movie Industry: This repository includes 6,820 movies (220 movies per year, 1986 to 2016). Each movie has the following data points: budget, company, country, director, genre, gross revenue, rating, release date, runtime, IMDb user rating, main actor.
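The stated total follows from the per-year figure, since 1986 through 2016 inclusive spans 31 years. A quick check, with variable names chosen here for illustration:

```python
MOVIES_PER_YEAR = 220
years = range(1986, 2017)  # 1986 through 2016 inclusive

total = MOVIES_PER_YEAR * len(years)
print(len(years), total)  # prints "31 6820"
```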

Cats in Films: This dataset tracks all cats featured in movies. You can search the movies by director, producer, and release date.

Movie Body Counts: This dataset tallies the number of on-screen kills, deaths, and dead bodies in action, sci-fi and war movies.

Indian Movie Theaters: This dataset contains screen sizes, theater capacities, average ticket prices, and location coordinates for each movie theater.

We hope you found the movie datasets on this list helpful in your project. If you’re still looking for more data, be sure to check out our datasets library.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, Rei also studied abroad in the US. Rei is a huge people person who is passionate about long-distance running, traveling, and discovering new music on Spotify.

