Sentiment analysis models require large, specialized datasets to learn effectively. Since only specific kinds of data will do, one of the most difficult parts of the training process can be finding enough relevant data.
To help with this, we’ve compiled a list of datasets that covers a wide spectrum of sentiment analysis use cases. From sets of movie reviews to multilingual sentiment lexicons, the following list showcases the diversity of these datasets and hints at some of the many ways that you can improve your algorithm.
That said, don’t forget that the datasets below were each built with specific algorithms in mind. While we’ve tried to select datasets with a broad scope, they were still assembled to support research that could differ significantly from your project. As such, you should evaluate whether the data needs new labels or an extra round of cleaning to fit your particular training goals.
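One way to make that evaluation concrete is a quick audit before training. The sketch below is a minimal example, assuming the dataset has already been loaded as (text, label) pairs. The `audit_dataset` helper and the sample rows are hypothetical and are not taken from any dataset on this list. It reports how labels are distributed, how many texts are empty, and how many are exact duplicates after lowercasing and whitespace normalization.

```python
from collections import Counter


def audit_dataset(rows):
    """Summarize label balance and obvious cleaning problems.

    `rows` is an iterable of (text, label) pairs. This helper is a
    hypothetical illustration, not part of any dataset's tooling.
    """
    label_counts = Counter()
    seen = set()
    empty = 0
    duplicates = 0
    for text, label in rows:
        label_counts[label] += 1
        # Lowercase and collapse whitespace so trivial variants match.
        normalized = " ".join(text.lower().split())
        if not normalized:
            empty += 1
        elif normalized in seen:
            duplicates += 1
        else:
            seen.add(normalized)
    return {
        "labels": dict(label_counts),
        "empty": empty,
        "duplicates": duplicates,
    }


# Made-up sample rows, for illustration only.
sample = [
    ("Great film, loved it", "positive"),
    ("great film,  loved it", "positive"),
    ("Terrible service", "negative"),
    ("   ", "neutral"),
]
report = audit_dataset(sample)
print(report)
```

A heavily skewed label count or a large share of duplicates is usually a sign that the data needs relabeling or another cleaning pass before it suits your project.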
Sentiment Lexicons for 81 Languages: From Afrikaans to Yiddish, this dataset groups words from 81 different languages into positive and negative sentiment categories.
Lexicoder Sentiment Dictionary: This dataset contains words in four different positive and negative sentiment groups, with between 1,500 and 3,000 entries in each subset.
Dictionaries for movies and finance: This is a library of domain-specific dictionaries showing the polarized sentiment of words as they are used in either movie reviews or financial documents.
Multi-Domain Sentiment Dataset: Containing hundreds of thousands of product reviews, this dataset has positive and negative files for a range of different Amazon product types.
Amazon product data: UC San Diego professor Julian McAuley has made ‘small’ subsets of a 142.8 million Amazon review dataset available to download here.
Cornell movie review data: This page provides links to a variety of Cornell’s movie review data for use in sentiment analysis, organized into sentiment polarity, sentiment scale and subjectivity sections.
Stanford Sentiment Treebank: Stanford’s dataset contains just over 10,000 pieces of data from HTML files of Rotten Tomatoes reviews.
Bag of Words Meets Bags of Popcorn: With 50,000 labeled IMDB movie reviews, this dataset would be useful for sentiment analysis use cases involving binary classification.
IMDB Movie Reviews Dataset: Also containing 50,000 reviews, this dataset is split equally into 25,000 training reviews and 25,000 test reviews. It provides unannotated data as well.
OpinRank Dataset: This dataset contains a combined 300,000 full reviews of cars and hotels from the TripAdvisor and Edmunds websites.
Restaurant Reviews Dataset: A collection of 52,000 reviews of restaurants in the New York area, complete with ratings, is available here.
Sentiment140: With emoticons removed and each record formatted into six fields, this collection of 1.6 million tweets is particularly useful for brand management and polling purposes.
Paper Reviews Data Set: Created to predict the opinion of academic paper reviews, this dataset is a collection of Spanish and English reviews from a conference on computing.
Twitter Airline Sentiment: This dataset contains tweets about various airlines that were classified as positive, negative, or neutral.
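To show how the word-level lexicons at the top of this list relate to the labeled review and tweet datasets, here is a minimal sketch of a lexicon-based baseline. It assumes a lexicon is simply two sets of words, one positive and one negative. The `POSITIVE` and `NEGATIVE` sets and both function names are made up for illustration and do not come from any of the lexicons above. The baseline counts matches on each side and maps the difference to a positive, negative, or neutral label.

```python
import re

# Tiny made-up lexicon for illustration; real lexicons are far larger.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "delayed"}


def lexicon_score(text):
    """Return (positive word count) minus (negative word count)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos - neg


def lexicon_label(text):
    """Map the score to a three-way sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_label("Great crew and excellent food"))
print(lexicon_label("My flight was delayed and the seat was terrible"))
```

A baseline like this ignores negation and context, so it is best treated as a sanity check against a labeled dataset rather than as a finished model.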
Finally, just for fun:
Panic! at the Dataset: This dataset is composed entirely of songs by Panic! at the Disco, labeled for sentiment analysis.
Still can’t find what you need? Lionbridge provides custom datasets for sentiment analysis in over 300 languages. Whether you need hundreds or millions of data points, our 500,000+ certified language specialists can ensure that your algorithm has a solid ground truth. Contact us now to see how we can make your model great.