Top 10 Reddit Datasets for Machine Learning

Article by Limarc Ambalina | October 09, 2019

Previously, we’ve posted other social media data compilations. Today, we will focus on the world’s most popular forum site, Reddit. This guide will introduce the top 10 Reddit datasets for machine learning. 

Known as “the front page of the internet,” Reddit is a forum and social media site where users can post about virtually any topic. Unlike on Facebook, Twitter, or Instagram, the majority of Reddit users remain anonymous. Moderators strictly curate the subforums, known as subreddits, but anonymity still lets people say what they want in whatever manner they wish. As a result, Reddit comments and posts are well suited to training and testing natural language processing (NLP) models, such as content moderation models and sentiment classifiers.


Best Reddit Datasets for Machine Learning 

Warning: Some of the datasets below were compiled specifically for the training of content moderation models. Therefore, the data may include explicit content. 

 

Reddit Comments Datasets


1. Cryptocurrency Reddit Comments Dataset – This dataset contains comments from the subreddit r/cryptocurrency. The data consists of comments posted over five months from November 2017 to March 2018. 

2. Donald Trump Comments on Reddit – A simple dataset containing thousands of comments crawled from Reddit that mention Donald Trump.

3. Reddit Comment Score Prediction – This dataset was built to help create a model that can predict whether a Reddit comment will be upvoted or downvoted. The dataset includes 4 million Reddit comments: 2 million poor-performing (downvoted) and 2 million high-performing (upvoted).
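To make the comment score prediction task concrete, here is a minimal sketch of a binary text classifier (a bag-of-words Naive Bayes model with add-one smoothing) written with only the Python standard library. The toy comments, the "up"/"down" label names, and all function names are made up for illustration; they are not taken from the dataset, and a real model trained on millions of comments would need a proper tokenizer and evaluation split.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines would do much more."""
    return text.lower().split()


def train(samples):
    """Count words per label. samples is a list of (comment_text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing avoids log(0)
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not drawn from the actual dataset.
toy_comments = [
    ("great explanation thanks for sharing", "up"),
    ("this is really helpful and clear", "up"),
    ("you are an idiot", "down"),
    ("worst take ever just stop", "down"),
]

model = train(toy_comments)
prediction = predict(model, "thanks this is helpful")
```

With the balanced upvoted and downvoted classes described above, the same structure applies; only the data loading and tokenization would need to change.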

 

Reddit News Datasets

4. Daily News for Stock Market Prediction – As the title suggests, this dataset was originally made to create models that could predict stock market fluctuations. The data consists of news crawled from r/worldnews from June 2008 to July 2016, as well as Dow Jones Industrial Average stock data. 

5. World News on Reddit – Taken from the r/worldnews subreddit, this dataset contains info about all of the news posted on this subreddit dating back to 2008. The dataset includes the following info: date created, upvotes and downvotes, title, author, and whether or not the news contains mature content.

 

Other Data from Reddit

6. Reddit’s Top 1000 – This dataset contains the top 1,000 posts of all time, ranked by upvotes, from each of 18 subreddits. For each post, the CSV files contain the title of the post and the username of the poster. The number of upvotes and downvotes, the subreddit name, the URL, and other metadata are also included.

7. Reddit Usernames – A simple dataset containing a CSV file of 26 million usernames of Reddit users. Furthermore, the dataset includes the total number of comments each user has made.

8. SARC: Self-Annotated Reddit Corpus for Sarcasm – This dataset consists of over 1.3 million sarcastic comments and posts crawled from Reddit. As the name suggests, the sarcasm labels are self-annotated: they come from the comment authors themselves rather than from outside annotators. The username of the poster, the topic, and the surrounding context are also included with each statement.

9. Science and Tech Acronyms from Reddit – This dataset contains over 140,000 acronyms found on subreddits about science, biology, technology, and futurology. The data is in the form of a CSV file which includes the comment ID, time, username, subreddit name, and the acronym mentioned. 

10. Things on Reddit (products) – This product dataset collects the top 100 Amazon products from every subreddit in which an Amazon product was posted between 2015 and 2017. Each CSV file in the dataset includes the name of the product, its category, and the URL of the product. The total number of mentions on Reddit and the total number of subreddit mentions are also included.
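Several of the datasets above ship as CSV files, so a first exploration step often looks like the sketch below, which finds the most-upvoted post per subreddit. The column names (`subreddit`, `title`, `ups`) and the sample rows are assumptions made for illustration; check the header row of the actual files before adapting it.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """subreddit,title,ups
aww,Sample post A,120
aww,Sample post B,340
science,Sample post C,95
"""


def top_post_per_subreddit(csv_file):
    """Map each subreddit to the (title, ups) pair of its most-upvoted post."""
    best = {}
    for row in csv.DictReader(csv_file):
        ups = int(row["ups"])
        current = best.get(row["subreddit"])
        if current is None or ups > current[1]:
            best[row["subreddit"]] = (row["title"], ups)
    return best


result = top_post_per_subreddit(io.StringIO(SAMPLE_CSV))
```

To run it against a real file, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, in place of the `io.StringIO` object.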


The datasets above could be used to help train sentiment analysis models, text classifiers, predictive models, and other NLP algorithms. For more social media datasets, please view our related resources below. If you are looking to build custom datasets, get in touch with our sales team to learn how Lionbridge can help improve your AI models. 

The Author
Limarc Ambalina

Limarc writes content for Lionbridge’s website as part of the marketing team. Born and raised in Canada, Limarc’s love of Japanese pop culture brought him to Japan in 2016 and living in Japan has been his dream come true. Apart from Lionbridge content, you can catch Limarc online writing about anime, video games, and other nerd culture.
