12 Best Social Media Datasets for Machine Learning

Article by Rei Morikawa | June 03, 2019

Social media and social networking sites are online platforms where people can connect with their real-life family, friends, and colleagues, and build new relationships with others who share similar interests. In 2019, the most popular English-language social media sites include Twitter, Facebook, and Reddit.

 

How is social media used in machine learning?

Social media data is among the largest and most dynamic sources of data about human behavior. It gives social scientists and business experts new opportunities to understand people, groups, and society. Sentiment analysis is a common way to apply machine learning to social media. For example, when a new product is released, customers might tweet about it or leave a review on Amazon. Businesses can use machine learning to gauge the general public’s reaction to their own or a competitor’s new product or design.
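
To make the idea of sentiment analysis concrete, here is a minimal sketch of a word-counting approach. The word lists, function names, and sample posts are all made up for illustration; a real project would more likely train a model on a labeled dataset than rely on a hand-written lexicon like this one.

```python
import re

# Tiny illustrative word lists (assumptions, not a published lexicon).
POSITIVE = {"love", "great", "good", "amazing", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "broken"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text: str) -> str:
    """Map the raw score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for tweets or product reviews.
posts = [
    "I love the new phone, the camera is amazing",
    "Terrible battery life, I hate it",
    "It arrived on Tuesday",
]
labels = [label(p) for p in posts]
print(labels)
```

Counting words ignores negation and sarcasm, which is one reason labeled datasets like the ones listed in this post are useful for training better models.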

 

Search and Download Social Media Datasets 

One way to gather social media data is to use a web scraping tool that extracts data from social media channels such as Facebook, Twitter, LinkedIn, and Instagram. Please note that on some social networking sites, scraping or reusing data from the platform violates the terms of service. Read each site’s terms of service carefully to avoid legal issues.
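
As a small complement to reading the terms of service, a scraper can also check a site's robots.txt rules before fetching a page. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt; the rules, URLs, and the `my-research-bot` user agent are assumptions for illustration. Passing this check does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "my-research-bot") -> bool:
    """Return True if the parsed robots.txt rules allow fetching the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.com/public/page"))
print(may_fetch("https://example.com/private/data"))
```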

Another good place to start is the official API documentation for social media sites like Facebook and Twitter. It explains how to build a query and how to search for posts containing exact words. You can write the client in your preferred language (JavaScript, PHP, Perl, Python, etc.) and still take advantage of open-source software (OSS).
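
As a rough illustration of building a query for an exact phrase, the sketch below assembles a search URL with Python's standard library. The endpoint and the parameter names (`q`, `lang`, `limit`) are hypothetical and do not belong to any real platform; the official API documentation for the site you are using defines the actual ones, along with authentication requirements.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/search"


def build_search_url(phrase: str, lang: str = "en", limit: int = 100) -> str:
    """Build a search URL that asks for posts containing an exact phrase."""
    params = {
        "q": f'"{phrase}"',  # assumed convention: quotes request an exact match
        "lang": lang,        # hypothetical parameter
        "limit": limit,      # hypothetical parameter
    }
    # urlencode percent-encodes the quotes and spaces for safe use in a URL.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("machine learning")
print(url)

# Decoding the query string recovers the original quoted phrase.
print(parse_qs(urlsplit(url).query)["q"])
```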

 

For this blog post, we’ve combed the web and put together the ultimate cheat sheet for social media datasets for machine learning.

 

Social Media Dataset Finders

  • Social Computing Data Repository: This collection provides a large variety of datasets from multiple sources such as Twitter and YouTube, in varying sizes.
  • Stanford Large Network Dataset Collection (SNAP): Like the Social Computing Data Repository, SNAP offers a wide range of datasets of varying sizes from sources such as Twitter and Reddit, so you can find the one that best fits your project needs. SNAP is also a general-purpose library for analyzing large networks, including the SNAP datasets.
  • Network Repository: This collection has many social networks, web graphs, bio and brain networks, etc. It also offers interactive visual analytics tools to compare and explore the various social networks.
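
Many network datasets in collections like these are distributed as plain-text edge lists. The sketch below assumes one edge per line, whitespace-separated node IDs, and comment lines starting with `#`; check the README of the specific dataset, since formats vary. The sample data and function names are made up for illustration.

```python
from collections import Counter
from io import StringIO

# Inline sample standing in for a downloaded edge-list file (made-up data).
SAMPLE = """\
# Directed graph: example
# FromNodeId\tToNodeId
0\t1
0\t2
1\t2
3\t0
"""


def read_edges(handle):
    """Parse (source, target) integer pairs, skipping comments and blanks."""
    edges = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def out_degrees(edges):
    """Count how many outgoing edges each source node has."""
    return Counter(src for src, _ in edges)


edges = read_edges(StringIO(SAMPLE))
degrees = out_degrees(edges)
print(degrees.most_common(1))
```

For a real file, replace `StringIO(SAMPLE)` with an opened file handle.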

 

Twitter Datasets

  • 476 Million Twitter Tweets: This dataset is estimated to comprise about 20-30% of all public tweets posted over the 7-month period between June 1 and December 31, 2009.
  • Sentiment140: With emoticons removed and six fields per record, this collection of 1.6 million tweets is particularly useful for brand management and polling purposes.
  • Customer Support on Twitter: This dataset on Kaggle includes over 3 million tweets and replies from the biggest brands on Twitter.
  • Cheng-Caverlee-Lee September 2009~January 2010 Twitter Scrape: This dataset is a collection of scraped public Twitter updates, used in an academic project to study geolocation data related to tweeting.
  • Followthehashtag: A Twitter analysis tool with a section where complete, large datasets are regularly uploaded in a ready-to-use format.
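
Tweet datasets such as Sentiment140 are typically distributed as CSV files. The sketch below assumes a headerless file with six columns (polarity, id, date, query, user, text) and polarity codes of 0, 2, and 4 for negative, neutral, and positive. The two sample rows are made up; verify the column order, codes, and file encoding against the documentation of the file you download.

```python
import csv
from collections import Counter
from io import StringIO

# Assumed column order for a headerless, six-field CSV.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]

# Assumed polarity codes.
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}

# Two made-up rows in the assumed layout.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my flight got cancelled"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","loving the new update"\n'
)


def read_tweets(handle):
    """Return a list of {'label', 'text'} dicts from a six-field CSV."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    return [
        {"label": POLARITY_NAMES.get(row["polarity"], "unknown"), "text": row["text"]}
        for row in reader
    ]


tweets = read_tweets(StringIO(SAMPLE))
counts = Counter(t["label"] for t in tweets)
print(counts)
```

For a real file, pass an opened file handle (with `newline=""` and the appropriate encoding) instead of `StringIO(SAMPLE)`.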

 

Reddit Datasets

  • 1.7 Billion Reddit Comments: This dataset contains 1.7 billion JSON objects, each complete with the comment, score, author, subreddit, position in the comment tree, and other fields that are available through Reddit’s API.
  • May 2015 Reddit Comments: This dataset is a small portion of the enormous 1.7 billion Reddit comments dataset. It contains the comments posted in May 2015 and can be used with scripts for natural language processing (NLP).
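
Comment dumps made of JSON objects are often stored one object per line. The sketch below assumes that layout and the field names `author`, `subreddit`, `score`, `parent_id`, and `body`; the sample records are made up, and the actual files may use different names or a different container format (the May 2015 subset, for example, may be packaged differently), so check the dataset description first.

```python
import json
from collections import Counter
from io import StringIO

# Made-up records standing in for a line-delimited JSON comment dump.
SAMPLE = """\
{"author": "user_a", "subreddit": "python", "score": 12, "parent_id": "t3_abc", "body": "Great write-up."}
{"author": "user_b", "subreddit": "python", "score": -2, "parent_id": "t1_def", "body": "I disagree."}
{"author": "user_c", "subreddit": "datasets", "score": 5, "parent_id": "t3_ghi", "body": "Thanks for sharing."}
"""


def iter_comments(handle):
    """Yield one parsed comment per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def score_by_subreddit(comments):
    """Sum comment scores per subreddit."""
    totals = Counter()
    for comment in comments:
        totals[comment["subreddit"]] += comment["score"]
    return totals


totals = score_by_subreddit(iter_comments(StringIO(SAMPLE)))
print(dict(totals))
```

Reading line by line matters here because a dump with billions of objects will not fit in memory.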

 

Other Social Media Datasets

  • YouTube-8M Dataset: This is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.
  • One Hundred Million Creative Commons Flickr Images for Research: One of the largest public multimedia datasets ever released, this dataset includes 99.3 million images and 0.7 million videos, all from Flickr and all under Creative Commons licensing. This dataset would be useful for computer vision projects, with the added benefit that many of the images are geotagged, enabling some interesting explorations of the intersection of geographical and image features. Request the dataset.

 

Still can’t find the data you need? Lionbridge AI provides custom social media datasets in 300 languages for your specific machine learning project needs. 

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Rei was born and raised in Tokyo and also studied abroad in the US. Rei is a huge people person who is passionate about long-distance running, traveling, and discovering new music on Spotify.
