16 Strange Datasets for Machine Learning

Article by Meiryum Ali | June 07, 2019

Data scientists need AI training data to build effective machine learning models. The goal of training is to build a model that understands the patterns and relationships behind the data. We at Lionbridge AI have already compiled lists of industry-standard datasets, such as our 50 best machine learning datasets and our datasets for natural language processing.

But what about the weird, obscure datasets that didn’t make the cut? We’ve now put together a list of amusing, unusual, and just straight-up bizarre datasets.

Strange Datasets for Machine Learning

Length of Chopsticks: Researchers set out to determine the optimal length for chopsticks.

Stacking Cups: This data is available from the World Sport Stacking Association, which allows you to search through different divisions, age groups, competitors, and even state and country records.

Price of Weed: A repository of historical marijuana prices, which shows significant price variation between states.

War History: Nearly 200 years of international threats and conflicts, suitable for modeling or prediction. Includes actions taken, levels of hostility, fatalities, and outcomes.

UFO Reports: 80,000 historic reports of UFO sightings, collected over almost a century.

Wine Quality: Two datasets covering red and white vinho verde wine samples from the north of Portugal.

Mushrooms: Mushrooms described in terms of their physical characteristics, each classified as poisonous or edible.

Million Songs: The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

Dog Names in Zurich: Check out zany pet names with a dataset containing the names of all registered dogs in Zurich.

Cats in Movies: A dataset that tracks all cats featured in movies. You can search the movies by director, producer, and release date.

Comic Book Walls in Brussels: Lists the location of all the painted comic-book walls in Brussels, as well as their characters and the cartoonists who created them.

Shopping Trolleys in Rivers: An annual survey of the number of abandoned supermarket trolleys in Bristol rivers, starting in 2005.

Movie Body Counts: This dataset tallies the number of on-screen kills, deaths, and dead bodies in action, sci-fi, and war movies.

100 Burritos: A 10-dimensional system for rating the burritos in San Diego. This dataset rates burritos for their volume, tortilla quality, salsa quality and variety, wrap integrity, and more.

Indian Movie Theaters: This dataset contains screen sizes, theater capacities, average ticket prices, and location coordinates for movie theaters in India.

Rick and Morty: A comprehensive Rick and Morty API that includes the characters, locations, and episodes.
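Unlike most entries on this list, the Rick and Morty data comes as an API rather than a file download, so a short script is the natural way to explore it. Below is a minimal Python sketch of how that might look. The base URL, the "character" resource name, and the "species" field are assumptions about the public API and not something stated in this article, and the helper names (build_url, fetch_page, count_by_field) are hypothetical ones made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public Rick and Morty API (not given in the article).
BASE_URL = "https://rickandmortyapi.com/api"


def build_url(resource, **filters):
    """Build a request URL for a resource, with optional query filters."""
    query = urlencode(filters)
    return f"{BASE_URL}/{resource}" + (f"?{query}" if query else "")


def fetch_page(resource, **filters):
    """Download one page of results and decode the JSON body (needs network)."""
    with urlopen(build_url(resource, **filters), timeout=10) as response:
        return json.load(response)


def count_by_field(records, field):
    """Count how often each value of `field` appears in a list of dicts."""
    counts = {}
    for record in records:
        key = record.get(field, "unknown")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Assuming the response wraps its records in a "results" list, a call such as count_by_field(fetch_page("character")["results"], "species") would tally one page of characters by species. The counting helper works on any list of dictionaries, so it can be tried offline on hand-written records first.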
