15 Best Audio and Music Datasets for Machine Learning Projects

Article by Rei Morikawa | July 18, 2019

At Lionbridge, we have deep experience helping the world’s largest companies teach applications to understand audio. From virtual assistants to in-car navigation, all sound-activated machine learning systems rely on large sets of audio data. This time, we at Lionbridge combed the web and compiled this ultimate cheat sheet for public audio and music datasets for machine learning.


Audio Speech Datasets for Machine Learning

AudioSet: AudioSet is an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
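
For a rough sense of scale, the clip count and clip length quoted above imply a total duration that is easy to work out. The sketch below assumes every clip is a full 10 seconds long, so the result is best read as an upper-bound estimate; the function name is ours, not part of any AudioSet tooling.

```python
def total_hours(num_clips: int, clip_seconds: float) -> float:
    """Return the combined duration of num_clips clips, in hours."""
    return num_clips * clip_seconds / 3600


# Figures quoted above for AudioSet; assumes each clip lasts a full 10 seconds.
audioset_hours = total_hours(2_084_320, 10)
print(f"{audioset_hours:,.0f} hours")  # prints "5,790 hours"
```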

Common Voice: From Mozilla, Common Voice is an open-source multi-language speech dataset that is partly created by online volunteer contributors. It was built to help train speech-enabled technologies.  

LibriSpeech: LibriSpeech is a carefully segmented and aligned corpus of approximately 1,000 hours of read English speech sampled at 16 kHz, derived from audiobooks.
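
When planning storage or memory for a corpus of this size, it can help to estimate how large the audio would be once decoded to raw samples. The sketch below is a back-of-the-envelope calculation only: the 16-bit, single-channel defaults are our assumptions, not figures from the corpus documentation, and the files you download may be compressed and therefore smaller.

```python
def uncompressed_bytes(hours: float, sample_rate_hz: int,
                       bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Estimate the raw PCM size of a corpus of the given duration."""
    samples = int(hours * 3600 * sample_rate_hz)
    return samples * bytes_per_sample * channels


# Roughly 1,000 hours at 16 kHz; 16-bit mono is an assumption.
size = uncompressed_bytes(1000, 16_000)
print(f"{size / 1e9:.1f} GB")  # prints "115.2 GB"
```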

Spoken Digit Dataset: This dataset was created to solve the task of identifying spoken digits in audio samples.

Flickr Audio Caption Corpus: This corpus includes 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery.

Spoken Wikipedia Corpora: This is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. It contains hundreds of hours of aligned audio, and its annotations can be mapped back to the original HTML.

VoxCeleb: VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

VoxForge: This dataset includes open speech data available in 17 languages, including English, Chinese, Russian, and French. 

Freesound: This is a platform for the collaborative creation of audio collections labeled by humans and based on Freesound content.

TED-LIUM: The TED-LIUM corpus consists of approximately 118 hours of speech from various English-language TED Talks. The audio files come with accompanying transcriptions.


Acoustic Datasets for Machine Learning

Mivia Audio Events Dataset: This dataset includes 6,000 audio events relevant to surveillance applications, namely glass breaking, gunshots, and screams. The events are divided into a training set of 4,200 events and a test set of 1,800 events.
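
The figures above work out to a 70/30 train/test split. If you want to apply the same ratio to your own event recordings, a minimal sketch might look like the following. The helper name, the fixed seed, and the placeholder integer IDs are our assumptions; the dataset's own predefined split should be preferred when comparing against published results.

```python
import random


def split_events(events, train_fraction, seed=0):
    """Shuffle a copy of events and split it into (train, test) lists."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_fraction = 4_200 / 6_000  # 0.7, from the counts quoted above
event_ids = range(6_000)        # placeholder IDs standing in for real clips
train, test = split_events(event_ids, train_fraction)
print(len(train), len(test))    # prints "4200 1800"
```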

DCASE 2017 Challenge Data: These are open datasets used and collected for the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge.


Music Datasets for Machine Learning

Million Song Dataset: This is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

Ballroom: This music dataset includes data on ballroom dancing, such as online lessons. It provides characteristic excerpts of various dance styles in RealAudio format, along with their tempi.

Free Music Archive (FMA): This is a dataset for music analysis that consists of full-length and HQ audio, pre-computed features, and track and user-level metadata.


We hope this list of the best audio and music datasets proves useful to you in your own projects. If you missed our previous articles, we’d recommend the 50 Best Datasets for Machine Learning, 12 Best Social Media Datasets, and more.

Still can’t find what you need? Lionbridge AI provides custom voice and sound data in 300 languages for your specific machine learning project needs.


