8 Best Voice and Sound Datasets for Machine Learning

Article by Rei Morikawa | November 19, 2019

Voice- and sound-activated machine learning systems, such as Amazon Alexa or in-car navigation apps, rely on large volumes of audio recordings. But voice and sound data is often harder to find than text and image datasets. Audio datasets tend to take more preparation to collect, and they need more data cleansing and annotation before they are useful for natural language processing models.

In this article, we’ll introduce eight voice and sound datasets for your natural language processing projects.


Natural Language Datasets

Common Voice: Common Voice is an open source, multilingual dataset of voices that anyone can use to train speech-enabled applications. The dataset currently has 2,454 hours of recorded speech in 29 languages. It also includes demographic metadata for training speech recognition engines, including age, gender, and accent. 

VoxCeleb: VoxCeleb is a dataset of short clips of human speech extracted from interview videos uploaded to YouTube. The dataset contains 2,000 hours of speech from 7,000 speakers.

2000 HUB5 English: This dataset includes transcripts from 40 telephone conversations in English, with their corresponding speech files. 

CALLHOME American English Speech: This dataset includes 120 unscripted, 30-minute telephone conversations in English. Most of the calls are between family members or close friends. 


Ambient Noise Datasets

AudioSet Ontology by Freesound Datasets: The AudioSet Ontology is a collection of 297,144 audio samples categorized into over 600 sound classes. It includes a variety of everyday sounds, including music, human, and animal sounds. 

Urban Sound Dataset: This dataset contains 8,732 labeled sound recordings that are annotated with the start and end times. The classes include background sounds such as air conditioners, car horns, and idling engines.
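To illustrate what start and end time annotations make possible, here is a minimal Python sketch that cuts an annotated segment out of a longer recording. The function name, the 8 kHz sample rate, and the example annotation are hypothetical and are not taken from the dataset's own tooling.

```python
def slice_by_time(samples, sample_rate, start_s, end_s):
    """Return the samples that fall between start_s and end_s (in seconds)."""
    if not 0 <= start_s <= end_s:
        raise ValueError("expected 0 <= start_s <= end_s")
    # Convert the annotated times into sample indices.
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    return samples[start:end]


# Hypothetical example: a 4-second recording at 8 kHz, annotated from 1.0 s to 2.5 s.
recording = [0] * (4 * 8000)
segment = slice_by_time(recording, 8000, 1.0, 2.5)
print(len(segment))  # 12000
```

In practice you would load the samples with an audio library of your choice; plain lists are used here only to keep the sketch self-contained.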


Other Voice and Sound Datasets

AudioSet (Google): Google’s AudioSet is a large-scale dataset of over 2 million sound clips categorized into 632 audio event classes. The sound clips are each 10 seconds long, drawn from YouTube videos. The event classes cover a wide range of human and animal sounds, musical instruments, and common everyday environmental sounds. 

Free Spoken Digit Dataset: This dataset is a collection of 2,000 recordings of spoken numerical digits in English. The researchers have trimmed these clips so that they have almost no silence at the beginning and end.
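The article does not say how the clips were trimmed, so the following is only a sketch of one simple approach: drop leading and trailing samples whose amplitude falls below a threshold. The function name and the threshold value are assumptions made for illustration.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples with absolute amplitude below threshold."""
    # Indices of all samples loud enough to count as signal.
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is silence
    # Keep everything from the first loud sample through the last one.
    return samples[loud[0]:loud[-1] + 1]


clip = [0.0, 0.001, 0.3, -0.5, 0.0, 0.2, 0.002, 0.0]
print(trim_silence(clip))  # [0.3, -0.5, 0.0, 0.2]
```

Note that quiet samples in the middle of the clip are kept; only the edges are trimmed.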


Still can’t find the data you need? Lionbridge provides audio data services to develop, calibrate, and improve voice-enabled applications. Reach out to our team to unlock access to a network of 500,000 qualified linguists, data scientists, and project managers who can collect voice and sound data for a wide range of use cases. 

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, Rei also studied abroad in the US. Rei is a huge people person and is passionate about long-distance running, traveling, and discovering new music on Spotify.

