Like most other machine learning applications, automatic speech recognition (ASR) systems require data from a broad range of participants and environments to perform accurately. With this in mind, we at Lionbridge have put together a list of the best publicly available speech recognition datasets. Divided by use case, the data spans everything from speaker identification to speech commands.
General Voice Recognition Datasets
Speech Accent Archive: The speech accent archive was established to uniformly exhibit a large set of speech accents from a variety of language backgrounds. The dataset contains 2,140 English speech samples, each from a different speaker reading the same passage. Participants come from 177 countries and speak 214 different native languages.
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): RAVDESS contains recordings of 24 professional actors (12 female and 12 male) vocalizing the same statements. The speech emotions captured include calm, happy, sad, angry, fearful, surprise, and disgust, each at two levels of intensity.
TED-LIUM Corpus: The TED-LIUM corpus is made from TED talks and their transcriptions available on the TED website. It consists of 2,351 audio samples totaling 452 hours, along with 2,351 aligned automatic transcripts in STM format.
Google Audioset: This dataset contains an expanding ontology of 635 audio event classes and a collection of over 2 million 10-second sound clips drawn from YouTube videos. Google used human labelers to add metadata, context, and content analysis.
LibriSpeech ASR Corpus: This corpus contains over 1,000 hours of English speech derived from audiobooks. Most of the recordings are based on texts from Project Gutenberg.
Speaker Identification Datasets
Gender Recognition by Voice: This database aims to help systems identify whether a voice is male or female based upon acoustic properties of the voice and speech. To that end, the dataset consists of over 3,000 recorded voice samples collected from male and female speakers.
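The dataset's published features are precomputed acoustic measurements, and a classifier is trained on those rather than on raw audio. As a rough illustration of what one such acoustic property looks like, here is a zero-crossing rate, one simple feature that loosely tracks spectral content (this specific feature is our example, not necessarily one the dataset uses):

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign changes.

    A cheap proxy for spectral content: higher-pitched, noisier signals
    tend to score higher. Expects mono PCM audio as signed numbers.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)  # sign flip between neighbors
    )
    return crossings / (len(samples) - 1)
```

In practice a real pipeline would compute many such features (mean frequency, spectral entropy, and so on) per recording before fitting a classifier.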
Common Voice: This dataset contains hundreds of thousands of voice samples for voice recognition, comprising over 500 hours of speech recordings alongside speaker demographics. To build the corpus, content was drawn from user-submitted blog posts, old movies, books, and other public speech.
VoxCeleb: VoxCeleb is a large-scale speaker identification dataset that contains over 100,000 utterances from 1,251 celebrities. Like the previous datasets, VoxCeleb covers a diverse range of accents, professions, and ages.
Speech Commands Datasets
Google Speech Commands Dataset: Created by the TensorFlow and AIY teams, this dataset contains 65,000 one-second clips. Each clip contains one of 30 different voice commands spoken by thousands of different subjects.
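Fixed-duration clips like these make batching straightforward, but recordings pulled from other sources rarely line up exactly. A minimal sketch for normalizing clips to the same length, assuming mono 16 kHz audio as a flat sequence of integer samples (so 16,000 samples equals one second):

```python
def to_fixed_length(samples, target_len=16000, pad_value=0):
    """Pad with silence or truncate a clip to exactly target_len samples.

    Assumes mono audio as a flat sequence of samples; at 16 kHz,
    target_len=16000 corresponds to the dataset's one-second clips.
    """
    samples = list(samples[:target_len])          # truncate if too long
    samples.extend([pad_value] * (target_len - len(samples)))  # pad if short
    return samples
```

Whether to pad at the end, the start, or symmetrically is a design choice; padding at the end is the simplest and is usually adequate for keyword clips.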
Synthetic Speech Commands Dataset: Created by Pete Warden, the Synthetic Speech Commands Dataset is made up of short speech samples. Each file contains a single-word utterance such as yes, no, up, down, on, off, stop, or go.
Fluent Speech Commands Dataset: This comprehensive dataset contains over 30,000 utterances from nearly 100 speakers. In this dataset, each .wav file contains a single utterance used to control smart-home appliances or virtual assistants. For example, sample recordings include “put on the music” or “turn up the heat in the kitchen”. In addition, all audio contains action, object, and location labels.
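Since every recording carries action, object, and location labels, a natural first step is mapping each file to its label triple. A minimal sketch of parsing such a label file; the exact column names below are assumptions for illustration, so check the header of the CSV shipped with the dataset you download:

```python
import csv
import io


def load_slot_labels(csv_text):
    """Parse a label CSV into {audio path: (action, object, location)}.

    The column names ("path", "action", "object", "location") are
    assumed for this sketch and may differ in the real release.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["path"]: (row["action"], row["object"], row["location"])
        for row in reader
    }
```

For example, a row describing "turn up the heat in the kitchen" might map a .wav path to the triple ("increase", "heat", "kitchen").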
Conversational Speech Recognition Datasets
The CHiME-5 Dataset: This dataset is made up of recordings of 20 separate dinner parties that took place in real homes. Each recording is at least 2 hours long and includes audio captured in the kitchen, living room, and dining room.
2000 HUB5 English Evaluation Transcripts: Developed by the Linguistic Data Consortium (LDC), HUB5 consists of transcripts of 40 English telephone conversations. The HUB5 evaluation series focuses on the task of transcribing conversational telephone speech into text.
CALLHOME American English Speech: Developed by the Linguistic Data Consortium (LDC), this dataset consists of 120 unscripted 30-minute telephone conversations in English. Due to the conditions of the study, most participants called family members or close friends.
Multilingual Speech Data
CSS10: A collection of single speaker speech datasets for 10 languages. The dataset contains short audio clips in German, Greek, Spanish, French, Finnish, Hungarian, Japanese, Dutch, Russian and Chinese.
BACKBONE Pedagogic Corpus of Video-Recorded Interviews: A web-based pedagogic corpus of video-recorded interviews with native speakers of English, French, German, Polish, Spanish, and Turkish, as well as non-native speakers of English.
Arabic Speech Corpus: This speech corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of Modern Standard Arabic (MSA) speech.
Nijmegen Corpus of Casual French: The Nijmegen corpus includes 35 hours of high-quality recordings featuring 46 French speakers conversing among friends, orthographically annotated by professional transcribers.
Free Spoken Digit Dataset: This simple dataset contains recordings of spoken digits, trimmed so that there is near-minimal silence at the beginnings and ends.
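Trimming of this kind is easy to reproduce on your own recordings. A minimal sketch that strips leading and trailing low-amplitude samples, assuming mono integer PCM audio and a hand-tuned amplitude threshold:

```python
def trim_silence(samples, threshold=100):
    """Drop leading and trailing samples whose absolute amplitude is
    below `threshold` (raw integer PCM units; tune per recording).
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1  # advance past quiet leading samples
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1    # retreat past quiet trailing samples
    return samples[start:end]
```

A fixed amplitude threshold is crude compared with energy-based voice activity detection, but it is close in spirit to how short, clean clips like spoken digits are typically prepared.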
Spoken Wikipedia Corpora: This is a corpus of aligned spoken articles from Wikipedia. In addition to English, the data is also available in German and Dutch.
Still can’t find what you need? With over two decades of hands-on experience, Lionbridge has helped the world’s largest companies to train, test, and fine-tune ASR systems. Our community of over 1 million qualified linguists can help you obtain the speech recognition datasets you need to train your model effectively. Get in touch today for a free consultation.