“Most people take for granted that when they speak, they will be heard and understood.”
-Fadi Biadsy, Research Scientist at Google AI
Parrotron is a new project in development that seeks to make speech-to-text programs, virtual assistants, and other voice recognition systems accessible to people with speech impediments. The project is a joint effort between the Speech Team and the Google Brain Team at Google AI. People with neurological or physical speech impediments can have difficulty communicating with other people, let alone being understood by automatic speech recognition (ASR) systems. Research scientists and engineers at Google AI are developing Parrotron to help people with speech impairments communicate more effectively with both humans and ASR systems.
An overview of the Parrotron model
Virtual assistants, chatbots, and other natural language processing (NLP) solutions that rely on ASR are often inaccessible to people with speech impediments. Parrotron uses an end-to-end deep neural network to convert irregular or atypical speech into fluent speech. Most ASR systems first convert input speech to text so that the system can understand what was said and respond accordingly. Parrotron, by contrast, "parrots" the input by converting speech directly to speech, skipping the speech-to-text conversion step entirely.
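The contrast between the two routes can be sketched as follows. This is a minimal illustration, not Google's implementation: all function names are invented, and the "converter" is a toy linear map standing in for Parrotron's end-to-end network over spectrogram frames.

```python
import numpy as np

def cascaded_pipeline(input_spectrogram, asr, nlu, tts):
    """Conventional route: speech -> text -> understanding -> speech.
    The intermediate ASR step may fail on atypical speech."""
    text = asr(input_spectrogram)
    response_text = nlu(text)
    return tts(response_text)

def direct_conversion(input_spectrogram, converter):
    """Parrotron-style route: one network maps input audio features
    directly to output audio features, with no text in between."""
    return converter(input_spectrogram)

# Toy stand-in converter: a fixed linear map applied to each frame.
rng = np.random.default_rng(0)
projection = rng.standard_normal((80, 80)) * 0.1

def toy_converter(spec):
    return spec @ projection  # (frames, 80) -> (frames, 80)

atypical_speech = rng.standard_normal((120, 80))  # 120 frames, 80 mel bins
fluent_speech = direct_conversion(atypical_speech, toy_converter)
print(fluent_speech.shape)
```

Because no text transcript is produced along the way, errors from misrecognizing atypical speech never enter the pipeline; the network regenerates audio features of the same shape directly.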
The video below features Dimitri Kanevsky, a research scientist at Google who is profoundly deaf and learned to speak English using Russian phonetic pronunciations. Dimitri demonstrates Google Assistant’s ability to understand his questions before and after using Parrotron.
Parrotron Demo 1
Parrotron is trained in two phases. The first phase uses millions of utterance pairs, each matching a natural utterance with a synthesized version of the same phrase. The natural utterances cover a variety of accents, dialects, and noise conditions, giving the model a baseline for "typical" speech. The second phase adapts the model to an individual speaker's atypical speech patterns, which differ from speaker to speaker. Each input speaker contributes utterances to the training data so the model can learn that speaker's unique speech characteristics.
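The two-phase recipe above can be sketched with a toy model: pretrain on a large "typical" corpus, then fine-tune the same weights on a small speaker-specific set. Everything here is an illustrative stand-in — the linear model, the synthetic data, and the hyperparameters are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 16  # toy feature dimension standing in for spectrogram features

def make_pairs(n, transform):
    """Generate (input utterance, target synthesized utterance) pairs."""
    x = rng.standard_normal((n, DIM))
    return x, x @ transform

def train(weights, x, y, lr=0.05, steps=200):
    """Plain gradient descent on mean squared error."""
    for _ in range(steps):
        grad = 2 * x.T @ (x @ weights - y) / len(x)
        weights = weights - lr * grad
    return weights

# Phase 1: pretrain on a large corpus of "typical" speech pairs.
typical_map = np.eye(DIM)
x1, y1 = make_pairs(2000, typical_map)
w = train(np.zeros((DIM, DIM)), x1, y1)

# Phase 2: fine-tune the same weights on a much smaller,
# speaker-specific set whose mapping differs from the baseline.
speaker_map = np.eye(DIM) * 0.8
x2, y2 = make_pairs(200, speaker_map)
w = train(w, x2, y2)

err = np.mean((x2 @ w - y2) ** 2)
print(round(float(err), 4))
```

The design point the sketch captures: phase 2 starts from the pretrained weights rather than from scratch, so only a small amount of speaker data is needed to adapt the general model to one person's speech.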
Parrotron Demo 2
Google AI has also taken utterances from an ALS speech corpus to train the model on general ALS speech patterns. For speakers with highly distinctive speech characteristics, adaptation can require substantial data: Kanevsky contributed 15 hours of speech to train the model on his voice. The results, however, are remarkable, and the technology promises significant benefits for people with speech impairments.
Do you require your own audio corpus for ASR, text-to-speech, or other machine learning projects? Lionbridge AI provides a variety of audio data collection and audio data annotation services for machine learning. With a specialization in linguistics and a global multilingual crowd, Lionbridge is a leading provider of custom audio AI training data. Learn more about how Lionbridge can help.
Multilingual Audio Data Annotation Services
Lionbridge provides professional audio data annotation services in over 300 languages.
Some of our most popular languages include: