OK Google, How does Alexa Work? — Voice Assistants Explained
Article by Rei Morikawa | January 22, 2019

In our previous articles, we’ve already addressed the current shortage of AI training data, and the effects that it has on AI innovation. This time, let’s take an in-depth look at one particular use case that is near and dear to many people’s daily lives — voice assistants.


Voice assistants enjoy growing popularity

The $55-billion voice recognition industry has been forecast to grow at an annual rate of 17% between 2018 and 2025. AI software dominates the speech and voice recognition market, and is expected to grow at an astronomical rate of 30% in the same period.

Voice assistants come in small packages and can perform various actions after hearing a wake word or command (such as "Alexa" or "OK Google"). They can turn on lights, play music, check the weather forecast, place online shopping orders, make restaurant reservations, and more.

This year, new car models with built-in voice assistants will be released. Toyota has already begun integrating Amazon Alexa into new Toyota and Lexus models, while BMW, Mercedes-Benz, and Ford have also begun implementing voice assistants in their new models. Soon, you'll be able to ask Alexa for directions or parking information without pulling out your phone. Data scientists are also designing these in-car voice assistants to sync with the home voice assistants that you might already own.


Technology behind voice assistants

Most people have heard of the most popular voice assistants, namely Alexa and Siri. But they might not know about natural language processing and speech recognition, the technologies behind our favorite voice assistants. Speech recognition software analyzes the user's speech using the following basic process.

  1. Filter the words that the user says;
  2. Digitize the user’s speech into a format that the machine can read;
  3. Analyze the user’s speech for meaning;
  4. Decide what the user needs based on previous input and algorithms.
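
To make the four steps concrete, here is a minimal, self-contained Python sketch of the pipeline. Every function is a toy stand-in for a full subsystem (signal processing, speech-to-text, natural language understanding, and dialogue management), and none of the names correspond to a real assistant's API.

def filter_audio(raw_audio):
    # 1. Filter: a crude noise gate that drops very quiet samples.
    return [s for s in raw_audio if abs(s) > 0.01]

def digitize(samples):
    # 2. Digitize: a real system would convert the waveform into features
    #    and run a speech-to-text model; here we pretend it returned this.
    return "what is the weather forecast for tomorrow"

def analyze(text):
    # 3. Analyze: map the transcript to a structured meaning (an intent).
    if "weather" in text or "rain" in text:
        return {"intent": "get_weather", "day": "tomorrow"}
    return {"intent": "unknown"}

def decide(intent, history):
    # 4. Decide: combine the intent with previous input to pick a response.
    if intent["intent"] == "get_weather":
        return "Fetching the forecast for " + intent["day"] + "..."
    return "Sorry, I didn't catch that."

fake_audio = [0.0, 0.3, -0.2, 0.005, 0.4]   # stand-in for a recorded waveform
text = digitize(filter_audio(fake_audio))
print(decide(analyze(text), history=[]))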


For the previous input and algorithms in step 4, large amounts of audio training data are required to build effective voice assistants that can understand and fulfill user commands.

Of course, machine learning researchers can only use raw audio data to train their algorithms once it has been cleaned and labeled. For voice assistants, the audio training dataset should include a large volume of accurate language data, which ensures that the algorithm can understand and respond to human speech in different environments and contexts. For example, the Chinese language has 130 spoken dialects and 30 written languages, which creates a huge demand for cutting-edge tech solutions and processed training datasets.
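
As a rough illustration of what a cleaned and labeled sample might look like, the hypothetical record below pairs one audio clip with the metadata an algorithm would need. The field names are an assumption for illustration, not a standard annotation schema.

# Hypothetical example of a single labeled audio sample.
# The field names are illustrative, not a standard annotation format.
labeled_sample = {
    "audio_file": "clips/sample_00421.wav",          # cleaned audio clip
    "transcript": "turn on the living room lights",  # verbatim transcription
    "language": "en",
    "dialect": "US English (Midwest)",               # matters for languages with many dialects
    "speaker": {"gender": "female", "age_range": "30-39"},
    "environment": "home, background TV noise",      # recording context
    "intent": "smart_home.lights_on",                # the label the model learns to predict
}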

The audio training dataset should also include different variations of the same request. For example, users who want to know whether it will rain tomorrow might ask different questions such as:

  • Is it going to rain tomorrow?
  • What is the chance of rain tomorrow?
  • What is the weather forecast for tomorrow?
  • Should I carry an umbrella tomorrow?

An effective voice assistant should be trained to understand that these are all different ways of asking the same underlying question: whether it will rain tomorrow.
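
One common way to encode this is to label many surface variations with the same intent, so that a model learns to treat them as equivalent. The snippet below is a hypothetical illustration of such intent-labeled training pairs, not data from any real assistant.

# Hypothetical intent-labeled training examples: many phrasings, one intent.
weather_examples = [
    ("Is it going to rain tomorrow?",              "get_weather_forecast"),
    ("What is the chance of rain tomorrow?",       "get_weather_forecast"),
    ("What is the weather forecast for tomorrow?", "get_weather_forecast"),
    ("Should I carry an umbrella tomorrow?",       "get_weather_forecast"),
    ("Play some jazz",                             "play_music"),  # contrasting intent
]

# A classifier trained on pairs like these maps any of the first four
# utterances to the same downstream action: look up tomorrow's forecast.
for utterance, intent in weather_examples:
    print(intent, "<-", utterance)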


Limitations of voice assistants

Voice assistants rely heavily on natural language processing, so they are also constrained by its limitations. Natural language processing is often associated with chat or text interfaces, but it is just as important for audio language technologies such as voice assistants, mobile phones, and call centers. At its current stage, natural language processing struggles with the complexities inherent in elements of speech such as metaphors and similes. Most human speech is not linear: we sometimes forget what we were talking about, ask tangential questions, or ask about multiple things at once. This is tough for a machine to follow algorithmically.

In addition, voice involves unique challenges that text does not have to deal with, such as background noise and accents. The algorithm must overcome these additional challenges to deliver a good user experience.

Another problem with voice assistants is that they are often biased towards the majority. The answers that Alexa or Siri give closely match the needs of whoever produced most of the training data, but they might not be helpful for minorities. This is unfortunate, since voice assistants first became popular for their role in streamlining people's lives.


The future of voice assistants

I hope that better access to audio training data will give birth to machine learning applications that we cannot even imagine today. To improve voice assistants even further, it’s important to make sure that they serve everyone, including minorities and niche demographics.

Lionbridge AI recently had the opportunity to play a part in the development of this kind of service. We collected speech samples from foreigners living in Japan who speak imperfect Japanese, and a car manufacturer used the data to improve the voice-controlled navigation assistant in its cars. Few companies today can afford to acquire the training data necessary to include such a niche demographic.

Another group that voice assistants at their current stage fail to serve is people with speech impairments; for them, Siri might not be able to understand a single sentence. Google is working to improve access for people with speech impairments, and Amazon has even made voice assistants for sign language users.

There are also several emerging tech companies currently working to make voice assistants accessible to people with speech impairments. For example, Danny Weissberg created an app called Voiceitt after his grandmother suffered a stroke and lost her ability to speak. Voiceitt is focused on delivering speech recognition technology that understands non-standard speech. The app works by having the user create a personal dictionary of their own pronunciations, which it then translates into standard speech to control other voice-enabled devices. To create the dictionary, the user composes and then reads out everyday phrases like "I'm hungry" or "Turn on the lights." The Voiceitt software records the speech and gradually learns the user's particular pronunciation. Once trained, the app acts like an instant translator: the user speaks a phrase, and Voiceitt reads or types it out in standard speech for voice assistants to process.
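
Conceptually, the personal dictionary behaves like a mapping from the user's own recorded pronunciations to standard phrases. The sketch below is a deliberately simplified illustration of that idea; it is not Voiceitt's actual implementation, which learns from audio rather than matching identifiers.

# Deliberately simplified sketch of the "personal dictionary" idea above.
# This is NOT Voiceitt's implementation: a real system matches new audio
# against learned acoustic patterns, not exact identifiers.

personal_dictionary = {
    # key: an identifier for the user's recorded pronunciation
    # value: the standard phrase it should translate to
    "recording_001": "I'm hungry",
    "recording_002": "Turn on the lights",
}

def translate_to_standard(matched_recording_id):
    # Return the standard phrase for the recording the recognizer matched,
    # so a downstream voice assistant can process it as ordinary speech.
    return personal_dictionary.get(matched_recording_id, "<unrecognized>")

print(translate_to_standard("recording_002"))  # -> Turn on the lights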

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.