Top 6 Speech Data Collection Services for Machine Learning

Article by Charly Walther | May 02, 2019

Audio recordings or utterances collected from humans are an important input for a wide range of machine learning applications. Most applications controlled via voice commands, such as voice assistants like Siri or Alexa, or more specific applications such as in-car navigation systems, rely on training data sourced from actual humans recording their voices.

For a company to acquire such training data, two components are necessary.

  1. A tool via which humans can record their speech data according to the specifications required by the engineers.
  2. The crowd management capabilities necessary to recruit and manage the contributors that record their speech.

Rather than building those tools and recruiting those contributors themselves, many companies that need this type of machine learning data rely on third-party providers to help them collect speech data. In this article, we have assembled six options for outsourcing speech data collection.


1. Lionbridge AI

Lionbridge AI provides voice data collection services to startups as well as several of the largest technology companies in the world. With solution centers in 27 countries and over 500,000 contributors across 300 languages, Lionbridge AI covers both simple speech collection use cases and linguistically complex long-term projects. Clients can send raw data or instructions, which Lionbridge AI serves via its online platform and remote crowd, or they can get custom staffing solutions when there are specific requirements such as secure locations, dedicated workforces, or custom devices.


2. Globalme

Globalme is a Vancouver-based company offering speech data collection services. They don’t offer independent access to tooling where clients could bring their own crowd, but they do offer crowd management and sourcing of participants. Use cases covered in the past include collecting voice samples for smartwatches, speaker systems, in-car speech systems, and general voice assistants. Outside of speech data collection, the company also offers testing and localization services.


3. Appen

Appen is based in Australia and provides a range of data annotation and collection services. While a large share of its work revolves around moderating content, it also supports voice data collection across a large number of languages. As a service provider, Appen does not offer tools for other companies to use; instead, it directly provides the data it sources through its crowdsourcing services.


4. Clickworker

Clickworker is a company based in Germany that offers a wide range of data annotation and collection services. Its services include data annotation for eCommerce and retail solutions, as well as survey tasks. In terms of audio data, the company covers both collection and transcription of voice recordings in a variety of languages. Clickworker also supports a mobile application to make it easier to collect speech samples from its contributors.


5. Magpi

Magpi is a mobile data collection platform with employees in Nairobi, Washington, and London. Having originated in the healthcare space, Magpi particularly supports NGOs in their data collection efforts across many countries. Its mobile phone platform allows for flexible creation of forms that can be operated across a range of devices. Among the supported form components are GPS and voice data collection.


6. Socialcops

Socialcops provides a platform for managing unstructured data. Alongside various data management and visualization tasks, the platform also includes data collection capabilities. To collect data, Socialcops provides a mobile application that lets users put together custom-tailored forms with various functionalities. In addition to audio, video, and image capturing capabilities, the platform supports data validation, team management, and offline working functions.


At Lionbridge AI, we’re dedicated to educating our audience about the best options to create, collect, annotate, or validate data for machine learning applications. If you have specific questions or further needs regarding AI training data tasks, please feel free to reach out to our team.

The Author
Charly Walther

Charly Walther is VP of product and growth at Lionbridge AI, a global, people-powered translation platform optimized for developers of multilingual machine learning and AI applications. With 20 years of know-how in providing AI training data, Lionbridge has an impressive track record of successful projects with the world’s top technology companies. Walther joined Lionbridge from Uber, where he was a product manager in Uber’s Advanced Technologies Group.

