
Audio recordings of human speech, or utterances, are an important input for a wide range of machine learning applications. Most applications that are controlled via voice commands, such as voice assistants like Siri or Alexa, or more specific applications such as in-car navigation systems, rely on training data sourced from actual humans recording their voice.
For a company to acquire such training data, two components are necessary:
- A tool via which humans can record their speech data according to the specifications required by the engineers.
- The crowd management capabilities necessary to recruit and manage the contributors that record their speech.
Rather than building those tools and recruiting those contributors themselves, many companies that need this type of machine learning data rely on third-party providers to help them collect speech data. In this article, we have assembled several options for outsourcing speech data collection.
1. Lionbridge AI
Lionbridge AI provides voice data collection services to startups as well as several of the largest technology companies in the world. With solution centers in 27 countries and over 500,000 contributors across 300 languages, Lionbridge AI covers both simple speech collection use cases and linguistically complex long-term projects. Clients can either send raw data or instructions, which Lionbridge AI will serve via its online platform and remote crowd, or get custom staffing solutions when there are specific requirements such as secure locations, dedicated workforces, or custom devices.
2. Globalme
Globalme is a Vancouver-based company offering speech data collection services. They don’t offer independent access to tooling that clients could use with their own crowd, but they do offer crowd management and sourcing of participants. Use cases covered in the past include collecting voice samples for smartwatches, speaker systems, in-car speech systems, and general voice assistants. Outside of speech data collection, the company also offers testing and localization services.
3. Appen
Appen is based in Australia and provides a range of data annotation and collection services. While a large share of their work revolves around moderating content, they also support voice data collection across a large number of languages. Being a service provider, Appen does not offer tools for other companies to use but directly provides the data that it sources through its crowdsourcing services.
4. Clickworker
Clickworker is a company based in Germany that offers a wide range of data annotation and collection services. Their services include data annotation for eCommerce and retail solutions, as well as survey tasks. In terms of audio data, the company covers both collection and transcription of voice recordings in a variety of languages. Clickworker also supports a mobile application to make it easier to collect speech samples from their contributors.
5. Magpi
Magpi is a mobile data collection platform with employees in Nairobi, Washington, and London. Having originated in the healthcare space, Magpi particularly supports NGOs in their data collection efforts across many countries. Their mobile phone platform allows for flexible creation of forms that can be operated across a range of devices. Among those flexible components, GPS and voice data collection are both supported.
6. Socialcops
Socialcops provides a platform for managing unstructured data. As part of this platform, which supports various data management and visualization tasks, Socialcops also includes data collection capabilities. To collect data, Socialcops provides a mobile application that lets users put together custom-tailored forms with various functionalities. In addition to audio, video, and image capture, the platform supports data validation, team management, and offline work.
At Lionbridge AI, we’re dedicated to educating our audience about the best options to create, collect, annotate, or validate data for machine learning applications. If you have specific questions or further needs regarding AI training data tasks, please feel free to reach out to our team.