Many machine learning companies struggle to find ways to rapidly build datasets for training their algorithms. In this piece, we outline the best outsourced data annotation companies and the services they provide.
Data annotation is the process of adding contextual information to raw data to serve as training examples for machine learning models. For more information about what AI training data is, check out our article on the topic.
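To make the definition concrete, here is a minimal Python sketch of what one annotated training example might look like for an entity extraction task. The class names, the label set (`PERSON`, `DATE`), and the sample sentence are all hypothetical, chosen only for illustration; real annotation tools each define their own formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EntitySpan:
    """A labeled character range inside the raw text (hypothetical schema)."""
    start: int  # index of the first character of the span
    end: int    # index one past the last character of the span
    label: str  # annotator-assigned category, e.g. "PERSON"


@dataclass
class AnnotatedExample:
    """Raw text plus the contextual information an annotator added to it."""
    text: str
    entities: List[EntitySpan] = field(default_factory=list)

    def annotate(self, phrase: str, label: str) -> None:
        # Locate the first occurrence of the phrase and record its span.
        # str.index raises ValueError if the phrase is not in the text.
        start = self.text.index(phrase)
        self.entities.append(EntitySpan(start, start + len(phrase), label))

    def labeled_substrings(self) -> List[Tuple[str, str]]:
        # Recover (substring, label) pairs from the stored offsets.
        return [(self.text[e.start:e.end], e.label) for e in self.entities]


# Raw, unlabeled data (an invented sentence).
example = AnnotatedExample(text="Dr. Smith reviewed the chest X-ray on Monday.")

# The annotation step: a human marks which spans carry which meaning.
example.annotate("Dr. Smith", "PERSON")
example.annotate("Monday", "DATE")

print(example.labeled_substrings())
```

The raw sentence alone tells a model nothing about which words matter; the stored spans and labels are the "contextual information" that turns it into a training example.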
Before deciding to annotate data internally or externally, consider the following factors:
- How much training data do you need? There is no standard answer for the amount of data required to achieve adequate model performance. It depends on the type of model, the training method you’re using, and the acceptable tolerance for errors.
- Do annotators require specialized expertise? Building a medical imaging model and an entity extractor are two different problems, each with its own unique set of issues. Medical, legal or technical data generally requires specialized skills to annotate, meaning annotators will need a significant amount of education and training prior to handling data.
- Do you have the bandwidth to develop annotation tools in-house? If you take on annotation internally, you’ll need to invest in the annotation process itself, from designing annotation tools from scratch to creating annotator onboarding materials.
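One rough way to weigh these factors is a back-of-the-envelope estimate of the labeling effort involved. The Python sketch below is only an illustration: the function name, its parameters, and the numbers passed to it are assumptions, not benchmarks, and real throughput varies widely by task and annotator expertise.

```python
def estimate_annotation_hours(
    num_items: int,
    seconds_per_item: float,
    annotators_per_item: int = 1,
    review_overhead: float = 0.0,
) -> float:
    """Estimate total annotator hours for a labeling project.

    num_items: number of raw data items to label.
    seconds_per_item: assumed average time one annotator spends per item.
    annotators_per_item: how many people label each item independently.
    review_overhead: extra fraction of time for review, e.g. 0.5 adds 50%.
    """
    labeling_seconds = num_items * seconds_per_item * annotators_per_item
    return labeling_seconds * (1 + review_overhead) / 3600


# Hypothetical project: 7,200 images, 10 seconds each,
# three annotators per image, no separate review pass.
hours = estimate_annotation_hours(7200, 10, annotators_per_item=3)
print(f"Estimated effort: {hours:.0f} annotator hours")
```

If the estimate exceeds what your team can absorb, or if the per-item time is high because the data needs specialist knowledge, that points toward outsourcing.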
Best Data Annotation Companies
Amazon Mechanical Turk: Mechanical Turk (aka MTurk) is a platform by Amazon where requesters pay workers to complete human intelligence tasks (HITs), which are small micro-tasks or assignments. Sample HITs include transcribing text and labeling images. The output can be used to build training and validation datasets for machine learning models.
Lionbridge AI: Similar to Mechanical Turk, Lionbridge AI is a solution for getting crowdsourced, human-annotated data. However, unlike Mechanical Turk, Lionbridge AI manages the entire data annotation process, from designing workflows to sourcing qualified workers. With over 500,000 contributors across 300 languages, Lionbridge AI covers both simple data annotation tasks and linguistically complex long-term projects. Clients can either send raw data and instructions or get custom staffing solutions when there are specific requirements such as secure locations, dedicated workforces, or custom devices.
Edgecase: Edgecase is a data factory that provides synthetic data and data labeling services for machine learning companies. With ties to universities and industry experts, Edgecase provides data annotation and custom-built complex datasets to AI companies in retail, agriculture, medicine, security and more.
Scale: With a focus on computer vision applications, Scale offers a suite of managed labeling services via its annotation API to create the ground truth for machine learning models.
Hive: Hive is an end-to-end annotation platform that allows users to create training datasets for content categorization, computer vision, and more.
Figure Eight: Formerly known as CrowdFlower, Figure Eight provides human-in-the-loop software to automate tasks for machine learning algorithms.
Humans in the Loop: Humans in the Loop provides data labeling to train and improve computer vision machine learning solutions. Use cases include face recognition, self-driving cars, and figure detection.
Clickworker: Clickworker is a micro-tasking marketplace offering data management, web research, and AI algorithm training services.
Appen: With a crowd of 400,000 workers on the platform, Appen has experience annotating a wide variety of machine learning data types including speech, text, image and video.
Dbrain: Dbrain is a platform that connects 20,000 crowdworkers with data scientists to prepare and label data and deliver high-accuracy datasets ready for machine learning.
Need to annotate large datasets for machine learning? At Lionbridge AI, our network of experienced annotators is trained to label text and image data in over 300 languages.