What is Data Annotation and How is it Used in Machine Learning?

Article by Rei Morikawa | October 02, 2020

Data annotation is the task of labeling data with metadata in preparation for training a machine learning model. Both data and metadata come in many forms, including content types such as text, audio, images, and video. These annotated datasets can be used to train autonomous vehicles, chatbots, and translation systems.

In this article, we’ll explore six different types of data annotation and their most common uses in machine learning.

 

What is Data Annotation in Machine Learning?

Data annotation is the process of adding metadata to a dataset. This metadata usually takes the form of tags, which can be added to any type of data, including text, images, and video. Adding comprehensive and consistent tags is a key part of developing a training dataset for machine learning.

Data annotation is an indispensable stage of data preprocessing because supervised machine learning models learn to recognize recurring patterns in annotated data. After an algorithm has processed enough annotated data, it can start to recognize the same patterns when presented with new, unannotated data. As a result, data scientists need to use clean, annotated data to train machine learning models.
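As a rough illustration of what "comprehensive and consistent" tagging can mean in practice, the sketch below audits a small annotated dataset for missing tags and for identical inputs that were given conflicting tags. The record layout, the example texts, and the `audit` helper are all assumptions made for this example, not a standard format.

```python
from collections import defaultdict

# Hypothetical annotated dataset: each record pairs raw data with a tag.
records = [
    {"text": "The food was great", "label": "positive"},
    {"text": "Terrible service", "label": "negative"},
    {"text": "The food was great", "label": "negative"},  # conflicts with record 0
    {"text": "Would visit again", "label": None},          # tag is missing
]


def audit(records):
    """Return (indices of untagged records, texts tagged with conflicting labels)."""
    missing = [i for i, r in enumerate(records) if not r.get("label")]

    # Collect every distinct label that was applied to each text.
    labels_by_text = defaultdict(set)
    for r in records:
        if r.get("label"):
            labels_by_text[r["text"]].add(r["label"])

    conflicts = sorted(t for t, labels in labels_by_text.items() if len(labels) > 1)
    return missing, conflicts


missing, conflicts = audit(records)
print(missing)    # [3]
print(conflicts)  # ['The food was great']
```

A check like this is only a starting point; it catches gaps and exact-duplicate disagreements, not subtler inconsistencies between annotators.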

 

Data Annotation Types

There are many different types of data annotation, each suited to different use cases. In the sections below, we run through a few of the more common annotation types that are used for popular machine learning projects. It’s by no means an exhaustive list, but it should give you some idea of the breadth of the field of data annotation. Let’s dive in:

 

Semantic Annotation

Semantic annotation is the task of annotating various concepts within text, such as people, objects, or company names. Machine learning models use semantically annotated data to learn how to categorize new concepts in new texts. This can help to improve search relevance and train chatbots.
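One common way to store this kind of annotation is as character offsets into the text plus a concept label. The sketch below assumes that layout; the sentence, the label names, and the `annotate` helper are invented for illustration.

```python
# Made-up example sentence; the names in it are placeholders.
text = "Alice Smith writes articles for Acme."


def annotate(text, phrase, label):
    """Build a span annotation for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} does not occur in the text")
    # `end` is exclusive, so text[start:end] recovers the phrase.
    return {"start": start, "end": start + len(phrase), "label": label}


annotations = [
    annotate(text, "Alice Smith", "PERSON"),
    annotate(text, "Acme", "COMPANY"),
]

for a in annotations:
    print(a["label"], "->", text[a["start"]:a["end"]])
```

Storing offsets rather than copies of the phrase keeps the annotation tied to an exact position, which matters when the same word appears more than once.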

 

Image and Video Annotation

Image annotation comes in a variety of forms, from bounding boxes, which are rectangles drawn around objects in an image, to semantic segmentation, where every pixel in an image is assigned a class label. These labels help a machine learning model to recognize each annotated area as a distinct type of object. This type of data often serves as a ground truth for image recognition models that can recognize and block sensitive content, guide autonomous vehicles, or perform facial recognition tasks.
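The difference between the two forms can be sketched in a few lines: a bounding box is four numbers per object, while a segmentation annotation stores one class value per pixel. The `box_to_mask` helper, the coordinate convention (maximum edges exclusive), and the tiny 6 x 4 image below are assumptions made for this example. Real segmentation masks follow an object's outline rather than a rectangle.

```python
def box_to_mask(width, height, box, class_id):
    """Rasterize a box (x_min, y_min, x_max, y_max) into a per-pixel label grid.

    Pixels inside the box get `class_id`; all others get 0 (background).
    The max edges are treated as exclusive.
    """
    x_min, y_min, x_max, y_max = box
    return [
        [
            class_id if x_min <= x < x_max and y_min <= y < y_max else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# A hypothetical 6 x 4 image with one annotated object of class 1.
mask = box_to_mask(6, 4, (1, 1, 4, 3), class_id=1)
for row in mask:
    print(row)
# [0, 0, 0, 0, 0, 0]
# [0, 1, 1, 1, 0, 0]
# [0, 1, 1, 1, 0, 0]
# [0, 0, 0, 0, 0, 0]
```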

Similar to image annotation, video annotation often involves adding bounding boxes, polygons, or keypoints to content. This can be done on a frame-by-frame basis, with these frames then stitched together to help track the movement of the annotated object, or in the video itself using a video annotation tool. This type of data also plays an important role in the development of computer vision models for tasks like object tracking and localization.
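The frame-by-frame approach can be sketched as follows: each frame carries its own boxes, and grouping them by an object identifier yields a track of that object's movement. The tuple layout, the object identifiers, and the helper names here are hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hypothetical per-frame annotations: (frame_index, object_id, box),
# where box is (x_min, y_min, x_max, y_max) in pixels.
frame_annotations = [
    (0, "car_1", (10, 20, 50, 60)),
    (1, "car_1", (14, 20, 54, 60)),
    (0, "person_1", (100, 40, 120, 90)),
    (2, "car_1", (18, 21, 58, 61)),
]


def build_tracks(frame_annotations):
    """Group boxes by object and order them by frame to form one track per object."""
    tracks = defaultdict(list)
    for frame, object_id, box in sorted(frame_annotations, key=lambda a: a[0]):
        tracks[object_id].append((frame, box))
    return dict(tracks)


def box_center(box):
    """Return the (x, y) center of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


tracks = build_tracks(frame_annotations)
car_path = [box_center(box) for _frame, box in tracks["car_1"]]
print(car_path)  # [(30.0, 40.0), (34.0, 40.0), (38.0, 41.0)]
```

The sequence of box centers is a simple stand-in for the movement signal that an object tracking model would learn from.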

 

Text Categorization

Text categorization and content categorization refer to the task of assigning predefined categories to documents. For example, you can tag sentences or paragraphs within a document by topic, or organize news articles by subject, such as domestic, international, sports, or entertainment.
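Because the categories are predefined, a categorization dataset can be checked against the allowed set as it is assembled. The sketch below uses the news subjects mentioned above; the headlines and the `group_by_category` helper are made up for illustration.

```python
# The predefined category set, taken from the news example in the text.
CATEGORIES = {"domestic", "international", "sports", "entertainment"}

# Hypothetical annotated headlines: (title, category).
labeled_docs = [
    ("City council approves new budget", "domestic"),
    ("Local team wins championship final", "sports"),
    ("Trade talks resume between neighboring countries", "international"),
    ("Marathon record broken on home soil", "sports"),
]


def group_by_category(labeled_docs):
    """Organize titles by category, rejecting any label outside the predefined set."""
    groups = {category: [] for category in sorted(CATEGORIES)}
    for title, category in labeled_docs:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category {category!r} for {title!r}")
        groups[category].append(title)
    return groups


groups = group_by_category(labeled_docs)
print(len(groups["sports"]))    # 2
print(groups["entertainment"])  # []
```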

 

Entity Annotation

Entity annotation is the process of labeling the entities in unstructured sentences, such as names, places, and organizations, so that a machine can interpret them.

Within entity annotation, there are a multitude of processes that can be layered to create an understanding of language. An exhaustive list would be too long to reproduce here, but these examples should give a sense of the broad possibilities on offer:

  • Named entity recognition: Named entity recognition (NER) refers to the classification of named entities present in a body of text. These entities are labeled based on predefined categories such as person, organization, and place. Named entity recognition models add semantic knowledge to your content, making it easy for individuals and systems to quickly identify and understand the subject of any given text.
  • Entity linking: This is the process of annotating how the entities in a text relate to one another or to entries in a larger knowledge base. For example, you can tag a company and its employee, or a person and their hometown, as related concepts.
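The two processes above can be layered on the same sentence. The sketch below tags entities at the token level using the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside) and then records a relationship between two of them. The sentence, the label names, the relation name `employee_of`, and the helper functions are all assumptions made for this example.

```python
# Made-up example sentence, already split into tokens.
tokens = ["Alice", "Smith", "works", "at", "Acme", "in", "Paris"]

# Hypothetical entity spans over token indices: (start, end_exclusive, label).
entities = [(0, 2, "PERSON"), (4, 5, "ORGANIZATION"), (6, 7, "PLACE")]

# Hypothetical relation: (index of first entity, relation name, index of second entity).
relations = [(0, "employee_of", 1)]


def to_bio(tokens, entities):
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def entity_text(tokens, entity):
    """Return the surface text covered by an entity span."""
    start, end, _label = entity
    return " ".join(tokens[start:end])


print(to_bio(tokens, entities))
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-ORGANIZATION', 'O', 'B-PLACE']

for head, name, tail in relations:
    print(entity_text(tokens, entities[head]), f"--{name}-->", entity_text(tokens, entities[tail]))
# Alice Smith --employee_of--> Acme
```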

 

Intent Extraction

For chatbots, it’s important for the algorithm to accurately determine the user’s intent when they type in a query. For example, consider the following queries for a chatbot on a restaurant website:

I agree to pay the cancellation fee and cancel the reservation.

How much is the cancellation fee?

Do you charge a cancellation fee for no-shows?

All three examples contain the phrase ‘cancellation fee’, but all have different intents. In the first sentence, the intent is for the chatbot to take an action: cancel the reservation. The second and third sentences share a different intent: to receive more information about the restaurant’s cancellation fee policy. If the chatbot can’t recognize this, it might cancel the user’s restaurant reservation by mistake.

Intent extraction is the technical solution to the above problem. For intent extraction, we explicitly label user intents in the data on a phrase or sentence level. This way, the algorithm has a library of ways that people phrase certain requests, and the algorithm can begin to extrapolate about new sentences based on that ground truth.
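As a toy sketch of this idea, the three restaurant queries above can be stored with intent labels and a new query matched to the most similar labeled example by word overlap. The intent names and the overlap-based matching are assumptions made for illustration; a production chatbot would typically use a trained classifier rather than this nearest-example lookup.

```python
import re

# The example queries from the text, each labeled with a hypothetical intent name.
LABELED_QUERIES = [
    ("I agree to pay the cancellation fee and cancel the reservation.", "cancel_reservation"),
    ("How much is the cancellation fee?", "ask_about_fee"),
    ("Do you charge a cancellation fee for no-shows?", "ask_about_fee"),
]


def tokenize(sentence):
    """Lowercase the sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def predict_intent(query):
    """Return the intent of the labeled query with the highest word overlap (Jaccard)."""
    query_tokens = tokenize(query)

    def similarity(example):
        example_tokens = tokenize(example[0])
        union = query_tokens | example_tokens
        return len(query_tokens & example_tokens) / len(union) if union else 0.0

    _text, intent = max(LABELED_QUERIES, key=similarity)
    return intent


print(predict_intent("Please cancel my reservation"))            # cancel_reservation
print(predict_intent("How much do you charge for cancelling?"))  # ask_about_fee
```

With only three labeled examples the matching is fragile, which is one reason a larger library of labeled phrasings gives the algorithm more to extrapolate from.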

 

Phrase Chunking

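Phrase chunking consists of grouping the words of a sentence into phrases and tagging each phrase with its grammatical role, such as ‘noun phrase’ or ‘verb phrase’. It usually builds on part-of-speech tagging: some machine learning training datasets require every word to be annotated with its part of speech, such as ‘noun’ or ‘verb’, before the words are grouped into chunks.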

[Figure: an example of phrase chunking, created in Brat, the popular annotation tool for natural language processing.]
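The relationship between part-of-speech tags on individual words and chunk labels on phrases can be sketched as follows. The sentence, the tag names, and the `extract_chunks` helper are assumptions made for this example; the chunk tags use the common B-/I-/O prefix convention, where B- starts a phrase and I- continues it.

```python
# Hypothetical annotated sentence: (word, part-of-speech tag, chunk tag).
tagged = [
    ("The", "DET", "B-NP"),
    ("chatbot", "NOUN", "I-NP"),
    ("cancelled", "VERB", "B-VP"),
    ("the", "DET", "B-NP"),
    ("reservation", "NOUN", "I-NP"),
]


def extract_chunks(tagged):
    """Group consecutive words into (chunk_type, phrase) pairs using their chunk tags."""
    chunks = []
    current_words, current_type = [], None

    for word, _pos, tag in tagged:
        prefix, _, chunk_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)

        # Close the open chunk when a new one starts or the word is outside any chunk.
        if (prefix == "O" or starts_new) and current_words:
            chunks.append((current_type, " ".join(current_words)))
            current_words, current_type = [], None

        if prefix != "O":
            if not current_words:
                current_type = chunk_type
            current_words.append(word)

    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks


print(extract_chunks(tagged))
# [('NP', 'The chatbot'), ('VP', 'cancelled'), ('NP', 'the reservation')]
```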

 

Performing Data Annotation

Annotating your data can be a significant undertaking, but you don’t need to spend hours on data annotation by yourself. There are many third-party companies and individuals who can help you. For example, Lionbridge AI can help you with data annotation tasks for text, image, video, and audio datasets. We’ll help you to define the kind of data annotation services you’re looking for, develop a clear gold standard for your workers, and build a comprehensive dataset that’s suited to training your machine learning model. Get in touch below to learn more about how we can help.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.
