What is Data Annotation and How is it Used in Machine Learning?

Article by Rei Morikawa | June 12, 2019

Data annotation is the task of labeling data, which can take any form, such as text, audio, images, or video. In this article, we’ll explore the different types and uses of data annotation in machine learning.

 

Data annotation for machine learning

Data scientists need to use clean, annotated data to train machine learning models.

Data annotation is an indispensable stage of data preprocessing in supervised learning. Machine learning models learn to recognize recurring patterns in the annotated data. After an algorithm has processed enough annotated data, it can start to recognize the same patterns when presented with new, unannotated data.

 

Data annotation services for machine learning

What are the different types and uses for data annotation in machine learning?

 

Semantic annotation

Semantic annotation is the task of annotating various concepts within text, such as people, objects, or company names. Machine learning models use semantically annotated data as a reference for learning how to categorize new concepts in new texts. The end uses of semantic annotation include improving search relevance and training chatbots.

 

Image and video annotation

Annotated images and videos are used to train machine learning models that recognize and block sensitive content, guide autonomous vehicles, and categorize e-commerce product listings. These models need to understand the content of images and videos, so data scientists need large volumes of accurately annotated data to serve as the ground truth for training. For image and video annotation, we commonly use bounding boxes, which are rectangles drawn around objects in an image. The contents of each bounding box are labeled to help a machine learning model recognize them as a distinct type of object.
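To make the idea of a bounding box concrete, here is a minimal sketch of how one annotation might be stored and compared against the ground truth. The BoundingBox class, its field names, and the pixel coordinates are assumptions made for illustration, not a format from any particular tool. Intersection over union (IoU) is one common way to measure how closely two boxes agree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One annotated box: an object label plus pixel coordinates (hypothetical schema)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for no overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (a.area() + b.area() - inter)


# Made-up boxes for one image: the annotator's box and a model's guess.
ground_truth = BoundingBox("car", 10, 20, 110, 120)
predicted = BoundingBox("car", 60, 20, 160, 120)
print(round(iou(ground_truth, predicted), 3))
```

A score near 1.0 means the two boxes almost coincide. Where to draw the line between a correct and an incorrect box is a project-specific choice.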

 

Text categorization and content categorization

Text categorization and content categorization refer to the task of assigning predefined categories to documents. For example, you can tag sentences or paragraphs within a document by topic, or organize news articles by subject, such as domestic, international, sports, or entertainment.
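As a small sketch of what categorized data can look like, the snippet below stores a few made-up headlines with their assigned categories and runs two basic quality checks. The sample headlines and the helper names validate_labels and label_distribution are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Predefined categories, taken from the news example above.
CATEGORIES = {"domestic", "international", "sports", "entertainment"}

# Hypothetical annotated documents: (text, category) pairs.
annotated_docs = [
    ("Parliament passes the new budget bill", "domestic"),
    ("Home team wins the championship final", "sports"),
    ("Leaders meet at the overseas trade summit", "international"),
    ("Striker signs a record transfer deal", "sports"),
]


def validate_labels(docs, categories):
    """Return the documents whose label is not one of the predefined categories."""
    return [(text, label) for text, label in docs if label not in categories]


def label_distribution(docs):
    """Count how many documents were assigned to each category."""
    return Counter(label for _, label in docs)


print(validate_labels(annotated_docs, CATEGORIES))
print(label_distribution(annotated_docs))
```

Checks like these help catch typos in labels and heavily unbalanced categories before the data is used for training.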

 

Entity annotation

Entity annotation is the process of labeling unstructured sentences with information so that a machine can read them.

At Lionbridge AI, we leverage our crowd of over 500,000 language experts to annotate data for machine learning models. For example, Lionbridge AI’s crowd can annotate news articles in 300 different languages and tag which words are person, organization, or company names.
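As a rough illustration of what this kind of tagging can produce, the sketch below records each tagged phrase as a character span with a label. The sentence, the names in it, the label set, and the EntitySpan structure are all made up for this example. Real annotation tools each use their own formats.

```python
from typing import NamedTuple


class EntitySpan(NamedTuple):
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # predefined category, e.g. PERSON


def annotate(text, phrases):
    """Record each phrase found in the text as a labeled span.

    Only the first occurrence of each phrase is tagged, to keep the sketch short.
    """
    spans = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        if start != -1:
            spans.append(EntitySpan(start, start + len(phrase), label))
    return sorted(spans)


sentence = "Alice Tanaka joined Acme Corp in Osaka."
# Hypothetical decisions made by a human annotator.
spans = annotate(sentence, {
    "Alice Tanaka": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Osaka": "PLACE",
})
for span in spans:
    print(sentence[span.start:span.end], "->", span.label)
```

Storing offsets instead of copies of the text keeps the original sentence untouched and lets several layers of annotation point at the same words.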

Within entity annotation, there are a multitude of processes that can be layered to create an understanding of language. Many solutions have several of these baked into their systems, enabling data scientists to manipulate their data in a variety of useful ways. An exhaustive list would be too long to reproduce here, but these examples should give a sense of the broad possibilities on offer.

  • Named entity recognition: Named entity recognition (NER) refers to the classification of named entities present in a body of text. These entities are labeled based on predefined categories such as person, organization, and place. Named entity recognition models add semantic knowledge to your content, making it easy for individuals and systems to quickly identify and understand the subject of any given text.
  • Intent extraction: For chatbots, it’s important for the algorithm to accurately determine the user’s intent when they type in a query. For example, consider the following queries for a chatbot on a restaurant website.

I agree to pay the cancellation fee and cancel the reservation.

How much is the cancellation fee?

Do you charge a cancellation fee for no-shows?

All three examples contain the phrase “cancellation fee,” but they have different intents. In the first sentence, the intent is for the chatbot to take an action: cancel the reservation. The second and third sentences share a different intent: to receive more information about the restaurant’s cancellation fee policy. If the chatbot can’t recognize this, it might cancel the user’s restaurant reservation before the user has decided whether to cancel, a decision that may depend on the cancellation fee policy.

Intent extraction is the technical solution to the above problem. For intent extraction, we explicitly label user intents in the data on a phrase or sentence level. This way, the algorithm has a library of ways that people phrase certain requests, and the algorithm can begin to extrapolate about new sentences based on that ground truth.
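Here is a toy sketch of that idea, using the three restaurant queries as the labeled library. The intent names cancel_reservation and ask_fee_policy are assumptions, and the word-overlap scoring is deliberately naive. A production chatbot would more likely use a trained statistical classifier, but the role of the labeled data is the same.

```python
import re
from collections import Counter

# Hypothetical intent labels for the three restaurant queries.
labeled_queries = [
    ("I agree to pay the cancellation fee and cancel the reservation.", "cancel_reservation"),
    ("How much is the cancellation fee?", "ask_fee_policy"),
    ("Do you charge a cancellation fee for no-shows?", "ask_fee_policy"),
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count how often each word appears under each intent label."""
    counts = {}
    for text, intent in examples:
        counts.setdefault(intent, Counter()).update(tokenize(text))
    return counts


def predict(counts, text):
    """Pick the intent whose labeled examples share the most words with the query."""
    words = tokenize(text)
    return max(counts, key=lambda intent: sum(counts[intent][w] for w in words))


model = train(labeled_queries)
print(predict(model, "How much do you charge?"))
print(predict(model, "Please cancel the reservation"))
```

With only three labeled sentences, the guesses are fragile. The more phrasings the annotated library covers, the better the extrapolation to new sentences.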

 

Entity linking

Entity linking is the process of annotating the relationship between two parts of a text. For example, you can tag the relationship between a company and its employee, or between a person and their hometown.
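One way to picture this kind of annotation is as a list of links between entities that have already been tagged. The sketch below is an assumption-laden illustration: the Entity and Relation classes, the IDs, and the relation names employee_of and hometown are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    text: str


@dataclass(frozen=True)
class Relation:
    kind: str   # the type of relationship being annotated
    head: str   # id of the first entity
    tail: str   # id of the second entity


# Made-up entities that an annotator has already tagged in a text.
entities = {
    "T1": Entity("T1", "PERSON", "Alice Tanaka"),
    "T2": Entity("T2", "COMPANY", "Acme Corp"),
    "T3": Entity("T3", "PLACE", "Osaka"),
}

# The links between them, matching the company/employee and person/hometown examples.
relations = [
    Relation("employee_of", "T1", "T2"),
    Relation("hometown", "T1", "T3"),
]


def describe(relation, entities):
    """Render one annotated relationship as readable text."""
    head = entities[relation.head]
    tail = entities[relation.tail]
    return f"{head.text} --{relation.kind}--> {tail.text}"


for relation in relations:
    print(describe(relation, entities))
```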

 

Phrase chunking

Phrase chunking consists of tagging the words and phrases in a sentence with their linguistic or grammatical meaning. For example, some machine learning training datasets require every word to be annotated with its part of speech (noun, verb, etc.).

Phrase chunking annotations can be created in brat, a popular annotation tool for natural language processing. GATE is a similar annotation tool, but it is more complex and has a steeper learning curve.
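Whatever tool you use, the resulting data often boils down to one tag (or a few) per word. The sketch below assumes a simple scheme in which every word carries a part-of-speech tag and an IOB chunk tag (B- begins a phrase, I- continues it, O is outside any phrase). The sentence and the tag names are made up for illustration and are not the output format of brat or GATE.

```python
# Hypothetical annotation of one sentence: (word, part of speech, chunk tag).
tagged = [
    ("The", "DET", "B-NP"),
    ("chatbot", "NOUN", "I-NP"),
    ("cancelled", "VERB", "B-VP"),
    ("the", "DET", "B-NP"),
    ("reservation", "NOUN", "I-NP"),
    (".", "PUNCT", "O"),
]


def group_chunks(tokens):
    """Merge IOB-tagged words into (phrase_type, phrase_text) pairs."""
    chunks = []
    current = None  # the phrase currently being built, if any
    for word, _pos, tag in tokens:
        if tag.startswith("B-"):
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:
            # An "O" tag (or a stray I- tag) closes the open phrase.
            current = None
    return [(kind, " ".join(words)) for kind, words in chunks]


for kind, phrase in group_chunks(tagged):
    print(kind, ":", phrase)
```

Here NP stands for noun phrase and VP for verb phrase, so the sentence breaks down into two noun phrases around one verb phrase.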

This sounds super complex, but you don’t need to spend hours on data annotation by yourself. Many third-party companies and individuals can help you. Lionbridge AI is a crowdsourcing company that can help you with data annotation tasks for text, image, video, and audio datasets. It’s important to be clear about what kind of data annotation services you are looking for, and to give clear instructions to the third-party workers, because doing so will improve your ROI. Do your research and find the right partner who can help you build a successful machine learning algorithm.

