Training Data is Key for Natural Language Processing Algorithms

Article by Rei Morikawa | February 27, 2019

Natural language is the conversational language that we use in our daily lives. If machines could understand natural language, the potential uses for technology like chatbots would expand dramatically.

Since 2016, chatbot innovation has received a lot of media attention as a new interface that was expected to surpass smartphones.

Now, chatbot services are becoming popular at AI conferences, but at a global scale, chatbot technology is still in the beginning stages. That is because natural language processing technology has not advanced far enough to support chatbots.


What is natural language processing?

Natural language is the everyday language that you use in conversations with other people. Until recently, machines could not understand natural language. Now, data scientists are working on artificial intelligence technology that can.

There are three basic steps to natural language processing:

  1. The machine processes what the user said, and interprets the meaning according to a series of algorithms
  2. The machine decides the appropriate action in response to what the user said
  3. The machine produces an appropriate output response in a language that the user can understand
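The three steps above can be sketched as a toy pipeline. The intent names and keyword rules below are purely illustrative stand-ins for a real NLP model:

```python
# A minimal sketch of the three NLP steps: interpret, decide, respond.
# The keyword rules here are toy examples, not a real language model.

def understand(utterance: str) -> str:
    """Step 1: interpret the user's words as an intent."""
    text = utterance.lower()
    if "order" in text:
        return "place_order"
    if "weather" in text:
        return "get_weather"
    return "unknown"

def decide(intent: str) -> str:
    """Step 2: choose an action for the detected intent."""
    actions = {"place_order": "create_order", "get_weather": "fetch_forecast"}
    return actions.get(intent, "ask_clarification")

def respond(action: str) -> str:
    """Step 3: produce a reply the user can understand."""
    replies = {
        "create_order": "Your order has been placed.",
        "fetch_forecast": "Here is today's forecast.",
        "ask_clarification": "Sorry, could you rephrase that?",
    }
    return replies[action]

print(respond(decide(understand("Please order toilet paper"))))
# → Your order has been placed.
```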


In addition, there are three main parts to natural language processing (NLP) technology: understand, action, and reaction.

  1. Understand: First, the machine must understand the meaning of what the user is saying in words. This step uses natural language understanding (NLU), a subset of NLP.
  2. Action: Second, the machine must act on what the user said. For example, if you said “Hey Alexa, order toilet paper on Amazon!” then Alexa will understand and place that order for you.
  3. Reaction: Finally, the machine must respond to the user. Once Alexa has successfully ordered toilet paper for you on Amazon, she should tell you: “I ordered toilet paper and it should be delivered tomorrow.”

NLP researchers are currently focusing on the “understand” and “reaction” stages.
Now, many companies and data scientist groups are working on NLP research. But NLP applications such as chatbots still don’t have the same conversational ability as humans, and many chatbots are only able to respond with a few select phrases.

Sentiment analysis is an important part of NLP, especially when building chatbots. Sentiment analysis is the process of identifying and categorizing opinions in a piece of text, often with the goal of determining the writer’s attitude towards something. It affects the “reaction” stage of NLP. The same input text could require different reactions from the chatbot depending on the user’s sentiment, so we must also annotate sentiments and make the algorithm learn them.
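As a sketch of how a sentiment label can change the chatbot's reaction, here is a toy lexicon-based scorer. The word lists are hand-made illustrations, not a trained sentiment model:

```python
# Toy lexicon-based sentiment scoring (illustrative word lists only).
POSITIVE = {"great", "love", "helpful"}
NEGATIVE = {"broken", "hate", "useless"}

def sentiment(text: str) -> str:
    """Count positive and negative words to label the text's sentiment."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def react(text: str) -> str:
    """The same topic triggers a different reaction depending on sentiment."""
    return {
        "positive": "Glad to hear it!",
        "negative": "Sorry about that, let me help.",
        "neutral": "Got it.",
    }[sentiment(text)]

print(react("this product is broken"))
# → Sorry about that, let me help.
```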

To improve the decision-making ability of AI models, data scientists must feed them large volumes of training data, so those models can use it to figure out patterns. But raw data, such as audio recordings or text messages, is useless for training machine learning models. The data must first be labeled and organized into a training dataset.
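To illustrate the difference between raw data and a training dataset, here is a minimal sketch. The label names are hypothetical examples of the categories an annotator might attach:

```python
# Raw text messages on their own carry no supervision signal.
raw_messages = [
    "I want to cancel my subscription",
    "How do I reset my password?",
]

# After human annotation, each message carries a label the model can learn from.
# The label names below are hypothetical intent categories.
training_dataset = [
    {"text": raw_messages[0], "label": "cancel_subscription"},
    {"text": raw_messages[1], "label": "password_reset"},
]

for example in training_dataset:
    print(example["label"], "<-", example["text"])
```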


Natural Language Processing and Data Annotation

Entity annotation is the process of labeling unstructured sentences with information so that a machine can read them. For example, this could involve labeling all people, organizations and locations in a document. In the sentence “My name is Andrew,” we would need to make sure that [Andrew] is properly tagged as a person’s name, to ensure that the NLP algorithm is accurate.
※Related article: 50 beginner AI terms you should know
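Entity annotations like the [Andrew] example are often stored as labeled character spans, a common convention in NER tooling. A minimal sketch:

```python
# Sketch of entity annotation as labeled character spans
# (a format commonly used by NER tools and datasets).
sentence = "My name is Andrew"
annotations = [
    {"start": 11, "end": 17, "label": "PERSON"},  # the span covering "Andrew"
]

# The span indices must line up exactly with the text they label.
for ann in annotations:
    print(ann["label"], "->", sentence[ann["start"]:ann["end"]])
# → PERSON -> Andrew
```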

Linguistic text annotation is also crucial for NLP. Sometimes, instead of tagging people or place names, Lionbridge AI crowdworkers are asked to tag which words are nouns, verbs, adverbs, etc. These data annotation tasks can quickly become complicated. Not everyone can accurately label adverbs and prepositions.
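Part-of-speech annotation of this kind is typically recorded as (token, tag) pairs, similar to the column format used in many POS-tagged corpora. The tag names below follow the Universal Dependencies convention, chosen here as an illustration:

```python
# Sketch of part-of-speech annotation as (token, tag) pairs.
# Tags follow the Universal Dependencies convention (PRON, VERB, ADV, ...).
tagged = [
    ("She", "PRON"),
    ("runs", "VERB"),
    ("very", "ADV"),
    ("quickly", "ADV"),
]

# An annotator (or model) must decide, for every token, which tag applies.
adverbs = [token for token, tag in tagged if tag == "ADV"]
print(adverbs)
# → ['very', 'quickly']
```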


Natural Language Processing and Chatbots

There aren’t a lot of annotated datasets available for chatbot training, largely because it’s expensive to create these datasets.

Large datasets are useful for training various machine learning models. But for training datasets, quality is just as important as the quantity of data points, if not more so. That’s why it’s important to add annotations for the specific categories that matter to your model, and that’s also where the process gets expensive.

Chatbots specifically require training data that is in the format of text conversations. For example, you can’t train a chatbot using New York Times articles because those aren’t in conversation format.
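To make the "conversation format" concrete, here is a sketch of one training example: an ordered list of turns with speaker roles, unlike the flat text of a news article. The field names are illustrative, not a standard schema:

```python
# Sketch of the conversation format chatbot training data needs:
# ordered turns with speaker roles (field names are illustrative).
conversation = [
    {"speaker": "user", "text": "Hi, I need to change my delivery address."},
    {"speaker": "bot",  "text": "Sure, what is the new address?"},
    {"speaker": "user", "text": "42 Example Street."},
    {"speaker": "bot",  "text": "Done! Your address has been updated."},
]

# A chatbot model learns to map the dialogue history to the next bot turn.
history = conversation[:3]
target = conversation[3]["text"]
print(len(history), "turns of history ->", target)
```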

The reason that chatbot innovation is stuck at the current stage is a chicken-and-egg problem. Chatbots are low-functioning, so people don’t use them much, so there continues to be a lack of chatbot training data. The key point now is to find a solution for how to obtain more chatbot training data.

Chatbots rely heavily on natural language processing, so they are also constrained by the limitations of NLP. Not even the best-available NLP technology today can mimic human conversations.

People are only willing to use chatbots if they make the interaction easier, not harder. At the current stage of natural language processing technology, chatbots might be able to understand words, but not their meaning in context. For chatbot technology to improve, NLP technology must improve first.


Is natural language processing different for different languages?

In English, we have spaces between words, but other languages like Japanese don’t have spaces. The technology required for audio analysis is the same for English and Japanese. But for text analysis, Japanese requires the extra step of separating each sentence into words before we can annotate the individual words.
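The segmentation step can be sketched with a toy longest-match segmenter over a tiny hand-made dictionary. Real systems use trained morphological analyzers (such as MeCab for Japanese); this is only an illustration of why the extra step exists:

```python
# English tokenization can rely on spaces...
print("the cat sleeps".split())  # → ['the', 'cat', 'sleeps']

# ...but Japanese text has no spaces, so words must be segmented first.
# Toy longest-match segmenter with a tiny hand-made dictionary
# (real systems use trained analyzers such as MeCab).
DICT = {"私", "は", "学生", "です"}

def segment(text: str) -> list:
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in DICT:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character stands alone
            i += 1
    return words

print(segment("私は学生です"))
# → ['私', 'は', '学生', 'です']
```

Only after segmentation can each word be annotated individually, which is the extra cost the article describes.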

The whole process for natural language processing requires building out the proper operations and tools, collecting raw data to be annotated, and hiring both project managers and workers to annotate the data. Here at Lionbridge AI, we have built a community of crowdworkers who are language experts, to turn raw data into clean training datasets for machine learning. A typical task for our crowdworkers involves working with a foreign-language document and tagging which words in that document are people’s names, place names, company names, etc.

For machine learning, training data volume is the key to success. There is a lot more training data available in English than in Japanese.

It’s also true that there are more English speakers worldwide, and English is the main language for most of the top companies doing AI research, such as Amazon, Apple, and Facebook. But what matters most is not which company has the most advanced technology or which language is easiest to learn; the most important factor is the volume of training data that is available.
The volume of training data is equally important for other fields of artificial intelligence, such as computer vision and content categorization. In those fields, too, data quality is a bottleneck to technological advancement.


Lionbridge AI’s role in building high-quality datasets for natural language processing

As a ten-year-old translation company, Lionbridge AI’s strength lies in linguistic tasks. Our 500,000 crowdworkers around the globe can collect, create, clean, and annotate AI training datasets for your machine learning and NLP projects.

Large volumes of data are crucial to the success of a machine learning project, but having clean, high-quality data is just as important. Lionbridge’s 500,000 crowdworkers around the globe can ensure that your NLP training data is tagged accurately. Our crowdworkers are language experts who have passed rigorous testing and ongoing reviews. We score every task they complete, so even after passing the entrance test, their average scores are continually updated. Crowdworkers with high scores are permitted to work on more high-level tasks than crowdworkers with low scores.

Will NLP soon be able to understand language as well as humans? We at Lionbridge AI hope to be a part of that technological revolution!

※This article was originally posted as an interview with Charly Walther, VP of Product and Growth at Lionbridge AI, in the Japanese online media AINOW

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.

