In a previous article, we discussed the current pace of AI innovation. The shortage of available AI training data is a major blocker to AI innovation today, leaving some AI researchers and companies frustrated. In recent years, some media channels have fueled the hype around AI technology, but so far the technology has not advanced at the speed that hype suggests.
We don’t have enough AI training data because companies often underestimate both the amount of data they need and the time it takes to collect it. The few companies that do invest in data collection often refuse to make their data public, usually because of privacy concerns or fear of losing ground to competitors.
In this article, let’s take a closer look at how a shortage of AI training data can affect tech innovation.
In 2018, data scientists worked hard to build task-specific AI algorithms for applications such as autonomous vehicles and machine translation. At the current stage of AI innovation, most data scientists have not seriously considered versatile AI models that can be used across different industries and end-uses. Even the task-specific AI models that we have now are still in their early stages, and users are often unsatisfied with them.
AI training data for autonomous vehicles
An autonomous vehicle must be competent in a wide range of machine learning processes before it can drive on its own. From processing image and video data in real-time to safely coordinating with other vehicles, the need for AI is widespread and essential. Autonomous vehicles require a huge volume of image-annotated training data.
To improve image recognition algorithms for autonomous vehicles, we need a large, annotated training dataset to serve as the ground truth. Image annotation is a manual, time-intensive task that is often difficult to manage. It requires building out the proper operations and tools, collecting raw data to be labeled, and hiring both project managers and workers to do the image tagging. This is a big reason why we still don’t have enough annotated image data for autonomous vehicles.
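To make the idea of annotated ground truth concrete, here is a minimal sketch of what one labeled image might look like as a data structure, along with a simple quality check. The class names, field names, file path, and the check itself are illustrative assumptions, not a format used by any particular annotation tool or dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """One hand-drawn label: an object class plus pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # A box drawn "backwards" or with zero size has area 0.
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class AnnotatedImage:
    """A raw image paired with the labels an annotator attached to it."""
    image_path: str
    width: int
    height: int
    boxes: List[BoundingBox] = field(default_factory=list)

    def invalid_boxes(self) -> List[BoundingBox]:
        """Return boxes that are empty or extend outside the image frame."""
        return [
            b for b in self.boxes
            if b.area() == 0
            or b.x_min < 0 or b.y_min < 0
            or b.x_max > self.width or b.y_max > self.height
        ]


frame = AnnotatedImage(
    image_path="frames/000123.jpg",  # hypothetical path, for illustration only
    width=1280,
    height=720,
    boxes=[
        BoundingBox("car", 100, 300, 420, 520),
        BoundingBox("pedestrian", 1200, 350, 1300, 600),  # spills past the right edge
    ],
)

bad = frame.invalid_boxes()
print([b.label for b in bad])  # prints ['pedestrian']
```

Even a basic check like this hints at why annotation is hard to manage at scale: every image needs several labels, and every label needs to be reviewed before it can be trusted as ground truth.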
AI training data for machine translation
Machine translation is the translation of text by a computer, with no human involvement. Like most machine learning models, effective machine translation systems require large training datasets to produce intelligible results. But there’s a shortage of this crucial training data in this field too, especially of high-quality, well-segmented data. (There is a lot of low-quality data that cannot be used to train machine translation… For example, a Wikipedia page in two different languages is of little use as machine translation training data because the two pages are not literal translations of each other.) Georgetown researchers gave the first public demonstration of an early machine translation system back in 1954, but we still have not perfected the art of machine translation.
Technological advancements have increased the overall effectiveness of machine translation, but it’s still far from perfect. Many companies are now using machine translation instead of human translation, which raises the question of whether machines could replace human translators.
There are still many challenges in machine translation that only humans can resolve.
Word-sense disambiguation in machine translation
One of the main linguistic issues with machine translation is that machines cannot account for the subtleties in human conversation. While machine translation quality has improved dramatically in recent years, automated translations often result in unnatural phrasing, literal translations, and overall inaccuracies.
Machine translation can have a negative impact on SEO
Search engines like Google and Bing actively look for low-quality content and spam, and they can often recognize computer-generated text. Unedited machine-translated content may be penalized and ranked lower than similar, human-translated content.
Machine translation training data varies by language pair
The quality and amount of available machine translation training data vary widely depending on the languages involved. For example, machine translation between English and German generally produces higher-quality results than between English and Japanese, because more training data is available for the English-German pair.
Machine translation quality varies by language pair
Machine translation systems can translate relatively easily between languages with similar syntactic structures, such as English and German. But machine translation is more difficult when the two languages have very different syntactic structures, such as English and Japanese. With advances in deep learning, this gap between similar and dissimilar language pairs has narrowed greatly. Now, we are mostly left with the above problem of training data for machine translation.
Machine translation is often cheaper and faster than human translation. But machines make mistakes, too. The quality of machine translation is usually much lower than that of human translation. One reason for this discrepancy is a shortage of natural language processing training data. Natural language processing is a subset of artificial intelligence that requires especially large amounts of training data.
At the current stage of natural language processing, machine translation models might recognize the words that humans use, but not necessarily their meaning. Human conversations often cover multiple topics at the same time, with tangents, jokes, and sarcasm all thrown in sporadically. This is hard for a machine to follow and respond to algorithmically. In addition, language evolves over time. We use different words and speak in different styles than our ancestors did hundreds of years ago. That’s why data scientists might never be finished with machine translation innovation. Instead, as new words and phrases become mainstream, data scientists must continue to feed new human-annotated training data to the translation algorithm.
The way to overcome this hurdle in NLP is to increase both the quality and quantity of training data. Machine translation training data comes in the form of parallel corpora: structured sets of texts in one language aligned with their translations in another. You can improve machine translation by continually training the algorithm on new examples of translation between the two languages. Data scientists need large volumes of quality training data to build an artificial intelligence model, for almost any end-use.
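As a rough illustration of what a parallel corpus looks like and why quality matters as much as quantity, the sketch below stores sentence pairs and drops pairs that are unlikely to be real translations of each other. The length-ratio heuristic, the `max_ratio` threshold of 2.0, and all names here are assumptions chosen for illustration. Counting words by whitespace also only makes sense for languages that separate words with spaces, so it would not work as-is for Japanese.

```python
from typing import List, NamedTuple


class SentencePair(NamedTuple):
    """One aligned entry in a parallel corpus."""
    source: str
    target: str


def length_ratio(pair: SentencePair) -> float:
    """Ratio of the longer side to the shorter side, in whitespace-split words."""
    src = len(pair.source.split())
    tgt = len(pair.target.split())
    if src == 0 or tgt == 0:
        return float("inf")  # an empty side can never be a valid translation
    return max(src, tgt) / min(src, tgt)


def filter_pairs(pairs: List[SentencePair], max_ratio: float = 2.0) -> List[SentencePair]:
    """Keep only pairs whose lengths are plausibly those of a translation."""
    return [p for p in pairs if length_ratio(p) <= max_ratio]


corpus = [
    SentencePair("The cat sleeps.", "Die Katze schläft."),
    SentencePair("Good morning.", ""),  # missing translation
    SentencePair(
        "Thank you.",
        "Vielen Dank für Ihre Hilfe bei diesem schwierigen Projekt.",
    ),  # target says far more than the source
]

clean = filter_pairs(corpus)
print(len(clean))  # prints 1
```

A filter like this only catches the most obvious misalignments. Pairs that are similar in length but differ in meaning, such as two independently written Wikipedia pages on the same topic, still need human review.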
As we saw with autonomous vehicles and machine translation, the bottleneck for better AI today is usually better training data, rather than better engineers or algorithms. Machine learning models trained on large, high-quality datasets are more likely to produce accurate results in the real world. In machine translation, for example, the more quality data you feed the algorithm, the better it can adapt to the unique colloquialisms, phrases, and nuances of different languages.