The global text analytics market is currently valued at $3.95 billion and is expected to reach a whopping $10.38 billion by 2023. This growth is primarily driven by an increase in the sheer volume of unstructured data as well as technological advances in machine learning. In an open conversation with Carl Hoffman (CEO of Basis Technology) and Charly Walther (VP of Product & Growth at Lionbridge AI), we gain insight into the global text analytics supply chain, how natural language processing differs from other fields of machine learning, and the challenges of maintaining multilingual data quality.
Lionbridge: Please tell us a bit about your background in machine learning, how you got into the current role, and some context about the services you offer.
Carl: Since my time at MIT, I’ve spent my career applying artificial intelligence to human language. That passion led me to found Basis Technology, initially to provide software globalization services to American companies entering Asian markets. Since then, the company has expanded into text analytics, natural language processing, and digital forensics. We now provide AI solutions for human language problems to governments and major companies all over the world.

Charly: Since I started my career as a product manager at Uber, machine learning has always been ubiquitous for me. Predicting demand and supply from riders and drivers is a perfect problem for artificial intelligence. When I joined the self-driving car team, I got to see the vast need for human-labeled data in the form of bounding boxes that mark cars, pedestrians, and other objects in images. That’s when I became interested in offering a similar human-based annotation service for high-quality text and audio data. With Lionbridge AI we have built exactly that: a way to get training data for machine learning applications across many different languages.
L: Basis Technology was founded in 1995, before the hype surrounding machine learning and AI really took off. As one of the pioneers of machine learning, how have you seen the landscape develop?
Carl: Machine learning for text started in the 90s, around the same time as Basis Technology. We first started with simple language identification models, then moved on to entity extraction, name matching, and morphology. Now, we’re building deep learning models for nearly everything language-related.
Since the 90s, natural language processing has gone mainstream. The average person on the street might not know what natural language processing is, but knows Siri and Alexa. In fact, virtual assistants like Alexa are predicted to reach 55 percent of U.S. households by 2022.
The challenge to wider adoption has always been performance: deep learning requires very fast computers or specialized hardware, and the lack of both was long a huge blocker to mainstream machine learning. The main reason for this explosion in natural language processing technology is that we have largely overcome that performance challenge.
L: Lionbridge, on the other hand, started off as a translation company. Could you speak a little bit about Lionbridge’s journey towards machine learning and the AI training data business?
Charly: Lionbridge’s transition to AI training data came about very naturally. If you think about what it takes to run an AI training data business, you need two ingredients. On the one hand, you need the people – a large crowd of committed and skilled contributors and the processes to manage them. On the other hand, you need the technology platform that allows for job distribution, quality management, and project management. Lionbridge AI has spent the last ten years perfecting these two things for the translation business, an area where speed, competitive pricing, and quality are paramount. It was a natural extension to go from translation to curating all kinds of language-related data. Furthermore, we had already seen demand from our customers – the shockwaves of that explosion in machine learning technology that Carl just mentioned. More and more, they wanted our help with data that wasn’t quite translation. They came to us, a translation service, for natural language processing training data because they didn’t know where else to turn. Having previously seen the vast demand for labelled training data in the computer vision field at Uber, I thought there would be a great opportunity in offering a scalable way to acquire data in the field of natural language processing.
L: What does the working relationship between Lionbridge AI and Basis Technology look like? How do your solutions relate to and complement each other? How do you each see the firms’ collaboration developing?
Carl: The relationship between Basis Technology and Lionbridge AI is simple: it’s all about data, and data is the new gold. What distinguishes one natural language processing provider from another is not only the quality of their algorithms, but also the quality of their data. Tools and algorithms eventually reach a kind of parity, barring some breakthrough in technique or technology. So ultimately, what will differentiate one AI-powered natural language processing solution from another is the quality of its training data. If the data was sloppily annotated, or poorly chosen in terms of genre, vocabulary, and variety, then the learning will be of poorer quality. That’s why it’s important to have a partner who we feel confident is delivering training data that has been carefully cross-checked, adheres to tagging guidelines, and who has the initiative to propose changes to those guidelines as we learn more about the issues for a particular language.
L: In your experience, what impact does data quality have on deep learning models? Why is data quality particularly important for multilingual natural language processing applications?
Charly: Generally speaking, while you do need a certain number of data points to get a model off the ground, quality eventually trumps quantity. While in some applications a lower level of accuracy can be sufficient, an accuracy of 80% simply doesn’t cut it in safety-critical applications such as self-driving cars or quality-critical applications such as enterprise translation. Since machine learning very much suffers from “garbage in, garbage out,” it takes highly accurate data to solve that last twenty, five, or one percent.
L: Natural language processing in particular suffers from a variety of data-related issues. First, unlike other fields such as finance or ride-sharing, which produce massive amounts of structured data, there aren’t many natural sources of well-labeled natural language processing data. Take machine translation, for example.
Carl: Of course, quality is in the eye of the consumer; it’s important to get clarification from the user about what constitutes an error and how to count errors. Data quality means a well-annotated training corpus, but it’s also about whether the data is “clean” to start with. No amount of deep learning is going to help you if you run machine learning over a medium-sized dataset that’s corrupted or mangled. Normalizing the text (e.g. transforming meaningless spelling differences such as “color” and “colour” into a single form) and removing issues (e.g. character corruption or mixed languages) improves the end results. These are very simple things, but even big companies might not have the right processes to deal with them. By cleaning data as a first pass, you’ll get a better-trained model and much better results.
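The cleaning pass Carl describes can be sketched as a simple preprocessing function. The spelling-variant map and cleanup rules below are illustrative assumptions, not Basis Technology’s actual pipeline:

```python
import re
import unicodedata

# Illustrative spelling-variant map (an assumption for this sketch,
# not an exhaustive resource)
SPELLING_MAP = {"colour": "color", "organise": "organize", "labelled": "labeled"}

def clean_text(text: str) -> str:
    """Normalize a raw document before it reaches a training corpus."""
    # Repair Unicode normalization issues (e.g. decomposed accents)
    text = unicodedata.normalize("NFC", text)
    # Drop control characters, which often signal corruption
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text).strip()
    # Map meaningless spelling variants to a single canonical form
    words = [SPELLING_MAP.get(w.lower(), w.lower()) for w in text.split(" ")]
    return " ".join(words)

print(clean_text("The colour\x00  of the labelled data"))
# the color of the labeled data
```

A real pipeline would also handle language detection and encoding repair, but even a pass this small removes much of the noise Carl mentions.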
L: Basis Technology builds and maintains AI solutions every day using training data – in your experience, which other parts of the data procurement process have a significant impact on the model you build (and how do you compensate for this)?
Carl: We have to understand that differences of opinion among our human annotators about how to tag data are inevitable. Tagging sentences for sentiment is especially difficult, as people frequently feel (no pun intended) differently about whether a statement is positive or negative. Those differences need to be cross-checked to reach agreement on every tag, which may trigger a discussion or a clarification of the annotation guidelines. Annotation requires a tight feedback loop: agreeing on concepts and vocabulary, giving feedback, and asking questions where there is ambiguity.

The ubiquity of big data is also not strictly true; the vast majority of enterprise data is only “medium data.” With smaller datasets, the signal-to-noise ratio gets worse, so smaller and disparate datasets benefit significantly from a normalization and preprocessing pass. When you machine-learn at a very large scale, it’s much more forgiving, because the preponderance of “good data” renders the small percentage of “bad data” insignificant.
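The annotator disagreement Carl describes is commonly quantified with an agreement statistic such as Cohen’s kappa. This minimal two-annotator sketch (with made-up sentiment labels) is an illustration, not part of either company’s tooling:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators tag the same six sentences for sentiment
ann_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(ann_1, ann_2), 3))  # 0.333
```

A low kappa like this is exactly the signal that would trigger the guideline clarification and discussion Carl mentions.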
L: Both Basis Technology and Lionbridge are particularly focused on natural language applications. What about natural language processing makes it ripe for development in relation to other fields of machine learning?
Charly: It seems to me that in the past couple of years, computer vision has had massive breakthroughs, which natural language processing, despite some advances in machine translation, is still waiting for. I’m not sure whether that’s because deep learning has proven a better fit for image-based problems than text-based problems, or whether that’s due to the availability of data. One possible explanation might be that structuring image data is relatively uniform. No matter the problem, the solution is usually drawing boxes around objects. The more images labeled with bounding boxes we have, the more fodder we have for computer vision problems. Natural language processing, by contrast, requires differently labeled text depending on the problem you are trying to solve. Given the same input sentence, tasks such as machine translation, sentiment analysis, categorization, or entity extraction all require different labels. For example, the sentence “I love my dog” needs to be labeled as “Pets” for categorization, “Positive” for sentiment analysis, and “Ich liebe meinen Hund” for a German translation model. If you now add on top that you need this labeling for every single language, you end up with many separate, and potentially much smaller, pools of training data.
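Charly’s fragmentation point can be made concrete with a toy sketch. The label values come from his own example; the task and language lists are illustrative assumptions:

```python
# One input sentence needs a different label for each NLP task,
# unlike computer vision, where bounding boxes serve most problems.
sentence = "I love my dog"

labels = {
    "categorization": "Pets",
    "sentiment": "Positive",
    "translation_de": "Ich liebe meinen Hund",
}

# A labeled corpus therefore fragments into one pool per (task, language) pair:
tasks = ["categorization", "sentiment", "translation"]
languages = ["en", "de", "ja"]  # illustrative language set
pools = [(task, lang) for task in tasks for lang in languages]

# Just 3 tasks and 3 languages already need 9 separate training-data pools
print(len(pools))  # 9
```

Each pool must be collected and annotated separately, which is why per-task, per-language text data stays scarce even when raw text is abundant.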
Carl: Despite these unique blockers, Basis Technology and Lionbridge are aligned in believing that natural language processing is ripe for development. Massive strides have been made in the field of computer vision, as Charly knows from first-hand experience, but there is very little difference in the availability of raw, untapped data between computer vision and NLP. Just as sensors and cameras are everywhere for computer vision, organizations of all sizes are building very diverse, disparate corpora of text without even realizing it—until someone starts to mine it. Whether it is medical records, résumés, social media traffic, or product inventory, human language is ubiquitous. In the end, it’s really just a matter of how much of it you can use, what you can extract, and what the privacy considerations are. While we’re still waiting for that big breakthrough, all the pieces seem to be in place.
L: How is Basis Technology pushing natural language processing forward?
Carl: Basis Technology is pushing natural language processing forward by providing the largest set of capabilities we can, in as many languages as we can, in a way that is as easy to consume as possible. Our goal is to be the neutral, independent provider whose technology isn’t locked up inside a giant like Google.
L: Where do you see Lionbridge AI’s platform in 5 years?
Charly: My hope is that easier and cheaper access to all kinds of training data will give birth to machine learning applications that we can barely even think of today. All those services that would be extremely useful, yet don’t have a large enough audience or clear enough use case to be ROI-positive at the current cost of acquiring the necessary training data – we at Lionbridge AI would love to play a part in enabling the development of those services. My favorite example is when we collected speech samples of immigrants in Japan who speak broken Japanese for a car manufacturer researching ways to improve the navigation voice-assistant in their cars. Not many companies today can afford to acquire the training data necessary to serve such a niche demographic. This hints at a general problem with AI in that it’s biased towards the masses. Currently, its answers most resemble those of the people who produced most of the training data without considering minorities. With Lionbridge AI, we’re hoping to streamline the process of acquiring data to a point where only your imagination, not your cost of data acquisition, becomes the bottleneck to developing new ways for AI to improve people’s lives.
L: What are some strategies, recommendations, or advice you have for companies looking to improve the quality of their data?
Charly: Before data goes anywhere near your algorithm, make sure it’s properly cleaned, appropriately tagged and highly relevant. It’s also important to have an appropriate volume, since having too few examples will hinder your algorithm’s ability to spot trends and improve its accuracy. Great training data is a rare commodity that takes time to source or create.
It’s worth taking the time to build a solid plan around your training data and source trustworthy partners who share your vision. When you’re sure your data is absolutely aligned with your goals, your model will have every chance of outshining your competitors.
Carl: By quality data, you really mean data that will produce the best-performing model possible. So I’d start by having clearly defined metrics for what constitutes quality or an improvement in your model. Establish the framework and tools for measuring the quality of your results so the process is clear and repeatable. Your gold-standard data, the data against which you will measure the precision, recall, and F-score of your trained model, should be carefully chosen for a balance of genre and vocabulary, double- or triple-annotated for consistency, and then locked well away from the training corpus. Obtaining quality training data involves all the points Charly mentioned, as well as clear documentation (truly understanding the unique issues a language may pose), communication with annotators, and tight feedback loops in the annotation process; together, these prevent the vast majority of misunderstandings (think of the Hubble telescope mirror fiasco over measurements). Producing quality data takes a significant amount of labor and time, but it’s well worth the results.
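The evaluation Carl describes, scoring a trained model against held-out gold-standard data, can be sketched as follows. The entity labels are invented for illustration; this is not Basis Technology’s evaluation code:

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 of predicted labels vs. a gold standard.

    Both arguments are sets of (span, type) pairs.
    """
    tp = len(gold & predicted)  # predictions that exactly match gold
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Gold annotations vs. hypothetical model output for one document
gold = {("Basis Technology", "ORG"), ("MIT", "ORG"), ("Carl", "PER")}
pred = {("Basis Technology", "ORG"), ("MIT", "LOC"), ("Carl", "PER")}

p, r, f1 = precision_recall_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Because the mislabeled ("MIT", "LOC") counts against both precision and recall, the score drops even though the span was found, which is why Carl stresses double- or triple-annotating the gold standard and keeping it locked away from training.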