How to Create Value from Text Data: An Interview with AI Startup Co-founder Federico Pascual

Article by Daniel Smith | March 29, 2019

Federico Pascual is the co-founder and COO of machine-learning startup MonkeyLearn. The San Francisco-based company’s platform enables businesses to bolster their service offerings through a wide range of NLP solutions. With a variety of informative guides on subjects such as text analysis and sentiment analysis, MonkeyLearn’s website is also a great resource for anyone interested in natural language processing.

Our discussion with Federico explored the technology behind these text-mining models, the importance of training data, and the future of NLP in business. For more in-depth discussions with machine-learning experts, you can find the rest of our interview series here.

Lionbridge AI: How did you become interested in natural language processing?

Federico: Completely by accident! Raúl Garreta, the CEO and co-founder of MonkeyLearn, got in touch because he was looking for a business partner. He already had solid experience in machine learning and was looking for someone who would be able to leverage his experience and know-how. I had successfully run and executed digital marketing strategies for a diverse set of companies in the past, so a mutual friend recommended that Raúl reach out to me.

Prior to our first meeting, I didn’t know anything about machine learning. However, as I conducted my research, I found myself growing more and more interested in the potential of NLP – and of this new startup. Machine learning is going to have a significant impact in years to come and MonkeyLearn has a crucial role to play in democratizing the technology. I couldn’t be more excited to be involved.

L: Looking at MonkeyLearn’s use cases, your machine-learning models could incorporate anything from text categorization to entity extraction. What were your goals when creating your software and how did you approach the task from a technical perspective?

F: For the backend, we work predominantly with Python and frameworks like Django, scikit-learn, spaCy and NLTK. We’ve also developed our own implementations for different aspects of our platform, such as data preparation, algorithms, model training and evaluation.

For the front-end we invest a lot in usability. MonkeyLearn’s goal is to create software that’s accessible for everyone. We want anyone to be able to create their own machine learning model for text analysis, even if they have zero knowledge about machine learning, NLP, or even programming.

To this end, we provide an API that users can integrate using SDKs for the most popular programming languages, and we support third-party integrations such as Zapier, Google Sheets, Zendesk and RapidMiner. We hope that people of all technical abilities will be surprised at how easy it is to create and train their own machine learning models when using our software. Thanks to our user interface and active learning capabilities, our customers can train their models in seconds, rather than hours or days, which is a real game-changer for them and their business.

L: Text categorization projects often come with unique guidelines attached. How do you strike the balance between offering a scalable, general service and serving the specific needs of your clients through customization? Is there a technological approach to solving this problem?

F: We offer various pre-trained models for diverse tasks so that our customers can start using machine learning for text analysis from the get-go. For example, our pre-trained model for sentiment analysis delivers state-of-the-art accuracy and can be used for analyzing a wide range of text data.

However, we know that every problem is unique and, to achieve the best results, businesses need to train their own machine learning models. That’s why we developed an easy-to-use interface, so that businesses can train their own classifier or extractor, using their own data, tags and criteria.

In general, customers start out with a pre-trained model and, once they feel confident using MonkeyLearn, gradually move on to their own customized models to get more accurate and granular results.

L: As you build these custom models, you must see a wide variety of training datasets. Where do your customers source their data and what does it usually look like?

F: There are two types of training data our customers use for training models on our platform – internal and external. Internal data is the information that they already have within their company. This could be a database, data collected from emails, surveys, or even customer support tickets. External data is any data that’s been made public by other organizations or that is available on the internet, such as product reviews, social media, or news articles.

We’ve made it simple to import this data to MonkeyLearn. Users can upload training data using various data sources without needing to code. For example, they can upload data in a CSV or Excel file, or they can import data from a range of third-party apps. Alternatively, those who know how to code can import data via MonkeyLearn’s API.
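As a rough illustration of what such a training file might contain, the sketch below reads labeled examples from a CSV using only Python's standard library. The column names `text` and `tag`, the helper `load_examples`, and the sample rows are assumptions made for this example; they are not MonkeyLearn's required format or API.

```python
import csv
import io


def load_examples(csv_file):
    """Read (text, tag) pairs from a CSV file object with 'text' and 'tag' columns."""
    reader = csv.DictReader(csv_file)
    examples = []
    for row in reader:
        text = (row.get("text") or "").strip()
        tag = (row.get("tag") or "").strip()
        # Skip rows missing either field rather than training on incomplete data.
        if text and tag:
            examples.append((text, tag))
    return examples


# Hypothetical sample standing in for an exported file of support tickets.
SAMPLE_CSV = """text,tag
"The app crashes when I upload a file",Bug
"How do I change my billing address?",Billing
"Please add a dark mode",
"""

examples = load_examples(io.StringIO(SAMPLE_CSV))
for text, tag in examples:
    print(f"{tag}: {text}")
```

The third row has no tag, so it is dropped. Untagged texts would need to be labeled before they could serve as training data.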

L: Which practices do you advise your clients to adopt in order to get the best ROI from their training data?

F: I’d go as far as saying that the data used to train a model is more important than the algorithm a model uses for training. Firstly, it’s important that a dataset is highly relevant, so that businesses can gain insights that really matter to them.

In order to extract maximum value from a dataset and build a competent machine-learning model, well-defined tags are absolutely crucial. We advise our clients to train their models using no more than 15 tags so that their annotators can be consistent with their tagging.

The number of texts needed for training a model will depend on various factors, such as the complexity of what needs to be achieved and the number of tags involved. For example, classification models used to detect topics need around 250 examples of text per tag (topic) to obtain a good accuracy level. In comparison, sentiment analysis needs at least 500 examples per tag (sentiment) to produce good results.
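These rules of thumb can be turned into a quick sanity check before training. The sketch below is only one way to do it: the function `check_dataset`, the constant names, and the sample label counts are hypothetical, and the thresholds are the approximate figures quoted in this interview (no more than 15 tags, around 250 examples per topic tag, at least 500 per sentiment tag) rather than hard limits.

```python
from collections import Counter

MAX_TAGS = 15  # upper bound on tags suggested in the interview
MIN_EXAMPLES = {"topic": 250, "sentiment": 500}  # approximate per-tag guidance


def check_dataset(tags, task="topic"):
    """Return warnings for a list of tag labels (one label per training text)."""
    counts = Counter(tags)
    minimum = MIN_EXAMPLES[task]
    warnings = []
    if len(counts) > MAX_TAGS:
        warnings.append(
            f"{len(counts)} tags exceeds the suggested maximum of {MAX_TAGS}"
        )
    for tag, n in sorted(counts.items()):
        if n < minimum:
            warnings.append(f"'{tag}' has {n} examples; about {minimum} are suggested")
    return warnings


# Hypothetical label distribution for a topic classifier.
labels = ["Billing"] * 300 + ["Bug"] * 120 + ["Feature request"] * 250
for warning in check_dataset(labels):
    print(warning)
```

With these made-up counts, only the "Bug" tag falls short of the suggested number of examples.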

Having said that, quality always triumphs over quantity. If you train a model using incorrectly tagged text, it will commit those mistakes to memory and replicate them when fed new text data. As a result, if you don’t have a lot of time to spare for tagging data, we always recommend using a smaller dataset that has been accurately tagged.

L: An increasing number of companies are looking to incorporate natural language processing into their business models. Which areas of a business provide the greatest ROI for a custom text analysis model?

F: Currently, we have customers in various industries using NLP to automate processes, obtain insights and save hours of manual data processing. In our experience, the areas that have the highest ROI are customer support and customer feedback.

In customer support, companies are using NLP to automate a variety of processes, such as tagging and routing tickets to the appropriate team and gaining insights from customer conversations. By doing this, businesses have been able to determine the level of urgency of any given ticket, prioritize tickets accordingly, and prevent customer complaints from spiraling out of control.

NLP can also be used for processing customer feedback. For example, imagine how often a software company sees an NPS answer that mentions something about ‘reliability’ or ‘bug issues’. This type of issue may require a quick fix, and it can be resolved much faster if it is detected early. Machine learning models can detect these types of responses in real time and send them to the right teams, so they’re able to deal with them in the most efficient way possible. As a result, businesses can drastically improve customer experiences and lower their churn rates.

L: What is the next step in the development of text-mining models?

F: The increasing volume of data, the rise in computing power, and advances in deep learning are all contributing to big improvements in the field. The more data is available, the more information we have to train better machine learning models. Deep learning algorithms are also being used to obtain better vector representations for words, such as Word2Vec or GloVe. They’re also improving the accuracy of classifiers trained with traditional machine learning algorithms, which is an exciting development.

In the future, NLP will be able to do so much more than understand sentence structure and determine the content of texts. It will be able to understand the intended meaning of a word, making NLP systems capable of understanding a text like we humans do.

L: Finally, do you have any more advice for people looking to build their own natural language processing models?

F: It’s best to start small. Many companies decide overnight that they want to start using machine learning or NLP, and then want to train as many models as possible to automate processes within their company. However, if a business decides to train lots of models at once, it’s only a matter of time before employees become overwhelmed and frustrated, leading to incomplete and inaccurate models that deliver poor results. This is the opposite of what machine learning is intended for. Instead, it’s better to have one model with a high accuracy rate than lots of models that are mediocre at understanding text.

To start with, focus on one area of interest within your business, and train one or two models to run a low-level analysis. For example, you might want to know how customers feel about a new product or service, which can be achieved by training a sentiment analysis model to split customer feedback into negative and positive. This will allow you to quickly garner insights on prevailing sentiments about your new product. Once you’ve got the hang of creating and training models, you’ll naturally transition towards models that run a more complex analysis, such as aspect-based sentiment analysis.
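To make the idea of a first, simple sentiment model concrete, here is a minimal sketch of a two-tag classifier written in plain Python. It uses a multinomial Naive Bayes with add-one smoothing, chosen here purely for brevity; the interview does not say which algorithm MonkeyLearn uses. The class `NaiveBayes`, the tokenizer, and the four training sentences are all invented for illustration, and a real model would need far more examples, as discussed earlier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Toy multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.tag_counts = Counter()              # tag -> number of texts
        self.vocab = set()
        for text, tag in examples:
            self.tag_counts[tag] += 1
            for word in tokenize(text):
                self.word_counts[tag][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.tag_counts.values())
        scores = {}
        for tag, n in self.tag_counts.items():
            score = math.log(n / total)  # log prior for the tag
            denom = sum(self.word_counts[tag].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[tag][word] + 1) / denom)
            scores[tag] = score
        return max(scores, key=scores.get)


# Hypothetical, hand-written feedback examples (far too few for real use).
training = [
    ("I love the new dashboard, it is fast and easy", "Positive"),
    ("Great update, the app works really well", "Positive"),
    ("The new release is slow and keeps crashing", "Negative"),
    ("Terrible experience, the app is broken", "Negative"),
]

model = NaiveBayes().fit(training)
print(model.predict("the dashboard is great and fast"))
print(model.predict("the app is slow and broken"))
```

The point of the sketch is the workflow rather than the algorithm: a small set of consistently tagged texts goes in, and new feedback comes out sorted into the same two tags.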

My second piece of advice is to focus on the quality of data. It is just as important as, if not more important than, the amount of data you use. Of course, the quantity of data matters, since more training data generally makes a model more accurate. However, there’s no point feeding your model large quantities of poorly tagged data. At MonkeyLearn, we prefer to use 1,000 examples of well-labeled training data over 20,000 examples of mislabeled training data that isn’t representative of the insights you’re trying to obtain. Finally, remember to keep tags to a minimum. A long list of tags is hard enough for us to fathom, let alone a machine!

The Author
Daniel Smith

Daniel writes a variety of content for Lionbridge’s website as part of the marketing team. Born and raised in the UK, he first came to Japan by chance in 2013 and is continually surprised that no one has thrown him out yet. Outside of Lionbridge, he loves to travel, take photos and listen to music that his neighbors really, really hate.
