Ivan Vulić is a Senior Research Associate in the Language Technology Lab at the University of Cambridge. His PhD at KU Leuven focused on a range of problems related to cross-lingual NLP and multilinguality. He is one of the most prolific authors currently working in NLP, having contributed to over 50 top publications in the field so far. In industry, Ivan also serves as Senior Scientist at PolyAI, a London-based startup building cutting-edge conversational AI solutions.
Our discussion with Ivan emphasized the importance of developing NLP in multiple languages, while also detailing the progress PolyAI is making towards conversational models that can work in a variety of sectors. For more expert analysis of the field of machine learning, check out the rest of our interview series here.
Lionbridge: How did you become interested in natural language processing?
Ivan: I became involved in NLP way before the field was so popular – that is, before the dawn of deep learning and neural networks. In fact, my start in NLP was a complete coincidence. During the last year of my Master's studies at the University of Zagreb, Croatia, I was investigating options for a PhD and wanted something that would combine mathematics, programming, statistics, and language. Before that point I had no idea that the world of NLP even existed! I was excited to discover that it might be the thing that united all of my interests.
I made a snap decision to apply for a PhD at KU Leuven, moved to Belgium and began my PhD by opening the Jurafsky and Martin textbook at page 1 – this is not a figure of speech! I then gradually improved my NLP skills by reading tons and tons of research papers, as well as all the machine learning and NLP-related books I could find in the first few months. Looking back, I had a pretty non-standard path into the field. I feel lucky that I ended up in an excellent research group with a supervisor who really believed in me, despite the fact that I literally started from ground zero.
L: What are the key areas that your research at Cambridge focuses on?
I: Broadly speaking, I specialize in cross-lingual and multilingual NLP, and all sorts of problems related to it. This has included building universal models of meaning that can enable knowledge transfer across different languages, designing language technology tools for under-resourced languages, and enabling conversational AI in multiple languages. My overarching goal is to teach machines to understand and speak a wide variety of languages.
Different languages come with their own sets of language-specific problems, which is what makes this such a difficult and exciting area of research.
At Cambridge I have been working on some of the fundamental problems of cross-lingual NLP. For example, I’m figuring out how to learn cross-lingual text representations that all live in a shared semantic space and enable cross-lingual transfer, how to do this with limited supervision, and how to do it for languages with different typological properties.
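One standard way to place embeddings from two languages in a shared semantic space is to learn an orthogonal mapping from a small seed dictionary. The sketch below uses the classic Procrustes solution on synthetic vectors; it illustrates the general idea only and does not describe any specific system from this research:

```python
import numpy as np

# Toy illustration of cross-lingual embedding alignment: map
# source-language word vectors into the target language's space with an
# orthogonal transform learned from a seed dictionary of word pairs.
rng = np.random.default_rng(0)

d = 5             # embedding dimensionality
n_pairs = 50      # seed dictionary size

# Hypothetical target-language vectors, plus a hidden rotation that
# produces the source-language vectors (so a perfect mapping exists).
Y = rng.normal(size=(n_pairs, d))             # target-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal map
X = Y @ Q.T                                   # source-language vectors

# Procrustes: the orthogonal W minimising ||XW - Y||_F comes from the
# SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# After mapping, source vectors live in the shared (target) space.
aligned = X @ W
print(np.allclose(aligned, Y))  # True for this noise-free toy example
```

With real embeddings and noisy dictionaries the mapping is only approximate, which is where the harder questions of limited supervision and typological distance come in.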
I’ve also done some work on language modeling, with a focus on morphologically complex languages. While language modeling is becoming increasingly important, I still feel like we’re focusing too much on the English language, so one could say that I’m also trying to raise awareness about other languages. If something works for English with massive amounts of data, it does not mean that it will work equally well for other languages. Therefore, I’m trying to come up with new techniques that will also advance our representation learning methods in other languages.
Further, current mainstream modeling paradigms are all obsessed with trying to learn everything end-to-end from the text data using only data-hungry neural net machinery. However, I feel there’s still invaluable information in other more structured sources created over the years (e.g., WordNet, knowledge bases, dictionaries, thesauri) which we can use to complement our data-driven methods. Another aspect of my work at Cambridge is finding better and better ways to merge that external information into data-driven methods and use these hybrid (or externally informed) models in language understanding applications.
Besides these research threads, I’m also interested in cognitive processing of human language and dialogue – but perhaps that’s a discussion for another time!
L: As someone with an extensive background in academia, what was it about the PolyAI business venture that made you want to get involved?
I: I knew all three co-founders from our Cambridge days, and actively collaborated with PolyAI CEO Nikola Mrkšić on several research papers during the final year of his PhD. When Nikola, Eddy, and Shawn came up with the idea of starting a company, I simply had to get involved. Joining PolyAI meant working with world-class researchers on interesting problems, as well as creating solutions that can make a difference on a global scale. I knew that joining PolyAI would allow me to get first-hand business experience, learn something new every day, and grow professionally. As their first employee, it’s been amazing to witness the growth of the company and to work towards ambitious long-term plans with smart, energetic people.
L: PolyAI aims to build a conversational AI that is ‘sector agnostic’. How do you balance the more general ambitions of such an AI with the specific needs of each particular customer?
I: Our goal is to build versatile tools that can support multiple sectors by minimizing the tools’ adaptation requirements. Of course, building an omnipotent AI tool that can see a new domain and immediately adapt (without telling it how) is an AI-complete problem that’s beyond our reach. However, our approach is still extremely ambitious. It’s also pragmatic and realistic, since it actually aligns quite well with some of the most recent trends in NLP.
In short, our approach to dialogue can be seen as a combination of universal pre-training and domain-specific fine-tuning. The model first learns to converse in the general domain before we further teach it how to specialize for a particular sector. Before the sector-based adaptation, we must still understand and describe the sector. We get this information, as well as the main goals for the model, from our customers. This information is then turned into more explicit ‘semantics’ which define the sector. Put simply, after defining the specs we can instruct our system to adapt to it. However, we can do this only if we have the right data. We collect in-domain data tailored for the particular customer’s needs, which helps us to adjust and fine-tune our universal model to that sector. In other words, we balance between a general/universal model that gets trained on massive amounts of heterogeneous data and its specialization to different sectors by fine-tuning it with much smaller amounts of in-domain data. This approach has been extremely successful for us across diverse sectors.
Alongside our powerful sentence encoders and NLP machinery, the final magic ingredient in our conversational AI pipeline is our novel data collection paradigm. Quite simply, there is no good AI without good data. Our credo at PolyAI is: data first, models and serenity later.
L: Your platform is also available in a variety of different languages. How do you ensure that your solution communicates appropriately for a range of different cultures, rather than simply translating stock answers?
I: Our solution is to come up with a design that will bypass the main obstacles faced by many modern dialogue systems when it comes to data collection and scaling to other languages and domains. Standard systems need dialogue data annotated with a set of low-level dialogue acts, slots and values that come from pre-built task ontologies. These ontologies and annotated datasets are expensive and time-consuming to create, even for simpler domains such as restaurant search or flight booking. On top of this, it is non-trivial to get the data for a variety of languages.
At PolyAI, we’ve learned the hard way how difficult these problems are. There are so many different points at which a conversational system can fail. In particular, we don’t believe that decision-tree style conversational flows and full-blown rule-based approaches are the way to go. These don’t scale well for different tasks and they need to be built from scratch when we want to deploy the system for another language. That sounds like a pretty tedious job!
Instead, our solution is what we call “orchestrating conversations.” The core technology behind the PolyAI platform is a general-purpose conversational search engine, paired with another module we call content programming. First, our conversational search engine was trained on billions of past conversations to learn to resolve context, identify important social cues in dialogues, and select the most plausible response from a pool of possible answers. By framing dialogue as a conversational response task, we don’t need dialogue annotations, complex natural language generation components, or even dedicated decision-making policy modules.
In the second step, we do what we call content programming: feeding the pre-trained response selection engine with in-domain data and adapting to a particular task or purpose. The same principle can be followed for a variety of languages. All we need are large datasets of conversational examples in those languages, and we can get away with fewer data points for training our conversational search engine by relying on crude MT-based translations of large datasets.
We have found that this simple approach works much better in production than a bunch of bleeding-edge methods taken from the most recent academic papers.
The power of our models comes from the data, and all the conversational, contextual, cultural cues contained within it.
We just need to find a way to extract them via our conversational search engine and fine-tune them via content programming.
L: With such complex goals in multiple languages, you must have equally complex training data needs. Are there any particular features of a dataset which indicate that it will be useful for a project of your size and scope?
I: The answer to this question is very much related to what I just explained. When it comes to general-domain pretraining, “the more the merrier” seems to work well. For instance, we have recently released a repository of large conversational datasets that span filtered Reddit data from 2015 to 2018. The dataset is massive: even after filtering it contains more than 727 million context-reply pairs. However, it’s only good for task-unrelated pretraining. If we want to optimize our system for a particular task, we still need in-domain data that is closely related to the task we’re trying to solve. Although the “more the merrier” assumption still holds, this task-specific fine-tuning is done with much smaller datasets. There is always a fine line between data abundance and data quality. Our idea is to adapt our approach to leverage both types of data.
L: As a previous user of Gengo’s services, how did you find that they benefited your research?
I: We used Gengo in one of our academic projects two years ago where we translated a well-known English dialogue dataset, Wizard-of-Oz (WOZ) 2.0, into German and Italian. The whole translation process was very easy to handle and coordinating everything with Gengo was as smooth as it gets. We were very satisfied with the final outcome. I am quite happy to say that our multilingual WOZ dataset is still used today and has inspired some follow-up research on multilingual dialogue.
L: It’s clear to see that PolyAI are big believers in voice technology. What are some of the biggest barriers to the development of this tech, and have you seen any progress towards solving these issues?
I: We believe that it’s possible to build conversational AI tools that can assist humans in various parts of their daily routine. After all, user-friendly, voice-controlled interfaces are something that we already have in our homes with, say, smart TVs or even smart ovens. At PolyAI, we want to go several steps beyond these simple applications and really build conversational systems that can hold conversation for multiple turns.
However, dialogue is a notoriously difficult problem. Even defining the problem isn’t straightforward, since so many subtle factors have to be taken into account during modeling. For instance, the coverage of the conversation is practically infinite, which is a scary statement to begin with. That’s why we’ve chosen to first focus on task-oriented dialogue. Rather than try to build one omnipotent dialogue system that can have an intelligent conversation about life, the universe, and restaurant search, we focus instead on building a multi-domain task-oriented system that is good at multiple useful tasks, but can’t have an in-depth chat about anything and everything.
There are also other, less philosophical obstacles to overcome here. Speech recognition and synthesis is one research area where we need further improvement, especially when dealing with other languages. Even then, what about natural language understanding and generation? What about decision making? There are so many interesting problems to work on at the moment.
L: As a prolific author of NLP papers, are there any particularly exciting research developments that you’ll be keeping an eye on over the next year?
I: I must admit that I am both excited and a little neurotic about the huge influx of new papers on arXiv. It’s great to see that something as fundamental as language modeling is actually making a breakthrough in NLP again. It seems like language models are coming back with a vengeance.
Personally, I would like to see more interesting language modeling and language understanding work for other languages. We should abandon our English-centric bias in NLP and start working with other languages much more. This year, I hope we see an increasing amount of training and evaluation data available for other languages, and more modeling efforts where we combine language-agnostic base architectures with some language-specific knowledge towards language-fine-tuned models. I’d love to see more efforts to develop language-invariant multilingual pretraining processes that mitigate language-specific biases and create a platform for further, language-specific fine-tuning. This might look similar to PolyAI’s approach to conversational response, which combines a coarser search step with a fine-tuning content programming step.
L: Finally, do you have any further advice for anyone looking to build their own NLP model?
I: Apart from a bag full of clichés such as “think outside the box” or “come up with new applications”, one piece of advice is: understand your data requirements and your data. All your intricate modeling efforts are in vain if they cannot learn useful patterns from the data.
I believe that the whole NLP community is still focusing too much on shiny new models instead of coming up with better training and evaluation datasets, as well as better defined tasks.
When the community does focus on data, it often has a big impact. For instance, some datasets that moved the entire field forward are the Penn Treebank, ImageNet in computer vision, 1B words language modeling benchmark, DSTC and (Multi)WOZ for dialogue, Universal Dependencies, etc.
After collecting a good data sample, always ask the same set of questions when building your own NLP model:
- What do I know about the problem I am trying to solve?
- Why is my approach suitable for that particular problem? What are the main assumptions behind my modeling approach?
- What can I learn from prior work and what can I improve?
- Is my approach only an incremental improvement or am I doing something really novel here?
- Am I comparing with strong and meaningful baselines and is the whole evaluation protocol sound and fair?
- How can I improve on the basic prototype/configuration of the model?
When it comes to good coding practices, there was a great tutorial at EMNLP 2018 on coding for NLP from the Allen AI people. I would advise you to take a look.