How to Build a Social AI: An Interview with NLP Researcher Thomas Wolf

Article by Daniel Smith | January 11, 2019

Thomas Wolf is the Chief Science Officer at Huggingface, a chatbot startup aiming to create the first truly social AI. With a PhD in Statistical/Quantum Physics from Pierre and Marie Curie University, as well as a law degree from Paris Sorbonne University, Thomas brings an interdisciplinary approach to his current research in natural language processing. Outside of Huggingface, he can be found discussing all things machine learning on his blog, on Medium, or on Twitter.

Our discussion with Thomas dove into the unique challenges presented by building an AI for social purposes, before moving on to some of the recent, exciting developments in NLP. For more interviews with experts working in machine learning, check out the rest of our interview series here.

Lionbridge AI: How did you first get involved in natural language processing?

Thomas: I’ve been programming since I was 12, but I actually began my career in physics rather than computer science. After graduating, I went to Berkeley to do research on laser-plasma interactions before starting a PhD in Statistical/Quantum physics in Paris. I then switched directions and obtained a law degree from Paris Sorbonne University. I worked as a European Patent Attorney for 6 years, helping a portfolio of startups and big companies to build and defend their intellectual property assets. In 2015/2016, I was advising a lot of deep learning startups, which gave me an introduction to the field of ML/AI. I quickly realized that most of the maths involved actually comes from statistical physics and fell in love with machine learning.

I had probably encountered natural language processing as a teenager, but it was only at this point that I fully dived into the topic. My way of learning is to read a lot of textbooks, starting from the classics and working towards recent publications, so this is what I did with computational linguistics and machine learning. It was particularly helpful to consider the old work in computational linguistics from a modern perspective. For example, I learned a lot through reimplementing H.P. Grice’s work with neural nets. Shortly after this, one of my friends asked me if I wanted to join the startup he had just founded in New York. Now I’m here working in science again and having a lot of fun!

L: Which problems are you focused on solving through your work at Huggingface?

T: We are working on natural language generation and natural language understanding in the context of language generation. We are deeply focused on open-domain conversation and long-term relationships with humans. We want our product to be like a pet dog or cat that could talk!

L: Social messaging includes a range of unique, rapidly evolving language features, such as slang and emojis. How did you account for these during training and did it change your approach to building a chatbot?

T: That was one of the first applied questions I had to tackle when I joined Huggingface and is one of the defining features of our dataset and user base. Our users are mostly millennials who use emojis as a major vehicle for communication. They’re also constantly reinventing English through things like slang and new speech patterns, which means that resources like Urban Dictionary are essential for us. We tackle this problem with a mix of various approaches ranging from rule-based typo correction and acronym expansion to character-level neural network models such as ELMo and, obviously, through training models on our own datasets.
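As a rough illustration of the rule-based end of that spectrum, the sketch below applies typo correction and acronym expansion as simple dictionary lookups. It is not Huggingface's pipeline: the `TYPOS` and `ACRONYMS` tables and the `normalize` function are hypothetical names invented for this example, and a production system would need far larger, curated resources.

```python
import re

# Hypothetical lookup tables, invented for this illustration only.
TYPOS = {"teh": "the", "recieve": "receive"}
ACRONYMS = {"idk": "i do not know", "brb": "be right back", "omg": "oh my god"}


def normalize(message: str) -> str:
    """Lowercase a message, fix known typos and expand known acronyms."""
    # Split into word tokens and single non-word symbols, so that an emoji
    # or punctuation mark survives as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", message.lower())
    normalized = []
    for token in tokens:
        token = TYPOS.get(token, token)        # rule-based typo correction
        token = ACRONYMS.get(token, token)     # acronym expansion
        normalized.append(token)
    return " ".join(normalized)


print(normalize("omg teh movie 😂"))
```

Lookups like these only cover spellings someone has already listed, which is one reason to pair them with character-level models that can handle unseen variants.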

L: Training a social chatbot must require some pretty varied and interesting data. What would your perfect dataset look like?

T: Our perfect dataset is the one we are creating! Although we do use external datasets like Reddit and crowdsourcing providers when we have the need, we now have over 400 million messages in our database so we can train and fine-tune our models on our own dataset. Although it sounds like I’m trying to avoid answering the question, the unique needs of our chatbot mean that we have a dataset that doesn’t really exist elsewhere.


We aren’t trying to replicate human-to-human conversation, since our AI is not faking a human partner. Instead, it’s designed from scratch to be a different kind of intelligence.


We are therefore free from trying to reproduce human behavior and can explore all the directions in which a fun and engaging interaction with a human can be obtained, using all of the amazing developments in AI and ML that have happened over the last few years. By the way, if you doubt that a fun and engaging interaction with a human is possible without copying human intelligence, then you probably never had a pet!

L: Are there any interesting, unexpected challenges that you’ve encountered working on the app?

T: So many! One of the most surprising to me is the kindness that users show to our AI and the desire they have to help it learn. One unexpected challenge is that good engagement metrics can sometimes be correlated with bad behavior from the AI: users engage more precisely because they are trying to correct that behavior and teach the AI how to act properly.

L: What are some exciting research developments and industry use cases that you’ll be keeping an eye on over the next year?

T: Transfer learning in NLP is undergoing a revolution right now and is changing everything about the way people do research and put models into production. This can be seen in areas such as dataset creation, where the SWAG dataset was solved by a transfer learning approach before it was officially released. It’s also present in the way we use and develop algorithms, through pre-training followed by the fine-tuning of bigger and bigger models.
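The pre-train-then-fine-tune pattern can be sketched with a deliberately tiny model. The example below is a toy under stated assumptions, not how large NLP models are actually trained: it uses a two-feature logistic regression in plain Python, and the datasets (`source_task`, `target_task`), the helper names, and all hyperparameters are made up for illustration. The only point it shows is that starting from weights learned on a larger, related task can beat starting from zero when the target task has very few labelled examples.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weights, examples):
    """Average logistic loss of a linear model over (features, label) pairs."""
    total = 0.0
    for features, label in examples:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(examples)


def train(weights, examples, steps, learning_rate):
    """Plain batch gradient descent, starting from the given weights."""
    weights = list(weights)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for features, label in examples:
            error = sigmoid(sum(w * x for w, x in zip(weights, features))) - label
            for i, x in enumerate(features):
                grads[i] += error * x / len(examples)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Made-up "large" source task: twelve labelled points on a grid.
grid = [-2, -1, 1, 2]
source_task = [((a, b), 1 if a + b > 0 else 0)
               for a in grid for b in grid if a + b != 0]

# Made-up related target task with only four labelled examples.
target_task = [((1, 1), 1), ((-1, -1), 0), ((2, -1), 1), ((-2, 1), 0)]

# Pre-train on the source task, then fine-tune briefly on the target task.
pretrained = train([0.0, 0.0], source_task, steps=200, learning_rate=0.5)
fine_tuned = train(pretrained, target_task, steps=5, learning_rate=0.1)

# Baseline: the same short training budget, but starting from zero weights.
from_scratch = train([0.0, 0.0], target_task, steps=5, learning_rate=0.1)

print("fine-tuned loss:  ", mean_log_loss(fine_tuned, target_task))
print("from-scratch loss:", mean_log_loss(from_scratch, target_task))
```

With the same five update steps on the target data, the fine-tuned model ends with a lower loss than the one trained from scratch, because most of what it needs was already learned on the source task.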

L: Do you have any advice for anyone out there who’s looking into building an NLP algorithm?

T: One of the most positive things happening in the current AI revolution is the strong incentive for public and private labs to publish their research and open-source their code base. It really is a game changer for startups and individual researchers. If you’re building an NLP algorithm today, don’t do it on your own! Start from the best open-sourced algorithms in your field of interest and then give back to the community by open-sourcing your improved algorithm and ideas. Execution is what’s important in the end. Building a great product is the key to growing your user base, and open-sourcing your algorithms helps the whole community to grow.

The Author
Daniel Smith

Daniel writes a variety of content for Lionbridge’s website as part of the marketing team. Born and raised in the UK, he first came to Japan by chance in 2013 and is continually surprised that no one has thrown him out yet. Outside of Lionbridge, he loves to travel, take photos and listen to music that his neighbors really, really hate.

