15 Best Chatbot Datasets for Machine Learning

Article by Alex Nguyen | July 03, 2019

An effective chatbot requires a massive amount of training data to resolve user inquiries quickly and without human intervention. However, the primary bottleneck in chatbot development is obtaining realistic, task-oriented dialog data to train these machine learning-based systems.

We’ve put together the ultimate list of the best conversational datasets to train a chatbot, broken down into question-answer data, customer support data, dialogue data and multilingual data.

 

Question-Answer Datasets for Chatbot Training

Question-Answer Dataset: This corpus includes Wikipedia articles, manually-generated factoid questions from them, and manually-generated answers to these questions, for use in academic research.

The WikiQA Corpus: A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the true information needs of general users, the authors used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially contains the answer.

Yahoo Language Data: This page features manually curated QA datasets drawn from Yahoo Answers.

TREC QA Collection: TREC has had a question answering track since 1999. In each track, systems were tasked with retrieving small snippets of text that contained an answer to open-domain, closed-class questions.

 

Customer Support Datasets for Chatbot Training

Ubuntu Dialogue Corpus: Consists of almost one million two-person conversations extracted from the Ubuntu chat logs, in which users sought technical support for various Ubuntu-related problems. The full dataset contains 930,000 dialogues and over 100,000,000 words.

Relational Strategies in Customer Service Dataset: A collection of travel-related customer service data from four sources: the conversation logs of three commercial customer service IVAs and the airline forums on TripAdvisor.com during August 2016.

Customer Support on Twitter: This dataset on Kaggle includes over 3 million tweets and replies from the biggest brands on Twitter.

 

Dialogue Datasets for Chatbot Training

Semantic Web Interest Group IRC Chat Logs: These automatically generated IRC chat logs are available in RDF on a daily basis, going back to 2004, and include timestamps and nicknames.

Cornell Movie-Dialogs Corpus: This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.

ConvAI2 Dataset: The dataset contains more than 2,000 dialogues from a PersonaChat competition, in which human evaluators recruited via the crowdsourcing platform Yandex.Toloka chatted with bots submitted by teams.

Santa Barbara Corpus of Spoken American English: This dataset includes approximately 249,000 words of transcription, audio, and timestamps at the level of individual intonation units.

The NPS Chat Corpus: This corpus consists of 10,567 posts out of approximately 500,000 posts gathered from various online chat services in accordance with their terms of service.

Maluuba Goal-Oriented Dialogue: An open dialogue dataset in which the conversation aims at accomplishing a task or making a decision, specifically finding flights and a hotel. The dataset contains complex conversations and decision-making covering 250+ hotels, flights, and destinations.

Multi-Domain Wizard-of-Oz dataset (MultiWOZ): A fully labeled collection of written conversations spanning multiple domains and topics. The dataset contains 10k dialogues and is at least one order of magnitude larger than all previous annotated task-oriented corpora.
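
Dialogue corpora like the ones above are often reshaped into (context, response) pairs before training. The snippet below is only a minimal sketch of that idea in Python. It assumes a dialogue has already been loaded as a plain list of utterance strings; the function name to_context_response_pairs, the max_context parameter, and the sample dialogue are hypothetical and do not reflect the file format of any dataset listed here.

```python
from typing import List, Tuple


def to_context_response_pairs(
    turns: List[str], max_context: int = 3
) -> List[Tuple[List[str], str]]:
    """Pair each utterance with up to `max_context` preceding utterances.

    Hypothetical helper: real corpora ship in their own formats and
    need dataset-specific loading before a step like this one.
    """
    pairs = []
    for i in range(1, len(turns)):
        # The context is a sliding window of the turns just before turn i.
        context = turns[max(0, i - max_context):i]
        pairs.append((context, turns[i]))
    return pairs


# Made-up example dialogue, not taken from any of the datasets above.
dialogue = [
    "I need a flight to Tokyo.",
    "Which city are you leaving from?",
    "San Francisco.",
    "What dates work for you?",
]

pairs = to_context_response_pairs(dialogue, max_context=2)
for context, response in pairs:
    print(context, "->", response)
```

Capping the context window keeps the training inputs short. A suitable window size depends on the model and the corpus, so the value used here is purely illustrative.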

 

Multilingual Chatbot Training Datasets

NUS Corpus: This corpus was created for social media text normalization and translation. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus and translating them into formal Chinese.

EXCITEMENT Datasets: These datasets, available in English and Italian, contain negative feedback from customers in which they state their reasons for dissatisfaction with a given company.

 

Still can’t find the data you need? Lionbridge AI provides custom chatbot training data for machine learning in 300 languages to help make your conversations more interactive and supportive for customers worldwide. Contact us today to learn more about how we can work for you.

The Author
Alex Nguyen

Alex manages content production for Lionbridge’s marketing team. Originally from San Francisco but based in Tokyo, she loves all things culture and design. When not at Lionbridge, she’s likely brushing up on her Japanese, letting loose at indie electronic shows or trying out new ice cream spots in the city.