10 Best Korean Language Datasets for Machine Learning

Article by Alex Nguyen | August 22, 2019

Diverse AI training data is imperative to building multilingual machine learning models, especially for morphologically complex languages like Korean.

Because finding enough relevant data in Korean is difficult, we at Lionbridge have put together a comprehensive list of public Korean datasets for machine learning.


Korean Text Datasets

KAIST Korean Corpus: A collection of Korean corpora available upon request, including 

National Institute of the Korean Language Corpus: This dataset contains frequency information on Korean, which is spoken by 80 million people. For each lemma, both its frequency (the number of times it occurs in the corpus) and its rank relative to other lemmas are provided.

Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment lexicons for 81 languages, including Korean.

Korean Hate Speech Data: Hate speech comments collected from Womad, a Korean radical anti-male website.

Old Newspapers: Created for a language identification task for similar languages, this corpus contains natural language text from various newspapers, blogs and social media posts in multiple languages, including Korean.
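
To illustrate the frequency and rank fields described for the National Institute of the Korean Language Corpus above, here is a minimal Python sketch. It assumes nothing about the corpus's actual file format: the function name `frequency_table`, the tie-handling rule (tied lemmas share a rank) and the tiny sample list are all hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def frequency_table(lemmas):
    """Return (lemma, frequency, rank) rows, most frequent first.

    Lemmas with equal frequency share a rank (1, 2, 2, 4 style).
    This tie rule is an assumption, not the corpus's documented behavior.
    """
    counts = Counter(lemmas)
    # Sort by descending frequency, then by lemma for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    prev_freq = None
    rank = 0
    for position, (lemma, freq) in enumerate(ordered, start=1):
        if freq != prev_freq:
            rank = position
            prev_freq = freq
        rows.append((lemma, freq, rank))
    return rows


# Hypothetical, hand-made lemma list; not taken from the actual corpus.
sample = ["하다", "있다", "하다", "것", "있다", "하다", "되다"]

for lemma, freq, rank in frequency_table(sample):
    print(rank, lemma, freq)
```

A real frequency list would be built from lemmatized text, so a morphological analyzer would normally run before the counting step.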


Korean Parallel Text Corpora

1000 parallel sentences: 1,000 parallel sentences in Korean, English, Japanese, Spanish, and Indonesian. The sentences were created using the most frequent words in Korean.

Korean-English parallel corpus: This dataset contains 700 training and 700 test sentences in Korean and English. The text spans various topics including news articles, short stories, letters and advertisements.


Korean Audio Datasets

Korean Single Speaker Speech Dataset: The KSS Dataset is designed for Korean text-to-speech tasks. It consists of audio files recorded by a professional voice actress, along with aligned text extracted from books.

Zeroth-Korean: This dataset contains 51.6 hours of training data (22,263 utterances, 105 people, 3,000 sentences) and 1.2 hours of test data (457 utterances, 10 people).

Pansori-TEDxKR: This is a Korean speech recognition (ASR) corpus generated from Korean language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech audio-transcript pairs from 41 speakers.
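
For a rough sense of scale, the Zeroth-Korean figures quoted above can be turned into an average utterance length. The Python sketch below uses only the hours and utterance counts stated in this list. The helper name `mean_utterance_seconds` is hypothetical, and the result is a back-of-the-envelope estimate, not a statistic published with the dataset.

```python
def mean_utterance_seconds(total_hours, utterances):
    """Average utterance length in seconds, from total hours and a count."""
    return total_hours * 3600 / utterances


# Figures as quoted in the Zeroth-Korean entry above.
train = mean_utterance_seconds(51.6, 22263)
test = mean_utterance_seconds(1.2, 457)

print(f"train: {train:.1f} s per utterance")
print(f"test: {test:.1f} s per utterance")
```

By this estimate, utterances in both splits average somewhere under ten seconds each.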


Still can’t find what you need? Lionbridge has extensive experience creating custom annotated data in Korean for a wide range of machine learning applications. With 20 years of experience and 500,000+ qualified native speakers around the world, multilingual datasets are Lionbridge’s strength.
