13 Free Japanese Language Datasets for Machine Learning

Article by Rei Morikawa | October 14, 2019

We at Lionbridge AI are continuing our article series on machine learning datasets, and in this blog post, we’ll share 13 free Japanese language text datasets for machine learning.

 

Dataset Finders

  • DATA GO JP: The Japanese government’s catalogue site provides public datasets as part of its mission to improve the economy and standard of living for Japanese citizens.
  • National Information Research Data Repository: This site includes datasets that Japan’s National Information Research Group is currently working on, or preparing to work on in the near future.
  • Link Data: Support site where you can convert table data into RDF files and make them public.

 

Japanese Datasets for Natural Language Processing

  • Resources for Natural Language Processing: Datasets provided by Kyoto University. For example, there are annotated datasets of online text literature and of articles from Mainichi Shimbun, a major Japanese newspaper.
  • Aozora Book Collection: Online books in text, XHTML, and HTML form, with the author’s permission. You can also download the dataset on GitHub.
  • Aozora Book Collection Morphological Analysis Data: Dataset of 11,176 pieces from the Aozora Book Collection that have undergone morphological analysis. You can use this for business purposes under a CC license.
  • Kanjivg-radical: Dataset of Japanese kanji and the different parts that make up each kanji. For example, the Japanese kanji [脳] is made up of three different parts: [月] [⺍] [凶]. You can use this dataset to search for Japanese kanji that you don’t know how to read, based on their parts.
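
To give a rough idea of how a parts-based kanji search like the one described for Kanjivg-radical could work, here is a minimal Python sketch. It assumes the data is available as a simple kanji-to-parts mapping; the dictionary below is a hypothetical sample, not the dataset's actual file format, and only the [脳] entry is taken from this article. The other entries and the function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sample in a kanji -> parts layout. The real dataset's file
# names and schema may differ. Only the 脳 entry comes from the article;
# the other two entries are illustrative assumptions.
KANJI_TO_PARTS = {
    "脳": ["月", "⺍", "凶"],
    "胸": ["月", "勹", "凶"],
    "明": ["日", "月"],
}


def build_part_index(kanji_to_parts):
    """Invert a kanji -> parts mapping into a part -> set-of-kanji index."""
    index = defaultdict(set)
    for kanji, parts in kanji_to_parts.items():
        for part in parts:
            index[part].add(kanji)
    return index


def find_kanji(index, parts):
    """Return the kanji that contain every one of the given parts."""
    if not parts:
        return set()
    # Look up each part without creating new index entries, then keep
    # only the kanji that appear in all of the lookups.
    candidate_sets = [index.get(part, set()) for part in parts]
    return set.intersection(*candidate_sets)


PART_INDEX = build_part_index(KANJI_TO_PARTS)

# Search for kanji that contain both 月 and 凶.
print(sorted(find_kanji(PART_INDEX, ["月", "凶"])))
```

Adding more known parts to the query narrows the result, which is why this kind of index is handy for looking up a character you can see but cannot read.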

 

Japanese Parallel Text Datasets

  • Japanese Parallel Text Data: List of language resources that you can use to train a Japanese machine translation system. The list mostly includes resources for Japanese/English translation, but several multilingual resources are also listed at the end.
  • SNOW T15 Japanese Simplified Corpus with Core Vocabulary: The creators took a Japanese/English parallel text corpus and rewrote the Japanese in plain, easy-to-understand Japanese.

 

Japanese Datasets for Sentiment Analysis

 

Other Japanese Language Datasets

Still can’t find what you’re looking for? With 20 years of translation experience and 500,000 qualified translators around the world, language datasets are Lionbridge’s strength. We provide custom datasets in 300 different languages to match the needs of your machine learning project. Contact us to find out how we can support your machine learning project.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Rei was born and raised in Tokyo, and also studied abroad in the US. Rei is a huge people person who is passionate about long-distance running, traveling, and discovering new music on Spotify.
