14 Best Chinese Language Datasets for Machine Learning

Article by Rei Morikawa | October 14, 2019

One common challenge in building a multilingual machine learning model is collecting enough relevant data. To help, we have compiled a list of Chinese language datasets for machine learning. These datasets cover a wide range of use cases, from optical character recognition to sentiment analysis.

 

If you missed the previous language dataset compilations, be sure to check out our Japanese language datasets, German language datasets, and more. 

 

Text Datasets

 

OCR & Handwriting Datasets

  • Chinese Characters: A dataset of handwritten Chinese characters containing 909,818 images that correspond to about 10 news articles.
  • Chinese Characters Generator: A Chinese fonts dataset that can be used for Chinese text OCR.
  • Text in the Wild: Dataset of Chinese text with about one million Chinese characters annotated by experts in over 30,000 street view images. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes.
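
For readers planning to work with a character-level dataset like Text in the Wild, the annotation described above (underlying character, bounding box, and attributes) can be modeled roughly as follows. This is only a minimal Python sketch: the class name, field names, bounding-box layout, and the placeholder attribute label are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CharAnnotation:
    """One annotated character (hypothetical schema, not the dataset's real format)."""
    text: str                                 # the underlying character
    bbox: Tuple[float, float, float, float]   # assumed layout: (x, y, width, height) in pixels
    attributes: List[str] = field(default_factory=list)  # attribute flags that apply

    def area(self) -> float:
        """Return the bounding-box area in square pixels."""
        _, _, width, height = self.bbox
        return width * height


def filter_by_min_area(annotations: List[CharAnnotation], min_area: float) -> List[CharAnnotation]:
    """Keep only characters whose bounding box is at least min_area square pixels."""
    return [a for a in annotations if a.area() >= min_area]


# Two made-up annotations, for illustration only.
sample = [
    CharAnnotation("中", (10.0, 20.0, 32.0, 32.0), ["attr_a"]),  # "attr_a" is a placeholder label
    CharAnnotation("文", (50.0, 20.0, 8.0, 8.0)),
]
large = filter_by_min_area(sample, min_area=100.0)
print([a.text for a in large])
```

Filtering out very small characters like this is one common preprocessing step before training an OCR model, since tiny boxes in street view images are often illegible.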

 

Translation & Parallel Text Datasets

  • Chinese-English Emails: Contains 15,000 Chinese characters (equivalent to 10,000 words) from emails, along with a reference translation in English.
  • OntoNotes: Annotated corpus containing various genres of text – news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows – in Chinese, English, and Arabic.
  • NUS Corpus: This corpus was created for social media text normalization and translation. The researchers randomly selected 2,000 messages from the NUS English SMS corpus and translated them into formal Chinese.
  • Chinese-French Text: French translations of a subset of approximately 30,000 Chinese characters from Chinese Broadcast News.
  • GALE Phase 1 Chinese Blog Parallel Text: This dataset contains 277 Chinese blog posts translated into English.

 

Sentiment Analysis Datasets

  • Ren-CECps: This dataset includes 1,500 blog posts (11k paragraphs, 35k sentences) with annotations of emotion and sentiment at the document, paragraph, and sentence levels.
  • Microblog PCU: From researchers at Xi’an Jiaotong University, this dataset has 50,000 posts from Sina Weibo, along with user metadata such as following-follower information.
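
Multi-level annotation of the kind described for Ren-CECps (labels at the document, paragraph, and sentence levels) maps naturally onto a nested structure. The Python sketch below is one possible way to hold such data in memory. The class names, the single `label` field, and the label values are assumptions made for illustration, not the corpus's real schema or label set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    text: str
    label: str  # hypothetical sentence-level label


@dataclass
class Paragraph:
    label: str  # hypothetical paragraph-level label
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Document:
    label: str  # hypothetical document-level label
    paragraphs: List[Paragraph] = field(default_factory=list)

    def sentence_labels(self) -> List[str]:
        """Flatten the hierarchy into a list of sentence-level labels, in order."""
        return [s.label for p in self.paragraphs for s in p.sentences]


# A made-up two-paragraph blog post, for illustration only.
doc = Document(
    label="positive",
    paragraphs=[
        Paragraph("positive", [Sentence("今天天气很好。", "positive")]),
        Paragraph("negative", [Sentence("可是我迟到了。", "negative")]),
    ],
)
print(doc.sentence_labels())
```

Keeping all three levels in one object makes it easy to train a sentence-level classifier while still checking its predictions against the paragraph and document labels.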

 

Still can’t find what you need? Lionbridge AI provides custom multilingual datasets in 300 languages. Our 500,000 certified contributors can quickly collect, create, and annotate training data for your machine learning model.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Rei was born and raised in Tokyo and also studied abroad in the US. A huge people person, Rei is passionate about long-distance running, traveling, and discovering new music on Spotify.
