14 Best Chinese Language Datasets for Machine Learning

Article by Rei Morikawa | January 22, 2020

One of the main challenges in building multilingual machine learning models is collecting enough relevant data. To help, we at Lionbridge have compiled a list of the best publicly available Chinese language datasets. These datasets cover a wide range of use cases, from handwritten data for Chinese OCR to labeled text data for sentiment analysis.

If you missed our previous language dataset compilations, be sure to check out our other dataset articles. Without further ado, here are the best Chinese data sources for machine learning projects.

 

Best Chinese Datasets for Machine Learning

 

Chinese Text Datasets

 

Chinese OCR & Handwriting Datasets

  • Chinese Characters: A dataset of handwritten Chinese characters containing 909,818 images, corresponding to about 10 news articles.
  • Chinese Characters Generator: This font file can be used to generate Chinese character images for training a Chinese OCR system.
  • Text in the Wild: Built from street view images, this dataset contains about one million Chinese character samples, annotated by experts across more than 30,000 pictures. For each character, the annotation includes the underlying character, its bounding box, and 6 attributes.
Example of Chinese Text in the Wild Dataset
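
To make the per-character annotation described above more concrete, here is a minimal Python sketch of how one such record might be represented and turned into a crop region. The class name `CharAnnotation`, the `(x, y, width, height)` box layout, the placeholder attribute names, and the helper `crop_box` are all assumptions made for illustration; they are not the dataset's actual schema, so check the dataset's own documentation before relying on them.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CharAnnotation:
    """One annotated character instance (hypothetical layout, not the real schema)."""
    text: str                                  # the underlying character
    bbox: Tuple[float, float, float, float]    # assumed (x, y, width, height) in pixels
    attributes: Dict[str, bool] = field(default_factory=dict)  # per-character flags


def crop_box(ann: CharAnnotation, img_w: int, img_h: int) -> Tuple[int, int, int, int]:
    """Return integer (left, top, right, bottom) clipped to the image bounds."""
    x, y, w, h = ann.bbox
    left = max(0, math.floor(x))
    top = max(0, math.floor(y))
    right = min(img_w, math.ceil(x + w))
    bottom = min(img_h, math.ceil(y + h))
    return left, top, right, bottom


# Illustrative values only; the attribute names are placeholders.
ann = CharAnnotation(
    text="店",
    bbox=(1990.4, 10.2, 40.0, 52.5),
    attributes={"attr_1": False, "attr_2": True},
)
print(crop_box(ann, img_w=2048, img_h=2048))
```

Rounding outward and then clipping keeps the whole character inside the crop even when the box has fractional coordinates or runs past the image edge.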

 

Chinese Translation & Parallel Text Datasets

  • Chinese-English Emails: Contains 15,000 characters in Chinese (equivalent to 10,000 words) from emails, and a reference translation in English.
  • OntoNotes: Annotated corpus containing various genres of text – news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows – in Chinese, English, and Arabic.
  • NUS Corpus: This corpus was created for social media text normalization and translation. The researchers randomly selected 2,000 messages from the NUS English SMS corpus and translated them into formal Chinese.
  • Chinese-French Text: This dataset contains French translations of approximately 30,000 characters from Chinese Broadcast News.
  • GALE Phase 1 Chinese Blog Parallel Text: From the Linguistic Data Consortium (LDC), this dataset contains 277 Chinese blog posts translated into English.
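
Several of the corpora above are sized in Chinese characters and pair each source text with a translation. The sketch below shows one way to pair aligned lines and count Chinese characters in plain Python. It assumes the two sides have already been exported as one-sentence-per-line lists, which these datasets do not necessarily provide; the function names `count_han` and `pair_lines` and the sample sentences are hypothetical and for illustration only.

```python
from typing import List, Tuple


def count_han(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

    Punctuation, Latin letters, and rarer extension-block characters are not counted.
    """
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def pair_lines(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair line i of the source with line i of the target.

    Assumes a one-to-one line alignment, which is an assumption of this sketch.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of lines")
    return [(s.strip(), t.strip()) for s, t in zip(source, target)]


# Made-up sample data, not taken from any of the listed corpora.
zh = ["我明天到北京。", "谢谢你的邮件。"]
en = ["I arrive in Beijing tomorrow.", "Thank you for your email."]

pairs = pair_lines(zh, en)
total_chars = sum(count_han(s) for s, _ in pairs)
print(len(pairs), total_chars)
```

Counting only ideographs gives a figure that is closer to how these corpora describe their size than a raw string length, which would also include punctuation and whitespace.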

 

Chinese Sentiment Analysis Datasets

  • Ren-CECps: This dataset includes 1,500 blog posts (11k paragraphs, 35k sentences) with emotion and sentiment annotations at the document, paragraph, and sentence levels.
  • Microblog PCU: From researchers at Xi’an Jiaotong University, this dataset has 50,000 posts from Sina Weibo, along with user metadata such as following-follower information.

 

Still can’t find what you need? Lionbridge AI provides custom multilingual datasets in 300 languages. Our community of over 1 million certified contributors can quickly collect, create, and annotate training data for your machine learning model.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.
