One of the common challenges in building a multilingual machine learning model is collecting enough relevant data. To help, we have compiled a list of Chinese language datasets for machine learning. These datasets cover a wide range of use cases, from optical character recognition to sentiment analysis.
- Chinese Treebank: Approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, and various broadcast news sources.
- Mandarin Chinese News Text: A corpus of 250 million Chinese characters of news text from People’s Daily, Xinhua newswire, and China Radio International.
- Tencent AI Lab Embedding Corpus of Chinese Words and Phrases: This corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale, high-quality data.
- Large Scale Chinese Short Text Summarization Dataset: This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text.
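As a rough illustration of how pre-trained embeddings such as the Tencent AI Lab corpus listed above might be used, the sketch below parses a word2vec-style text file and compares two entries by cosine similarity. It assumes the file starts with a `count dim` header line followed by one `token v1 ... vdim` line per entry; that layout, the helper names `load_embeddings` and `cosine`, and the three-dimensional toy vectors are all assumptions for illustration, not details taken from the corpus itself.

```python
import io
import math


def load_embeddings(handle, limit=None):
    """Parse word2vec-style text: a 'count dim' header, then 'token v1 ... vdim' lines."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        # Split from the right so that a token containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy stand-in for the real file (hypothetical 3-dimensional vectors).
sample = io.StringIO(
    "3 3\n"
    "北京 1.0 0.0 0.0\n"
    "上海 0.8 0.6 0.0\n"
    "苹果 0.0 0.0 1.0\n"
)
dim, vectors = load_embeddings(sample)
similarity = cosine(vectors["北京"], vectors["上海"])
```

With a real file, passing an opened UTF-8 file handle and a `limit` keeps memory use manageable, since loading all 8 million entries into plain Python lists would be expensive.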
OCR & Handwriting Datasets
- Chinese Characters: A dataset of handwritten Chinese characters containing 909,818 images, corresponding to about 10 news articles.
- Chinese Characters Generator: Chinese fonts dataset which can be used for Chinese text OCR.
- Text in the Wild: Dataset of Chinese text with about one million Chinese characters annotated by experts in over 30,000 street view images. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes.
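The per-character annotation described for Text in the Wild (underlying character, bounding box, and a set of attributes) could be modeled along the following lines. This is only a sketch: the `CharAnnotation` class, the `(x, y, width, height)` box layout, and the attribute tag names are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharAnnotation:
    text: str                           # the underlying character
    box: tuple                          # assumed layout: (x, y, width, height)
    attributes: frozenset = frozenset() # hypothetical attribute tags


def corners(ann):
    """Convert the assumed (x, y, w, h) box to (left, top, right, bottom)."""
    x, y, w, h = ann.box
    return (x, y, x + w, y + h)


def without_attributes(annotations, excluded):
    """Keep only annotations that carry none of the excluded attribute tags."""
    excluded = frozenset(excluded)
    return [a for a in annotations if not (a.attributes & excluded)]


# Hypothetical sample records for one street view image.
annotations = [
    CharAnnotation("店", (10, 20, 30, 40)),
    CharAnnotation("茶", (50, 20, 30, 40), frozenset({"occluded"})),
]
clean = without_attributes(annotations, {"occluded"})
```

Filtering on attributes in this way is one possible route to building an easier training subset before moving on to harder examples.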
Translation & Parallel Text Datasets
- Chinese-English Emails: Contains 15,000 characters in Chinese (equivalent to 10,000 words) from emails, and a reference translation in English.
- OntoNotes: Annotated corpus containing various genres of text – news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows – in Chinese, English, and Arabic.
- NUS Corpus: This corpus was created for social media text normalization and translation. The researchers randomly selected 2,000 messages from the NUS English SMS corpus and translated them into formal Chinese.
- Chinese-French Text: French translations of a subset of approximately 30,000 Chinese characters from Chinese Broadcast News.
- GALE Phase 1 Chinese Blog Parallel Text: This dataset contains 277 Chinese blog posts translated into English.
Sentiment Analysis Datasets
- Ren-CECps: This dataset includes 1,500 blog posts (11k paragraphs, 35k sentences) with annotations of emotion and sentiment at the document, paragraph, and sentence levels.
- Microblog PCU: From researchers at Xi’an Jiaotong University, this dataset has 50,000 posts from Sina Weibo, along with user metadata such as following-follower information.
Still can’t find what you need? Lionbridge AI provides custom multilingual datasets in 300 languages. Our 500,000 certified contributors can quickly collect, create, and annotate training data for your machine learning model.