Top 10 Vietnamese Text and Language Datasets

Article By Lucas Scott | December 13, 2019

Open source language data can be difficult to find, especially text data in multiple languages. For those in search of Vietnamese text data, this article introduces ten Vietnamese datasets for machine learning.

The following list includes Vietnamese parallel text datasets, semantic datasets, and lexicons. They can be used for training machine translation models and sentiment analysis models, among other AI applications.

 

Vietnamese Text, Lexicon, and Language Datasets

1. HC Corpora Newspapers – The HC Corpora is a large collection of natural language text taken from newspapers, blogs, and social media. This dataset includes only the text data from the newspapers portion of the HC Corpora, which comes in 67 languages, including Vietnamese.

2. Vietnamese Dictionary for Model Transformation – This Vietnamese dictionary dataset includes an expansive Vietnamese lexicon in CSV format. The dataset includes over 29,000 rows of data in three columns: diacritical marks, no accents, and word meaning. 

3. Sentiment Lexicons for 81 Languages – Intended for building sentiment analysis models, this dataset features positive and negative sentiment lexicons in Vietnamese, as well as in 80 other languages.

4. ViCon and ViSim-400 – ViCon and ViSim-400 are Vietnamese text datasets developed for evaluating semantic models. ViCon contains numerous pairs of synonyms and antonyms, while ViSim-400 provides pairs of words that are semantically related.

5. Na Meo, a Hmong-Mien Vietnamese Language – This dataset explores Na Meo, a Hmong-Mien language spoken in only a few villages in Northern Vietnam. The data includes 400 Na Meo lexical items as well as comparisons with other dialects.

6. Vietnamese Song Corpus – This Vietnamese audio dataset contains 20 Vietnamese songs originally used in the paper “Tone-melody correspondence in Vietnamese popular song” by Kirby and Ladd at the University of Edinburgh. 
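
As a rough illustration of how a sentiment lexicon like the one in item 3 might be used, the sketch below scores a sentence by counting positive and negative lexicon hits. The word lists are tiny hand-written stand-ins, not entries taken from the dataset, and the function name `lexicon_score` is hypothetical. Splitting on whitespace is a simplification: Vietnamese words often span several syllables, so a real pipeline would likely need a word segmenter.

```python
import unicodedata

# Tiny hand-written stand-ins for the positive and negative word lists.
# These are illustrative assumptions, not entries copied from the dataset.
POSITIVE = {"tốt", "đẹp", "vui"}   # good, beautiful, happy
NEGATIVE = {"xấu", "tệ", "buồn"}   # ugly/bad, terrible, sad


def _nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def lexicon_score(sentence: str) -> int:
    """Return (positive hits - negative hits) for a whitespace-split sentence."""
    positive = {_nfc(w) for w in POSITIVE}
    negative = {_nfc(w) for w in NEGATIVE}
    tokens = _nfc(sentence.lower()).split()
    pos_hits = sum(1 for t in tokens if t in positive)
    neg_hits = sum(1 for t in tokens if t in negative)
    return pos_hits - neg_hits


if __name__ == "__main__":
    print(lexicon_score("Phim này rất tốt"))      # one positive hit
    print(lexicon_score("Thời tiết hôm nay tệ"))  # one negative hit
```

A score above zero suggests positive sentiment and below zero negative. Counting hits like this ignores negation and intensity, so it is best treated as a baseline.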

 

Vietnamese Parallel Text Datasets

7. English-Vietnamese Dataset – With over 100,000 rows of data, this text dataset includes English and Vietnamese parallel text. 

8. Japanese-Vietnamese Lexicon – This Japanese and Vietnamese text dataset includes over 54,000 words in CSV format with the following columns: diacritical marks Vietnamese, no accents Vietnamese, and Japanese. The Japanese column includes the Japanese word in hiragana or katakana along with an explanation in Vietnamese.

9. Korean-Vietnamese Dataset – With over 22,000 words in CSV format, this Korean-Vietnamese dataset includes the following columns: diacritical marks Vietnamese, no accents Vietnamese, and Korean.

10. Vietnamese-English-German Dataset – This multi-language text dataset includes over 11,000 Vietnamese words along with their English and German translations. The data is in CSV format with the following columns: diacritical marks Vietnamese, no accents Vietnamese, English, and German. 
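
Several of the lexicons above pair a column of Vietnamese with diacritical marks with a column of the same words without accents. The sketch below shows one way the unaccented form could be derived and used as a lookup key. The CSV header names and the two sample rows are hypothetical stand-ins rather than the datasets' actual headers or contents, and `strip_diacritics` and `load_lexicon` are illustrative helper names.

```python
import csv
import io
import unicodedata

# Hypothetical sample mimicking the described column layout; the header
# names and rows are assumptions, not taken from the actual dataset.
SAMPLE_CSV = """vietnamese,vietnamese_no_accents,english,german
cảm ơn,cam on,thank you,danke
xin chào,xin chao,hello,hallo
"""


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks from a string."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ" and "Đ" are standalone letters with no decomposition, so map them explicitly.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def load_lexicon(csv_text: str) -> dict:
    """Index CSV rows by their unaccented Vietnamese form."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["vietnamese_no_accents"]: row for row in reader}


if __name__ == "__main__":
    lexicon = load_lexicon(SAMPLE_CSV)
    key = strip_diacritics("Cảm ơn").lower()
    print(key, "->", lexicon[key]["english"], "/", lexicon[key]["german"])
```

Keying on the unaccented form lets input typed without diacritics still find an entry, at the cost of collisions between words that differ only in tone marks.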

 

Vietnamese Data Services

We hope that the datasets above helped you find the Vietnamese text data you were looking for. However, if you still haven’t found the right data for your project, contact us to learn how Lionbridge can help you. With a community of over 1 million multilingual workers, we can create custom Vietnamese datasets for multiple use cases. 

 

Vietnamese Data Annotation Services

Lionbridge provides professional annotation services for Vietnamese data.
