Where’s the best place to look for language datasets to train machine learning translation models? We combed the web to create the ultimate cheat sheet. The following list includes parallel translation data for machine translation training in many different languages, including English, Mandarin Chinese, Vietnamese, and more.
What is Machine Translation?
Machine translation (also called automated, automatic, or instant translation) refers to translations that are created automatically by a computer, with no human involvement. Machine translation can be useful for translators because it allows them to be more productive and deliver translations faster.
Like most machine learning models, effective machine translation (MT) requires massive amounts of training data to produce intelligible results. A parallel text corpus is a structured set of texts in one language paired with their translations in another. Such parallel corpora are essential for training machine translation algorithms.
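As a rough illustration of what "parallel" means in practice, the sketch below assumes a common distribution layout: two plain-text files in which line N of one file is the translation of line N of the other. Not every dataset in this list is packaged that way, and the function name `read_parallel` and the sample sentences are made up for the example.

```python
from io import StringIO
from typing import Iterable, Iterator, Tuple


def read_parallel(src: Iterable[str], tgt: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumes line N of `src` is the translation of line N of `tgt`.
    Pairs where either side is blank are skipped.
    """
    for src_line, tgt_line in zip(src, tgt):
        src_text, tgt_text = src_line.strip(), tgt_line.strip()
        if src_text and tgt_text:
            yield src_text, tgt_text


# Hypothetical sample data standing in for two aligned files on disk.
english = StringIO("The House met at 10 a.m.\n\nThank you, Mr. Speaker.\n")
french = StringIO("La Chambre se réunit à 10 heures.\n\nMerci, monsieur le Président.\n")

pairs = list(read_parallel(english, french))
for en, fr in pairs:
    print(f"{en}\t{fr}")
```

With real data, the two `StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`. Corpora shipped in other formats (for example XML-based ones) need their own parser.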
Parallel Translation Data Sources for Machine Learning
Aligned Hansards of the 36th Parliament of Canada: 1.3 million pairs of aligned text segments in English and French. The text is taken from the official records of the 36th Canadian Parliament.
European Parliament Proceedings Parallel Corpus 1996-2011: Sentence pairs in 21 European languages. All text includes metadata identifying the document, speaker, and paragraph.
Global Voices Parallel Corpus: A selection from the news portal Global Voices, which features the same news articles in 57 different languages. The corpus is updated on a quarterly basis.
Chinese-French Text: A dataset that contains French translations of a subset of approximately 30,000 Chinese characters from Chinese Broadcast News.
Arabizi Text: 522 tweets of mixed English and Arabizi (Arabic chat language) text, intended as training data for the automatic detection of code-switching.
English-Vietnamese Text: A corpus of 500,000 English documents translated by professional translators into Vietnamese. The source texts include books, dictionaries, newspapers, and online news collected between 2000 and 2007.
English-Persian Text: Contains more than 200,000 aligned sentences in English and Persian from the domains of law, literature, science, art, politics and others.
Chinese-English Emails: Contains 15,000 characters in Chinese (equivalent to 10,000 words) from emails, and a reference translation in English.
French-Arabic Newspapers: A corpus of 10,000 words in Arabic and 2 reference translations in French. The source texts are articles collected in May 2013 from the Arabic version of Le Monde Diplomatique.
Pashto-French Text: Consists of the transcription of 106 hours of recordings in Pashto translated into French.
German-English Text: A set of manually aligned datasets in German, English and Turkish.
Turkish-English Text: A Turkish-English parallel corpus for WMT2018.
UN translation text: A collection of translated documents from the United Nations in 6 different languages.
XhosaNavy: South African Navy parallel corpus from Herman Engelbrecht at the Department of E&E Engineering at Stellenbosch University.
Wikipedia: A large corpus of millions of parallel sentences extracted from Wikipedia across 20 languages.
English-Croatian: Parallel document pair candidates in English and Croatian.
Catalan-Spanish: A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.
English-Japanese: About 500,000 pairs of manually translated sentences from Wikipedia’s Kyoto articles in English and Japanese.
OntoNotes: Annotated corpus containing various genres of text – news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows – in English, Chinese, and Arabic.
Chinese Treebank: Contains approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, and broadcast news.
Arabic Broadcast News Transcripts: Contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009.
Parallel Speech Datasets for Machine Learning
RATS language identification: 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments.
Turkish Telephone Speech: Contains approximately 18 hours of telephone speech in Turkish. The data was collected to support research and technology evaluation in automatic language identification.
Central Europe Telephone Speech: Contains approximately 44 hours of annotated telephone speech in Czech and Slovak.
South Asia Telephone Speech: Contains approximately 118 hours of annotated telephone speech in Bengali, Hindi, Punjabi, Tamil and Urdu.
We hope you found this list of parallel translation data sources useful for your projects.
Still can’t find what you need?
For companies seeking to improve their machine translation engines, Lionbridge AI can create translation corpora and other training data across 300+ languages. Our crowd of 500,000+ qualified linguists will deliver the volume you need to build and train an effective machine translation system.