12 Best Turkish Language Datasets for Machine Learning

Article By Alex Nguyen | December 27, 2019

A large amount of AI training data is the key to building multilingual machine learning applications, especially for complex languages such as Turkish. But where is the best place to look for Turkish data? 

To help, we’ve compiled a list of the top open-source Turkish datasets available on the web. Let’s jump right in!

 

Turkish Text Datasets

TS Corpus: The Turkish Corpus Project contains over 491 million tagged tokens. The TS Corpus V2 serves as the main dataset, but the project has also released 10 additional corpora that cover a range of content types, from Turkish social media posts to idioms and proverbs.

Turkish National Corpus (TNC): The TNC is a large-scale, general-purpose Turkish text corpus. It consists of 50 million words of contemporary Turkish.

Bilkent Turkish Writings Dataset: This dataset contains content from Turkish creative writing courses held from 2014 to 2018. All in all, there are nearly 7,000 texts available for download in CSV format.

Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment dictionaries for 81 languages, including Turkish.

Old Newspapers: This corpus contains natural language text from various newspapers, social media posts and blog pages in multiple languages. Overall, the corpus contains nearly 17 million sentences in 67 languages, including Turkish.

English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset: These two datasets consist of automatically categorized and annotated sentences taken from the Turkish and English Wikipedia. They were created for named-entity recognition and text categorization, respectively.

 

Turkish Parallel Corpora

The English-Swedish-Turkish Corpus: Created with the goal of promoting research in the Turkish language, this corpus consists of original texts and their translations from Turkish to Swedish and English. It is organized so that texts, paragraphs, sentences, and words are aligned with each other.

Bianet Corpus: This corpus contains over 3,000 Turkish articles with their sentence-aligned Kurdish or English translations. All content comes from the Bianet online newspaper archives. 

OPUS Parallel Corpora: This set of text corpora contains aligned sentences in 40 languages, including Turkish. As a result, users can check translation sentence pairs for many languages. 
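
For readers new to parallel corpora, here is a minimal Python sketch of what "sentence-aligned" means in practice. It assumes a common layout in which two plain-text files hold one sentence per line, and line i of one file is the translation of line i of the other. The sample sentences and the function name `pair_sentences` are illustrative assumptions, not taken from any of the corpora above.

```python
from typing import Iterable, List, Tuple


def pair_sentences(source: Iterable[str], target: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair line-aligned sentences, skipping pairs where either side is empty."""
    src = [line.strip() for line in source]
    tgt = [line.strip() for line in target]
    if len(src) != len(tgt):
        # Line-aligned data only makes sense if both sides have the same length.
        raise ValueError("aligned inputs must have the same number of lines")
    return [(s, t) for s, t in zip(src, tgt) if s and t]


# Hypothetical sample data standing in for two aligned files.
turkish = ["Merhaba dünya.", "Bugün hava güzel.", ""]
english = ["Hello world.", "The weather is nice today.", ""]

pairs = pair_sentences(turkish, english)
for tr, en in pairs:
    print(f"{tr}\t{en}")
```

With real data, you would pass two open file handles instead of the in-memory lists. Check each corpus's documentation first, since the actual download format may differ.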

 

Turkish Speech Datasets

Spoken Turkish Corpus: This corpus contains eighteen groups of Turkish audio recordings. The data comes from the Radyo ODTÜ archive.

Middle East Technical University Turkish Microphone Speech: This corpus contains aligned text and speech data from 120 speakers aged 19 to 50. The data features an even male/female split, with each person speaking 40 sentences. Overall, the dataset contains approximately 500 minutes of speech.

Turkish Broadcast News Speech and Transcripts: Developed by Boğaziçi University, this dataset contains approximately 130 hours of Turkish radio broadcasts and corresponding transcripts.
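
To get a rough feel for the size of these speech datasets, the quick Python calculation below uses only the figures quoted above. The average utterance length and the size ratio are derived estimates based on those approximate totals, not statistics published by the dataset authors.

```python
# Figures quoted in the descriptions above (durations are approximate).
speakers = 120
sentences_per_speaker = 40
metu_minutes = 500
broadcast_hours = 130

# Total utterances in the microphone speech corpus.
utterances = speakers * sentences_per_speaker

# Estimated average utterance length in seconds.
avg_seconds = metu_minutes * 60 / utterances

# How many times larger the broadcast news dataset is, by duration.
ratio = broadcast_hours * 60 / metu_minutes

print(f"{utterances} utterances, about {avg_seconds:.2f} s each")
print(f"Broadcast news is about {ratio:.1f}x larger by duration")
```

By these estimates, the microphone corpus holds 4,800 short utterances of roughly six seconds each, while the broadcast news dataset offers over fifteen times as much audio.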

 

Still can’t find what you need? We provide custom multilingual datasets for over 300 languages and dialects. Whether you require hundreds or millions of data points, our 1,000,000+ certified contributors can ensure your algorithm has a solid ground truth.

The Author
Alex Nguyen

Alex manages content production for Lionbridge’s marketing team. Originally from San Francisco but based in Tokyo, she loves all things culture and design. When not at Lionbridge, she’s likely brushing up on her Japanese, letting loose at indie electronic shows or trying out new ice cream spots in the city.
