A large amount of AI training data is the key to building multilingual machine learning applications, especially for complex languages such as Turkish. But where is the best place to look for Turkish data?
To help, we’ve compiled a list of the top open-source Turkish datasets available on the web. Let’s jump right in!
Turkish Text Datasets
TS Corpus: The Turkish Corpus Project contains over 491 million tagged tokens. The TS Corpus V2 serves as the main dataset, but the project has also released 10 additional corpora that contain a range of content types, from Turkish social media posts to idioms and proverbs.
Turkish National Corpus (TNC): The TNC is a large-scale, general-purpose Turkish text corpus. It consists of 50 million words of contemporary Turkish.
Bilkent Turkish Writings Dataset: This dataset contains content from Turkish creative writing courses held between 2014 and 2018. All in all, there are nearly 7,000 texts available for download in CSV format.
Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment dictionaries for 81 languages, including Turkish.
Old Newspapers: This corpus contains natural language text from various newspapers, social media posts and blog pages in multiple languages. Overall, the corpus contains nearly 17 million sentences in 67 languages, including Turkish.
English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset: These two datasets consist of automatically categorized and annotated sentences taken from the Turkish and English Wikipedia. They were created for named-entity recognition and text categorization, respectively.
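For datasets distributed as CSV, such as the Bilkent Turkish Writings Dataset, a first look at the data might resemble the sketch below. It is only an illustration: the column names `id` and `text` and the sample rows are assumptions made for this example, not the real schema of any dataset listed here. With a real file, you would open it with `encoding="utf-8"` so that Turkish characters such as ı, ş, and ğ are read correctly.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file. The column
# names "id" and "text" are assumptions, not a real dataset schema.
SAMPLE_CSV = "id,text\n1,Bugün hava çok güzel.\n2,Işık yavaşça söndü.\n"


def load_texts(handle, text_column="text"):
    """Return the non-empty values of one column from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [row[text_column].strip() for row in reader if row[text_column].strip()]


# For a real file, replace io.StringIO(...) with:
#   open(path, encoding="utf-8", newline="")
texts = load_texts(io.StringIO(SAMPLE_CSV))

# Rough size check using whitespace splitting (not a real Turkish tokenizer).
token_count = sum(len(text.split()) for text in texts)
print(len(texts), "texts,", token_count, "whitespace-separated tokens")
```

Whitespace splitting is only a rough size check here. Turkish is agglutinative, so a proper tokenizer or morphological analyzer will usually be needed for real work.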
Turkish Parallel Corpora
The English-Swedish-Turkish Corpus: Created to promote research on the Turkish language, this corpus consists of original texts and their translations from Turkish into Swedish and English. It is organized so that texts, paragraphs, sentences, and words are aligned with each other.
Bianet Corpus: This corpus contains over 3,000 Turkish articles with their sentence-aligned Kurdish or English translations. All content comes from the Bianet online newspaper archives.
OPUS Parallel Corpora: This set of text corpora contains aligned sentences in 40 languages, including Turkish. As a result, users can check translation sentence pairs for many languages.
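Sentence-aligned corpora like the ones above are often consumed as sentence pairs. The sketch below assumes a simple layout of two plain-text files with one sentence per line, where line N in one file is the translation of line N in the other. That layout is an assumption for illustration, as each corpus may ship in its own format, and the function name `read_aligned_pairs` and the sample sentences are hypothetical.

```python
import io


def read_aligned_pairs(source_lines, target_lines):
    """Pair line-aligned sentences, skipping pairs where either side is empty."""
    source = [line.strip() for line in source_lines]
    target = [line.strip() for line in target_lines]
    if len(source) != len(target):
        # A length mismatch means the two sides are not aligned line by line.
        raise ValueError("source and target are not line-aligned")
    return [(src, tgt) for src, tgt in zip(source, target) if src and tgt]


# Hypothetical in-memory stand-ins for a Turkish file and an English file.
turkish = io.StringIO("Merhaba dünya.\nNasılsın?\n")
english = io.StringIO("Hello world.\nHow are you?\n")

pairs = read_aligned_pairs(turkish, english)
for src, tgt in pairs:
    print(src, "->", tgt)
```

Checking the line counts before pairing catches truncated or misaligned downloads early, before they shift every following translation pair by one line.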
Turkish Speech Datasets
Spoken Turkish Corpus: This corpus contains eighteen groups of Turkish audio recordings. The data comes from the Radyo ODTÜ archive.
Middle East Technical University Turkish Microphone Speech: This corpus contains aligned text and speech data from 120 speakers aged 19 to 50. The data features an even male/female split, with each person speaking 40 sentences. Overall, the dataset contains approximately 500 minutes of speech.
Turkish Broadcast News Speech and Transcripts: Developed by Boğaziçi University, this dataset contains approximately 130 hours of Turkish radio broadcasts and corresponding transcripts.
Still can’t find what you need? We provide custom multilingual datasets for over 300 languages and dialects. Whether you require hundreds or millions of data points, our 1,000,000+ certified contributors can ensure your algorithm has a solid ground truth.