20 Best German Language Datasets for Machine Learning

Article by Alex Nguyen | May 23, 2019

Perhaps one of the most challenging parts of training a multilingual machine learning algorithm is finding enough relevant or specialized data. To help, we at Lionbridge AI have compiled a list of German language datasets that covers a wide spectrum of AI use cases, from sentiment analysis to audio datasets.

If you like what you see, be sure to check out our other datasets for machine learning.

German Text Datasets

Huge German Corpus (HGC): A collection of 12.2 million sentences of German newspaper and law texts. All content has been lemmatized and part-of-speech tagged by TreeTagger.

3 Million German Sentences: 3 million German sentences taken from 2015 newspaper texts. Non-sentences and non-German text have been removed, and information on word frequency is also included.

German Recipes Dataset: 12,190 German recipes taken from chefkoch.de. Each document contains information about ingredients, instructions, creation date and more.

German Political Speeches Corpus: A collection of 21st-century political speeches given by top German representatives from the German Presidency, Ministry of Foreign Affairs, Chancellery, and Presidency of the Bundestag.

NEGRA: A syntactically annotated corpus of German newspaper texts. It is free on request for universities and non-profit organizations, but you need to sign and send a form to obtain the complete dataset.

Digitales Woerterbuch der deutschen Sprache (dlexDB): A lexical database for psychological and linguistic research in German. The dataset contains over 100 million German word tokens.

Ten Thousand German News Articles Dataset: The first German topic classification dataset. It contains 10,273 German language news articles split up into nine classes.

SUBTLEX-DE: Word frequencies of 25.4 million words from film and television subtitles.
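
Several of the corpora above ship with word-frequency information. For a corpus that does not, a rough frequency list can be built in a few lines. The following is a minimal Python sketch, assuming a plain list of sentences; the sample sentences, the regex tokenizer, and the function names are illustrative assumptions and are not part of any dataset listed here.

```python
import re
from collections import Counter

# Illustrative sample (assumption); in practice, read sentences from a corpus file.
sentences = [
    "Der Hund läuft durch den Park.",
    "Die Katze schläft im Park.",
    "Der Park ist heute sehr voll.",
]

def tokenize(sentence):
    """Lowercase a sentence and extract word tokens, keeping umlauts and ß."""
    return re.findall(r"[a-zäöüß]+", sentence.lower())

def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

freqs = word_frequencies(sentences)
print(freqs.most_common(2))
```

Note that lowercasing discards the capitalization of German nouns, so a real pipeline may want to keep case or rely on the lemmas and part-of-speech tags that some of these corpora already provide.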


German Translation & Parallel Text Datasets

Cross-lingual projection of semantic roles: An annotated 1,000-sentence dataset drawn from the English-German EUROPARL parallel corpus.

German-English Text: A manually aligned German-English parallel corpus for word alignment.

Vietnamese German Dataset: A Vietnamese-German dictionary intended for language translation models in deep learning and machine learning, as well as for dictionary applications.
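
Word alignments for a parallel sentence pair are commonly stored as pairs of token indices. The sketch below assumes the widely used `i-j` notation, where `i` indexes a source token and `j` a target token. The sentence pair, the alignment string, and the function names are invented for illustration, and the corpora listed above may use a different file format.

```python
def parse_alignment(alignment):
    """Turn a string like '0-0 1-2' into a list of (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in alignment.split()]

def aligned_words(source, target, alignment):
    """Map index pairs back to the words they connect (whitespace tokenization)."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_alignment(alignment)]

# Hypothetical example pair; not taken from any dataset.
pairs = aligned_words(
    "Das Haus ist klein",
    "The house is small",
    "0-0 1-1 2-2 3-3",
)
for german, english in pairs:
    print(german, "->", english)
```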


German Sentiment Analysis Datasets

SentimentWortschatz (SentiWS): A German sentiment lexicon containing 3,468 German words sorted by sentiment. It lists positive and negative polarity-bearing words along with their part-of-speech tags and inflections (if applicable).

The Potsdam Twitter Sentiment Corpus: A dataset of 7,992 German tweets manually annotated with fine-grained opinion relations. The dataset includes sentiment-relevant elements such as opinion spans, their respective sources and targets, as well as terms with their possible contextual negations and modifiers.

German Emotion Dictionary: A repository of dictionaries for German emotion analysis, covering seven fundamental emotions.

SCARE: A sentiment corpus of German-language Google Play Store app reviews with fine-grained annotations. For each review, the application aspects mentioned (e.g. application design or usability), subjective phrases, and polarity are annotated.

Opinion Compound Dataset: A dataset of roughly 3,000 German compounds that have been annotated with regard to opinion roles.

ANGST German affectiveness ratings: Valence, arousal, and dominance ratings for about one thousand German words.
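
A polarity lexicon such as SentimentWortschatz can drive a simple dictionary-based sentiment score. The sketch below is illustrative only: the line format (`word|POS`, a tab, a weight, a tab, then comma-separated inflections), the sample entries and their weights, and the helper names are all assumptions made for this example, so check the actual files before relying on it.

```python
import re

# Hypothetical lexicon lines; the weights are invented for illustration.
SAMPLE_LEXICON = """\
gut|ADJX\t0.4\tgute,guter,gutes,guten
schlecht|ADJX\t-0.8\tschlechte,schlechter,schlechtes
Freude|NN\t0.6\tFreuden
Katastrophe|NN\t-0.5\tKatastrophen
"""

def parse_lexicon(text):
    """Map every lemma and inflected form (lowercased) to its polarity weight."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        lemma = parts[0].split("|")[0]  # drop the part-of-speech tag
        weight = float(parts[1])
        forms = [lemma]
        if len(parts) > 2 and parts[2]:
            forms.extend(parts[2].split(","))
        for form in forms:
            lexicon[form.lower()] = weight
    return lexicon

def score(sentence, lexicon):
    """Sum the weights of all known words; unknown words contribute 0."""
    tokens = re.findall(r"[a-zäöüß]+", sentence.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)

lexicon = parse_lexicon(SAMPLE_LEXICON)
print(score("Das Essen war gut, aber der Service war schlecht.", lexicon))
```

A plain sum like this ignores negation and intensifiers ("nicht gut" would still score as positive), which is one reason corpora with annotated negations and modifiers, such as the Potsdam Twitter Sentiment Corpus above, are useful for training more robust models.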


German Audio Datasets

Open Speech Data Corpus for German: Audio recordings using several speakers from the LT and the Telecooperation group. The dataset contains roughly 35 hours of speech, featuring about 180 speakers reading sentences from German Wikipedia, protocols from the European Parliament, and individual commands.

Spoken Wikipedia Corpora: A dataset of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. It offers hundreds of hours of aligned audio, with annotations that can be mapped back to the original HTML.

CSS10 German: A single-speaker speech dataset in German, composed of short audio clips from LibriVox audiobooks and their aligned texts.


Still can’t find what you need? Lionbridge AI provides custom multilingual datasets for over 300 languages. Whether you need hundreds or millions of data points, our 500,000+ certified contributors can ensure your algorithm has a solid ground truth. Contact us to learn more about how Lionbridge AI can help.


