Multilingual machine learning models require massive amounts of training data to produce intelligible results. For languages other than English, however, it can be difficult to find enough relevant data. Dutch is a case in point: it poses challenges of its own, including unexpected loanwords and complex spelling.
Where’s the best place to look for Dutch datasets to train machine learning systems? We at Lionbridge have gathered a list of publicly available datasets to help you out.
Dutch Text Datasets
Dutch Lexicon Project: A dataset that contains lexical decision data for more than 14,000 Dutch words.
SUBTLEX-NL: A database of Dutch word frequencies based on 44 million words from film and television subtitles.
Leuven Concept Database: Dutch norms for 129 animal names and 166 artifact names in 11 categories.
Dutch AoA and concreteness data: Age-of-acquisition and concreteness ratings for 30,000 Dutch words, available in Excel format.
Dutch Word Knowledge & Prevalence: Word prevalence values for over 50,000 Dutch words gathered from nearly 300,000 participants.
Groningen Twitter Corpus: A Twitter corpus of approximately 2.6 billion Dutch tweets, totaling 28 billion tokens, collected in 2014.
Dutch action norms: A dataset containing Dutch age of acquisition, word frequency and other norms for 124 line drawings.
Delpher Dutch Newspaper Archive (1618-1699): A text dataset comprising over 8,500 Dutch-language newspapers published between 1618 and 1699.
Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment lexicons for 81 languages, including Dutch.
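As a rough illustration of how a word-frequency list such as SUBTLEX-NL might be used once downloaded, the sketch below parses a tab-separated table and converts raw counts to occurrences per million words. The column names `Word` and `Count`, the sample rows, the corpus size, and the function names are all hypothetical; check the actual file layout of whichever dataset you download before adapting it.

```python
import csv
import io


def load_counts(tsv_text, word_col="Word", count_col="Count"):
    """Parse a tab-separated frequency table into a {word: count} dict.

    The column names are assumptions; real datasets may label them differently.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[word_col]: int(row[count_col]) for row in reader}


def per_million(counts, corpus_size):
    """Convert raw counts to occurrences per million words."""
    return {word: count * 1_000_000 / corpus_size for word, count in counts.items()}


# Made-up sample rows for illustration only; these are not real dataset values.
SAMPLE = "Word\tCount\nde\t880\nhuis\t44\nfiets\t11\n"

counts = load_counts(SAMPLE)
freqs = per_million(counts, corpus_size=44_000)  # hypothetical corpus size
```

Normalizing to a per-million scale makes frequencies comparable across corpora of different sizes, which helps when combining several of the lists above.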
Dutch Parallel Text Datasets
Child Language Data Exchange System (CHILDES): A dataset of transcribed and annotated child language for several languages, including Dutch.
CELEX2: This corpus contains ASCII versions of lexical databases in English, Dutch, and German. For each language, the dataset contains detailed information on orthography, phonology, morphology, syntax and word frequency.
ECI Multilingual Text: The first release of the European Corpus Initiative, this multilingual corpus comprises 46 subcorpora in 27 languages, including Dutch. Together, the subcorpora contain roughly 92 million (lexical) words.
Dutch Audio Datasets
Dutch Single Speaker Speech Dataset: A collection of single-speaker speech datasets for 10 languages, including Dutch. Each consists of audio files recorded by a single volunteer, along with their aligned text, sourced from LibriVox.
Spoken Wikipedia Corpora: A corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. The Dutch-language portion contains 3,073 articles read by 145 speakers. There are 224 hours of speech, of which 79 hours are aligned at the word level.
Can’t find what you need? Lionbridge employs 500,000+ qualified linguists, efficient project management and the latest technology to save you time and money while providing high-quality Dutch training data for machine learning use cases.