Article by Alex Nguyen | June 15, 2019

Multilingual machine learning models rely heavily on structured data. However, it can be difficult to find enough data to build models in languages other than English.

We at Lionbridge have put together a list of high-quality Italian text and audio datasets to help. If you missed our previous dataset lists, be sure to check out our 50 best datasets, NLP datasets, and more.


Italian Text Datasets

ISST-CoNLL: a multi-layered annotated corpus of Italian developed through a cooperation between the Dipartimento di Informatica of the University of Pisa and Istituto di Linguistica Computazionale (ILC) of the National Council for Research (CNR).

Wacky Corpus: Syntactically annotated or POS-tagged corpora with up to 2 billion words for English, French, German and Italian from the Wikipedia corpus.

PhonItalia: A dataset that contains phonological representations for 120,000 Italian word forms.

Italian affective norms: A corpus that contains affective norms for 1,121 Italian words.

La Repubblica Corpus: A corpus of 380 million tokens of Italian newspaper texts that has been POS-tagged, lemmatized and categorized by genre.

Atlante Lessicale Toscano (ALT, Lexical Atlas of Tuscany): A dataset combining lexical atlas data with demographic data, meant to serve as a resource for the Tuscan dialects of Italy.

2007 CoNLL Shared Task: Greek, Hungarian & Italian parallel corpus containing dependency treebanks used for multilingual dependency parsing and domain adaptation.
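
Dependency treebanks such as the CoNLL shared task data are typically distributed as plain text with one token per line and a blank line between sentences. The following is a minimal Python sketch for reading such a file, assuming the ten tab-separated columns of the CoNLL-X layout. The function name `read_conllx` and the sample sentence (including its tags and relation labels) are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X layout (ten tab-separated fields per token).
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Hypothetical three-token sentence; tags and labels are illustrative only.
sample = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(sample.splitlines()))
for token in sentences[0]:
    print(token["form"], "->", token["head"], token["deprel"])
```

To read a real file, an open file handle can be passed to `read_conllx` in place of the sample lines. Check the documentation that ships with each corpus first, since column conventions can differ between releases.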


Italian Audio Datasets

CommonVoice: 40 hours of Italian audio featuring 600 voices, packaged in MP3 format. The dataset also includes metadata such as age, gender, and accent that can help improve the accuracy of Italian speech recognition engines.

EMOVO Corpus: An Italian emotional speech database built from the voices of up to 6 actors who performed 14 sentences simulating 6 emotional states (disgust, fear, anger, joy, surprise, sadness).

Multi-SPeaKing-style Articulatory corpus (MSPKA): An Italian corpus of simultaneous recordings of continuous speech and trajectories of important speech articulators (i.e. tongue, lips, incisors) tracked by Electromagnetic Articulography in different speaking styles (e.g. read speech, hyperarticulated speech, hypoarticulated speech).

The Italian Speech Corpus 1: A dataset that contains the recordings of 202 native Italian speakers recorded in an office and a closed public place, in a range of low to medium background noise environments.

Italian Speech Recognition Corpus: Nearly 500 hours of mobile-recorded Italian audio. It contains 377k utterances designed to provide materials for both training and testing of speech recognizers.
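
Speaker metadata of the kind mentioned for CommonVoice (age, gender, accent) is useful for checking how balanced a speech dataset is before training. Below is a minimal Python sketch, assuming the metadata comes as a tab-separated file with a header row. The column names, the helper `count_by_field`, and the sample rows are assumptions made for illustration, so adjust them to the files in the release you download.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def count_by_field(tsv_lines: Iterable[str], field: str) -> Counter:
    """Count clips per value of one metadata column; empty values become 'unknown'."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return Counter((row.get(field) or "unknown") for row in reader)


# Hypothetical metadata rows; real files may use different columns and values.
sample_tsv = (
    "path\tsentence\tage\tgender\taccent\n"
    "clip_0001.mp3\tBuongiorno a tutti.\ttwenties\tfemale\t\n"
    "clip_0002.mp3\tDove si trova la stazione?\tthirties\tmale\t\n"
    "clip_0003.mp3\tGrazie mille.\t\t\t\n"
    "clip_0004.mp3\tA presto.\tforties\tfemale\t\n"
)

gender_counts = count_by_field(io.StringIO(sample_tsv), "gender")
age_counts = count_by_field(io.StringIO(sample_tsv), "age")
print(gender_counts)
print(age_counts)
```

Counting the "unknown" bucket separately makes it easy to see how many clips lack speaker information, which matters because such metadata is often optional.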


