22 Best Spanish Language Datasets for Machine Learning

Article by Alex Nguyen | June 05, 2019

One of the most difficult aspects of building a multilingual machine learning model is collecting enough relevant data. To help, we have compiled a list of Spanish language datasets for machine learning that covers a range of use cases, from sentiment analysis to parallel translation corpora.

If you like what you see, be sure to check out our other datasets for machine learning.


Spanish Text Datasets

El Corpus del Español: A 100 million word corpus drawn from over 20,000 Spanish texts spanning the 1200s to the 1900s.

MAS Corpus (Corpus for Marketing Analysis in Spanish): Contains manually tagged Twitter posts in Spanish for marketing purposes. Tags are provided for each tweet to describe three different aspects of the text.

120 Million Word Spanish Corpus: A medium-sized corpus containing 120 million words of modern Spanish taken from the Spanish-language Wikipedia in 2010.

The TV News Archive: Over 705,000 captioned and searchable news programs drawn from over 4 years of U.S. television network broadcasts.

Spanish norms for photographs: 140 color images normed by over one hundred native Spanish speakers on age of acquisition, manipulability, familiarity, and more.

Corpus of noise-induced Spanish misperceptions/confusions: A corpus of 3,235 misperceptions in Spanish. A misperception is included in the corpus only if at least 6 listeners in a group of 15 reported the same response.

Stopword Lists for 19 Languages: Lists of high-frequency words usually removed during NLP analysis for 19 different languages including Spanish.

Pre-trained Word Vectors for Spanish: Over 1 million 300-dimensional word vectors for Spanish trained on the Spanish Billion Words Corpus.
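
As a rough illustration of how the last two resources in this section might be used together, the sketch below removes stopwords from a sentence and compares words by cosine similarity. It assumes the vectors come in the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word and its values per line), which is an assumption rather than something stated on the dataset page. The stopword set and the 3-dimensional vectors are invented for the example, and `remove_stopwords`, `load_vectors` and `cosine` are hypothetical helper names.

```python
import io
import math

# Invented sample data: a real stopword list and real 300-dimensional
# vectors would be read from the downloaded files instead.
STOPWORDS = {"el", "la", "de", "en", "y", "que"}

SAMPLE_VECTORS = """\
3 3
gato 0.9 0.1 0.0
perro 0.8 0.2 0.1
casa 0.0 0.1 0.9
"""


def remove_stopwords(tokens, stopwords):
    """Keep only tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


def load_vectors(handle):
    """Parse the assumed word2vec text layout: header, then 'word v1 v2 ...'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


tokens = "El gato y el perro en la casa".split()
content_words = remove_stopwords(tokens, STOPWORDS)
vectors = load_vectors(io.StringIO(SAMPLE_VECTORS))

print(content_words)
print(cosine(vectors["gato"], vectors["perro"]) > cosine(vectors["gato"], vectors["casa"]))
```

With real files, the `io.StringIO` wrapper would be replaced by `open(path, encoding="utf-8")`. A dedicated library is likely a better choice for a million-word vocabulary, since a plain dictionary of Python lists uses a lot of memory.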


Spanish Translation & Parallel Text Datasets

1000 parallel sentences: Parallel sentences featuring 1,000 frequently used words in Korean, English, Japanese, Spanish, and Indonesian. All sentences have been translated by native speakers of their respective languages.

Catalan-Spanish: A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.

EU Open Data Portal: Access to European Union open data, including Spanish parallel corpora of bilingual texts crawled from multilingual websites.


Spanish Sentiment Analysis Datasets

SAB Corpus (Spanish Corpus for Sentiment Analysis towards Brands): A corpus of tweets in Spanish annotated for sentiment towards brands.

TASS Dataset: A corpus of texts in Spanish tagged for tasks related to sentiment analysis. It is divided into several subsets, each created for the tasks proposed in a different edition over the years.

Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment lexicons for 81 languages including Spanish.
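
To show how positive and negative word lists like these are typically applied, here is a minimal sketch of lexicon-based scoring: count the positive tokens, subtract the negative ones, and map the result to a label. The two small word sets are invented for the example and are not taken from the dataset, and `lexicon_score` and `label` are hypothetical helper names.

```python
import re

# Invented mini-lexicons: the real dataset would supply one positive and
# one negative word list per language.
POSITIVE = {"bueno", "excelente", "feliz"}
NEGATIVE = {"malo", "terrible", "triste"}


def lexicon_score(text, positive, negative):
    """Return the number of positive tokens minus the number of negative tokens."""
    tokens = re.findall(r"\w+", text.lower())
    pos_hits = sum(1 for t in tokens if t in positive)
    neg_hits = sum(1 for t in tokens if t in negative)
    return pos_hits - neg_hits


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in [
    "El servicio fue excelente y el personal muy bueno",
    "La comida estaba terrible",
    "El paquete llegó el martes",
]:
    score = lexicon_score(sentence, POSITIVE, NEGATIVE)
    print(sentence, "->", label(score))
```

A plain count like this ignores negation ("no es bueno") and word inflection, so it is best treated as a baseline to compare against models trained on annotated corpora such as the ones above.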


Spanish Audio Datasets

Spanish Single Speaker Speech Dataset: A collection of single speaker speech datasets for 10 languages. Each of them consists of audio files recorded by a single volunteer and their aligned text sourced from LibriVox.

BACKBONE Pedagogic Corpus of Video-Recorded Interviews: Web-based pedagogic corpora of video-recorded interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English.

Hamburg Corpus of Argentinian Spanish (HaCASpa): 19 hours of spoken Argentinian Spanish containing spontaneous speech and reading tasks.

Catalan in a bilingual context (PhonCAT): 144 hours of spoken Catalan compiled between July 2006 and June 2011, comprising elicited and spontaneous speech data from speakers of Catalan in Barcelona. The data is annotated by speaker age and by the Barcelona district in which each speaker lives.


Still can’t find what you need? Lionbridge AI provides custom multilingual datasets for over 300 languages. Whether you need hundreds or millions of data points, our 500,000+ certified contributors can ensure your algorithm has a solid ground truth. Contact us to learn more about how we can help.


