In all areas of machine learning, having large amounts of high-quality data is paramount. The success of machine learning models depends on the availability of highly specialized and structured data. However, it can be difficult to find enough relevant data for languages other than English.
To help, we at Lionbridge have compiled a list of high-quality French datasets that covers a wide spectrum of AI use cases, from sentiment analysis to speech recognition. Let’s jump right in!
French Text Datasets
REDAC: From the French Wikipedia, this dataset contains 262 million words of raw text and part-of-speech-tagged corpora in French.
French Lexicon Project: Contains French lexical decision data for 38,840 French words and 38,840 pseudowords.
Lexique: Database that provides word frequencies, lemmas, phonemic representations, syllabation, and other information for 142,000 words of the French language.
Wacky Corpus: Part-of-speech-tagged corpora with up to 2 billion words for English, French, German and Italian.
French Reddit Discussion: A French dialog corpus that contains a rich collection of over 550K spontaneous written conversations extracted from Reddit’s public dataset.
French news articles: French news articles from the top 10,000 news sites, including 245,308 documents.
French Stopwords: A comprehensive collection of stopwords for the French language in JSON and text formats.
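As a quick illustration of how a stopword list like the one above is typically used, here is a minimal Python sketch that loads stopwords from JSON and filters them out of a sentence. It assumes the JSON file is a flat array of lowercase strings; the inline `SAMPLE_JSON` sample and the helper names `load_stopwords` and `remove_stopwords` are hypothetical stand-ins, not part of any of the datasets listed.

```python
import json
import re

# Assumption: the stopword file is a flat JSON array of strings.
# This tiny inline sample stands in for the real file.
SAMPLE_JSON = '["le", "la", "les", "de", "des", "un", "une", "et", "est", "sur"]'


def load_stopwords(json_text):
    """Parse a JSON array of stopwords into a set for fast lookup."""
    return {word.lower() for word in json.loads(json_text)}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into runs of letters, and drop stopwords."""
    # [^\W\d_]+ matches Unicode letters only, so accented characters are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stopwords]


stopwords = load_stopwords(SAMPLE_JSON)
filtered = remove_stopwords("Le chat est sur la table et le chien dort.", stopwords)
print(filtered)
```

To use a real list, you would read the downloaded JSON file from disk and pass its contents to `load_stopwords` instead of the sample string. Note that this simple tokenizer splits elided forms such as "l'eau" at the apostrophe, which may or may not be what you want.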
French Parallel Text Datasets
Aligned Hansards of the 36th Parliament of Canada: 1.3 million pairs of aligned text chunks in English and French.
Chinese-French Text: A parallel corpus of French translations of approximately 30,000 Chinese characters from Chinese Broadcast News.
French-Arabic Newspapers: A corpus of 10,000 words of Arabic news articles and 2 reference translations in French.
Pashto-French Text: A corpus that contains the transcription of 106 hours of recordings in Pashto translated into French.
Europarl English-French Machine Translation Dataset: Text corpora containing 2 million training and 45K test sentences from 21 languages from the proceedings of the European Parliament between 1996 and 2011.
German-French website parallel corpus: German-French texts extracted from the website of the Federal Foreign Office Berlin.
Spanish-French website parallel corpus: From the EU Open Data Portal, this is a parallel corpus of bilingual texts crawled from multilingual websites.
French Sentiment Analysis Datasets
Datasets for Aspect-Based Sentiment Analysis in French: Contains 457 restaurant reviews and 162 museum reviews for the development and testing of ABSA systems for French. All data is annotated with relevant entities, aspects and polarity values.
Sentiment Lexicons for 81 Languages: Contains both positive and negative sentiment lexicons for 81 languages, including French.
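To give a sense of how positive and negative lexicons like these can be applied, the Python sketch below scores a French sentence by counting lexicon hits. The two mini-lexicons and the `lexicon_score` function are made up for illustration and are far smaller than a real lexicon; this is a simple baseline, not the method used by any of the datasets above.

```python
import re

# Hypothetical mini-lexicons; a real lexicon would contain many more entries.
POSITIVE = {"excellent", "délicieux", "agréable", "parfait"}
NEGATIVE = {"mauvais", "lent", "décevant", "froid"}


def lexicon_score(text, positive, negative):
    """Return (# positive tokens) - (# negative tokens) for the given text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    positive_hits = sum(token in positive for token in tokens)
    negative_hits = sum(token in negative for token in tokens)
    return positive_hits - negative_hits


review = "Le repas était délicieux mais le service était lent et décevant."
print(lexicon_score(review, POSITIVE, NEGATIVE))
```

A score above zero suggests positive sentiment and a score below zero suggests negative sentiment. Counting words this way ignores negation ("pas mauvais") and inflected forms, which is one reason annotated datasets such as the aspect-based reviews above are useful for training stronger models.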
French Audio Datasets
Nijmegen Corpus of Casual French: 35 hours of high-quality recordings featuring 46 French speakers conversing among friends, orthographically annotated by professional transcribers.
French Single Speaker Speech Dataset: CSS10 is a collection of single speaker speech datasets for 10 languages. Each of them consists of audio files recorded by a single volunteer and their aligned text sourced from LibriVox.
Traitement de Corpus Oraux en Français (TCOF): Over 500 transcriptions of 124 hours of spoken French. The corpus is divided into two main categories: adult-child interactions (children up to 7 years old) and recordings of interactions between adults.
VoxForge: Set up to collect transcribed speech for use in Open Source Speech Recognition Engines, VoxForge contains 37.5 hours of oral recordings of texts in French.
Still can’t find what you need? Lionbridge AI provides custom multilingual datasets for over 300 languages. Whether you need hundreds or millions of data points, our 500,000+ certified contributors can ensure your algorithm has a solid ground truth.