Arguably the biggest roadblock standing in the way of global NLP development is the scarcity of diverse multilingual training data. Diverse text and speech data is imperative to building machine learning models.
For Indian languages in particular, subtle differences in regional accents, diction, enunciation and slang make it difficult to develop accurate NLP applications for the Indian market. In recent years, many open-source datasets have been published to accelerate the development of machine learning for Indian languages. To help you find them, we at Lionbridge have put together a list of the best Hindi language datasets for machine learning.
Hindi Text Datasets
Semantic Relations from Wikipedia: A dataset of semantic relations automatically extracted from the multilingual Wikipedia corpus.
Hindi WordNet: A dataset that contains lexical and semantic relations between Hindi words.
Hindi Health Dataset: Created to help researchers advance their work in the Hindi language, the HHD corpus contains Person-, Disease-, Consumable- and Symptom-related text from Indian websites and published research papers.
HC Corpora (Old Newspapers): A cleaned subset of the HC Corpora newspapers. This version contains 16,806,041 sentences/paragraphs in 67 languages, including Hindi.
Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment lexicons for 81 languages, including Hindi.
Hindi Parallel Text Datasets
Code Mixed (Hindi-English) Dataset: A newspaper directory of UTF-8 encoded text files organized by category (entertainment, sports, technology, business).
IIT Bombay English-Hindi Parallel Corpus: Developed at the Center for Indian Language Technology, this dataset contains an English-Hindi parallel corpus as well as a monolingual Hindi corpus.
HindEnCorp 0.5: This corpus contains parallel text in Hindi and English taken from TED talks, news articles, Wikipedia articles and other sources.
Indic Languages Multilingual Parallel Corpus: A parallel corpus that covers 7 Indic languages in addition to English. It contains 7 language directions, pairing English with Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu.
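For a corpus distributed as a directory of UTF-8 text files grouped by category, like the Code Mixed dataset above, loading might look something like the sketch below. The layout it assumes (one subdirectory per category, each holding `.txt` files) and the function name `load_by_category` are illustrative assumptions, not the documented structure of any dataset listed here, so check the actual download before relying on it.

```python
from pathlib import Path


def load_by_category(root):
    """Return {category: [document text, ...]} for a directory tree.

    Assumed (hypothetical) layout: root/<category>/<file>.txt
    """
    corpus = {}
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # skip loose files such as a README
        texts = []
        for path in sorted(category_dir.glob("*.txt")):
            # Decode explicitly as UTF-8 so Devanagari text is read
            # correctly regardless of the platform's default encoding.
            texts.append(path.read_text(encoding="utf-8"))
        corpus[category_dir.name] = texts
    return corpus
```

Passing `encoding="utf-8"` explicitly matters mostly on systems whose default encoding is something else, where mixed Hindi-English text could otherwise fail to decode or come out garbled.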
Hindi Audio Datasets
Microsoft Speech Corpus (Indian languages): Created by the Microsoft Research Open Data initiative, this corpus contains conversational and phrasal training and test data for Telugu, Tamil and Gujarati.
Hindi Speech Recognition Corpus: Collected in India, this corpus contains the voices of 200 different speakers who vary in gender, age and regional accent. The script covers 100 pairs of spontaneous, everyday conversational exchanges.
Indian Hindi Film Music: A dataset that contains a list of Hindi songs from 1950 to 1990 scraped from the internet. The fields include song title, movie, year of release, music director, song type, singer type, singers and the links to audio and video of the song.
Still can’t find what you’re looking for? With 20 years of experience and 500,000+ qualified native speakers around the world, multilingual datasets are Lionbridge’s strength. We can collect custom text and speech datasets for a variety of use cases to match your training data needs.