Despite recent advances in machine learning, developing Russian natural language processing (NLP) systems remains a major challenge for researchers. Russian, like other Slavic languages, is morphologically rich, with relatively free word order and a high degree of inflection. These linguistic factors make it difficult to collect enough relevant data to train machine learning models.
To help, we at Lionbridge AI have put together an exhaustive list of the best Russian datasets available on the web, covering everything from social media data to natural speech.
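To make the point about inflection concrete, here is a minimal sketch in Python. The paradigm is written out by hand for illustration (the twelve case and number slots of the noun "книга", meaning "book") and is not taken from any dataset in this list; the names PARADIGM and surface_forms are likewise just illustrative.

```python
# Illustrative only: the 12 case/number slots of the noun "книга" ("book"),
# written out by hand rather than produced by a morphological analyzer.
PARADIGM = {
    ("nominative", "singular"): "книга",
    ("genitive", "singular"): "книги",
    ("dative", "singular"): "книге",
    ("accusative", "singular"): "книгу",
    ("instrumental", "singular"): "книгой",
    ("prepositional", "singular"): "книге",
    ("nominative", "plural"): "книги",
    ("genitive", "plural"): "книг",
    ("dative", "plural"): "книгам",
    ("accusative", "plural"): "книги",
    ("instrumental", "plural"): "книгами",
    ("prepositional", "plural"): "книгах",
}


def surface_forms(paradigm):
    """Return the sorted distinct strings that a single lemma can appear as."""
    return sorted(set(paradigm.values()))


forms = surface_forms(PARADIGM)
print(len(PARADIGM), "grammatical slots ->", len(forms), "distinct surface forms")
```

A single dictionary entry shows up here as nine different strings, so a model that treats each string as a separate word needs far more text to see every form often enough.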
Russian Text Datasets
Russian Common Crawl Data: Over 500 terabytes of raw Russian text data crawled from the web.
OpenCorpora Russian: A tagged corpus of 1.5 million Russian words encoded in UTF-8.
Uppsala Russian Corpus: A one-million-word corpus designed to be as representative and varied as possible. The texts cover 25 different subject areas, such as economics, law, and medicine.
Taiga Corpus: A corpus of 6.5 million Russian tokens, available upon request.
Classification of Handwritten Letters: Over 10,000 images of handwritten Russian characters useful for training text classification and image generation systems.
StimulStat: A lexical database for Russian that lets you select words and word forms by a range of parameters, or retrieve those parameters for a list of words or word forms.
NLP datasets for Russian language: A repository that includes a variety of NLP datasets and resources in Russian, including conversation dialogues, word lists, question-answer data and more.
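Before training on any of the text corpora above, a quick sanity check of token and vocabulary counts can be useful. The sketch below is one rough way to do that in Python, assuming the text is available as a UTF-8 file (as OpenCorpora is described above). The file path argument, the function names, and the simple regular-expression tokenizer are all assumptions made for illustration; they do not reflect the native format or tooling of any corpus listed here.

```python
import re
from collections import Counter

# Runs of Cyrillic letters. "ё" is listed explicitly because it falls
# outside the а-я code point range. Hyphenated words are split in two.
CYRILLIC_WORD = re.compile(r"[а-яё]+")


def count_tokens(text):
    """Lowercase the text and count each Cyrillic word form."""
    return Counter(CYRILLIC_WORD.findall(text.lower()))


def count_tokens_in_file(path):
    """Read a UTF-8 text file (hypothetical path) and count its word forms."""
    with open(path, encoding="utf-8") as handle:
        return count_tokens(handle.read())


sample = "Книга лежит на столе. Я читаю книгу, а ёж спит."
counts = count_tokens(sample)
print(len(counts), "distinct word forms,", sum(counts.values()), "tokens")
```

Note that "книга" and "книгу" are counted as two separate entries here; grouping them under one lemma would require a morphological analyzer, which this sketch deliberately leaves out.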
Russian Social Media Datasets
Russian Twitter Corpus: A corpus of 17.6 million tweets in Russian, suitable for training language models for social media, content moderation, sentiment analysis and more.
Russian Troll Tweets: A dataset containing over 200,000 tweets from malicious accounts, captured by NBC.
SentiRuEval trainset: Over 20,000 Russian tweets tagged with sentiment data.
Russian Audio Datasets
Russian Speech Database: Recorded in 1996-1998, the STC Russian speech database was created to investigate individual speaker variability and validate speaker recognition algorithms.
Russian Single Speaker Speech Dataset: CSS10 is a collection of single speaker speech datasets for 10 languages, including Russian.
M-AILABS Speech Dataset: A large Russian audio dataset, freely usable as training data for speech recognition and speech synthesis.
Russian Open Speech To Text (STT/ASR) Dataset: A dataset containing 4,000+ hours of diverse, cross-domain speech for training speech-to-text models in Russian.
Still can’t find what you’re looking for? Lionbridge has extensive experience creating custom annotated data in Russian for a wide range of machine learning applications. With 20 years of experience and 500,000+ qualified native speakers around the world, multilingual datasets are Lionbridge’s strength.