Arabic is spoken by over 420 million people globally, and is one of the six official UN languages. However, despite being one of the largest, most influential world languages, Arabic has received little attention in the realm of natural language processing.
Arabic poses many challenges for computational processing. The Arabic language itself is highly ambiguous, linguistically complex and varied. Because data is central to NLP system development, the difficulty of collecting high-quality Arabic training data is a major roadblock.
We at Lionbridge have compiled a list of the best public Arabic language data for machine learning. The data comes in many forms, including aligned parallel bilingual texts, annotated handwriting and regional audio recordings, and can be applied to a range of machine learning use cases.
Arabic Text Datasets
Quranic Arabic Corpus: An annotated linguistic resource showing Arabic grammar, syntax and morphology for each word in the Quran.
Corpus of Contemporary Arabic (CCA): Created for language teachers, language engineers, and foreign learners of Arabic, this corpus contains about 1M annotated Arabic words.
Arabic Learner Corpus (ALC): A collection of written and spoken materials produced by learners of Arabic in Saudi Arabia. The ALC contains 0.2 million Arabic words produced by 942 students.
Arabizi Text: A collection of mixed English and Arabizi (Arabic written with Latin letters and numerals) text intended for training and testing systems that automatically detect code-switching between the two.
Arabic Poetry Dataset: The dataset contains over 58K poems scraped from adab.com. Poems extend from the 6th century to the present day and include metadata such as poet name and category.
Arabic Parallel Text Datasets
French-Arabic Newspapers: A corpus of 10,000 words in Arabic alongside reference translations in French. The source texts are articles collected in May 2013 from the Arabic version of Le Monde Diplomatique.
OntoNotes: Annotated corpus containing various genres of text in English, Chinese, and Arabic, including news, conversational telephone speech, weblogs, Usenet newsgroups, broadcast, talk shows and more.
Arabic Handwritten Character Datasets for Optical Character Recognition (OCR)
Arabic Handwritten Digits Dataset: Handwritten Arabic digit dataset including 60,000 training images and 10,000 test images written by 700 writers. Each writer wrote each digit ten times.
Arabic Handwritten Characters Dataset: A dataset of 16,800 characters written by 60 participants. Participants were between 19 and 40 years of age, and 90% were right-handed.
Yarmouk Arabic OCR Dataset: From researchers at the American University of Kuwait and Yarmouk University, this dataset contains 8,994 images from 4,587 Arabic articles taken from Wikipedia.
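The figures quoted for the Arabic Handwritten Digits Dataset above can be cross-checked with a little arithmetic. The snippet below is only a sanity check on the numbers in this list, not code from the dataset authors; the constant names are our own, and it assumes the ten digit classes 0 through 9.

```python
# Figures quoted in the Arabic Handwritten Digits Dataset entry.
WRITERS = 700
DIGIT_CLASSES = 10          # assumption: digits 0 through 9
REPETITIONS_PER_DIGIT = 10  # each writer wrote each digit ten times

TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000

# Images implied by the collection procedure vs. images listed in the split.
total_written = WRITERS * DIGIT_CLASSES * REPETITIONS_PER_DIGIT
total_listed = TRAIN_IMAGES + TEST_IMAGES

# Images per digit class implied by the collection procedure.
images_per_digit = WRITERS * REPETITIONS_PER_DIGIT

print(total_written, total_listed, images_per_digit)
```

The two totals agree at 70,000 images, which works out to 7,000 examples per digit.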
Arabic Audio Datasets
RATS Language Identification: A dataset that contains almost 5,400 hours of Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments.
Arabic Broadcast News Transcripts: This dataset contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009.
Arabic Natural Audio Dataset: Eight videos of online Arabic talk shows categorized by speaker and emotion. Silence, laughs and noisy chunks were removed, and all audio is automatically divided into 1-second speech units.
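If you want to apply the same kind of 1-second segmentation to your own recordings, a minimal sketch might look like the following. This is not the dataset authors' pipeline: the function name `split_into_units` is hypothetical, the audio is assumed to be already decoded into a flat sequence of samples, and dropping the trailing partial unit is our own choice.

```python
from typing import List, Sequence


def split_into_units(
    samples: Sequence[float],
    sample_rate: int,
    unit_seconds: float = 1.0,
) -> List[Sequence[float]]:
    """Split decoded audio samples into consecutive fixed-length units.

    Assumption: a trailing chunk shorter than one full unit is dropped.
    """
    unit_len = int(sample_rate * unit_seconds)
    if unit_len <= 0:
        raise ValueError("sample_rate * unit_seconds must be at least 1 sample")
    full_units = len(samples) // unit_len
    return [samples[i * unit_len:(i + 1) * unit_len] for i in range(full_units)]


# Toy example: 3.5 seconds of "audio" at a 10 Hz sample rate.
toy_signal = [float(n) for n in range(35)]
toy_units = split_into_units(toy_signal, sample_rate=10)
print(len(toy_units), [len(unit) for unit in toy_units])
```

In the toy example the last half second is discarded, leaving three full units of ten samples each. Whether to drop, pad, or keep the remainder depends on what your downstream model expects.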
Still can’t find what you need? Lionbridge AI provides custom multilingual datasets for over 300 languages. Whether you need hundreds or millions of data points, our 500,000+ certified contributors can ensure your algorithm has a solid ground truth.