15 Best OCR & Handwriting Datasets for Machine Learning

Article by Meiryum Ali | March 07, 2019

What is optical character recognition?

Optical character recognition (OCR) is the technology that enables computers to extract text data from images. Once a document (typed, handwritten or printed) undergoes OCR processing, the text data can easily be edited, searched, indexed and retrieved. OCR likely powers many of the services that you use daily. Applications of OCR include automatic data entry for business documents, translation apps, online databases like Google Books, and security cameras that automatically recognize license plates.

 

If you’re interested in learning more about OCR, or looking for training data to develop your own OCR system, we at Lionbridge AI have put together this list of the best OCR and handwriting datasets to help you out.

 

OCR & Handwriting Datasets for Machine Learning

NIST Database: The US National Institute of Standards and Technology (NIST) publishes handwriting samples from 3600 writers, including more than 800,000 character images.

MNIST Database: A subset of the original NIST data, with a training set of 60,000 examples of handwritten digits.

Devanagari Characters: A dataset of handwritten Devanagari characters, composed of 1800 samples from 36 character classes obtained from 25 native writers.

Mathematics Expressions: A dataset of more than 10,000 expressions, covering over 100 mathematical symbols.

Chinese Characters: A dataset of handwritten Chinese characters containing 909,818 images that correspond to about 10 news articles.

Arabic Printed Text: Contains a lexicon of 113,284 words, and uses 10 Arabic fonts.

Document database: Contains 941 online handwritten documents by 189 writers, and covers lists, tables, formulas, diagrams and drawings.

IAM On-Line Handwriting: Contains forms of handwritten English text acquired on a whiteboard, and includes more than 1700 entries.

Street View Text: Harvested from Google Street View, this dataset mostly deals with outdoor street-level signs and boards.

Street View House Numbers: Contains 73,257 digits from house numbers, taken from Google Street View.

Natural Environment OCR: A dataset that contains 659 real-world images with 5238 annotations of text.

Scene Text: Contains 3000 images captured in different environments, including outdoor and indoor scenes under different lighting conditions (clear day, night, strong artificial lights, etc.).

Text Detection: Contains 500 natural images, which are taken using a pocket camera. The indoor images are mainly signs, doorplates and caution plates while the outdoor images are mostly guide boards and billboards.

Stanford OCR: Contains a dataset of handwritten words collected by the MIT Spoken Language Systems Group and published by Stanford.

Chars74K Data: Contains 74K images of both English and Kannada characters.
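Once you have picked a dataset, the first practical step is usually reading its files into memory. As a rough illustration, the original MNIST files are distributed in the simple binary IDX format: a 4-byte header (two zero bytes, a data-type byte, and a dimension count), followed by one big-endian 32-bit size per dimension, followed by the raw pixel or label bytes. The sketch below parses that layout using only the Python standard library. The function names parse_idx and split_images are hypothetical helpers written for this example, not part of any dataset's official tooling, and the code assumes unsigned-byte data, which is what the MNIST image and label files use.

```python
import struct


def parse_idx(data: bytes):
    """Parse an unsigned-byte IDX blob into (dims, payload).

    Hypothetical helper for illustration; assumes the data-type byte is 0x08
    (unsigned byte), as in the MNIST image and label files.
    """
    # Header: two zero bytes, one data-type byte, one dimension-count byte.
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("expected an unsigned-byte IDX file")

    # Each dimension size is a big-endian 32-bit unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]

    # Sanity check: payload length must equal the product of the dimensions.
    expected = 1
    for d in dims:
        expected *= d
    if len(payload) != expected:
        raise ValueError("payload size does not match header dimensions")
    return dims, payload


def split_images(dims, payload):
    """Split a 3-D IDX payload (count, rows, cols) into one bytes object per image."""
    count, rows, cols = dims
    size = rows * cols
    return [payload[i * size:(i + 1) * size] for i in range(count)]


# Synthetic stand-in for a real file: two 2x2 "images" with pixel values 0..7.
fake_file = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, payload = parse_idx(fake_file)
images = split_images(dims, payload)
```

With a real download, you would pass the decompressed file contents to parse_idx instead of the synthetic bytes. For anything beyond a quick inspection, an array library is a better fit than lists of bytes objects.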

 

If you missed our previous dataset articles, be sure to check out The 50 Best Free Datasets for Machine Learning and The Best 25 Datasets for Natural Language Processing.

 

Still can’t find what you need? Reach out to Lionbridge AI. We provide custom AI training data, image tagging, data annotation services and more. We manage the entire process, from designing a custom workflow to sourcing qualified workers for your specific project. Our team also includes over 500,000 qualified native speakers in 300 languages.
