What is Optical Character Recognition (OCR)?

Article by Rei Morikawa | June 17, 2019

Optical character recognition (OCR), also called text recognition, is the technology that converts images to text so that computers can extract text data from image files. OCR technology classifies optical patterns in digital images, based on how they correspond to alphanumeric characters.

OCR can be a huge productivity shortcut for students, researchers, and entrepreneurs who deal with a lot of documents. Once you process a document with OCR technology, you can easily edit, search, index, and retrieve the text data. You can also compress the document into zip files, highlight keywords, or incorporate it into a website.

 

How does optical character recognition (OCR) work?

OCR works by examining a physical document and translating the characters into code that can be used for data processing. The basic steps are image acquisition, preprocessing, segmentation, feature extraction, classification, and post-processing.

You’ll need to preprocess the image data thoroughly before feeding it into the model. Preprocessing tasks include thresholding (converting a color or grayscale image into a binary image), normalization, and noise reduction. You can use techniques such as morphological operations to connect disconnected pixels, remove isolated pixels, and smooth pixel boundaries.
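To make two of these preprocessing steps concrete, here is a minimal sketch in plain Python. It assumes the image is a small grid of grayscale values from 0 (black) to 255 (white). The function names `threshold` and `remove_isolated_pixels` and the fixed cutoff of 128 are illustrative assumptions, not part of any particular OCR product; real systems often choose the cutoff adaptively.

```python
def threshold(gray, cutoff=128):
    """Convert a grayscale image (rows of 0-255 ints) to binary.

    1 marks dark ink, 0 marks light background. The fixed cutoff
    is an assumption made for this sketch.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def remove_isolated_pixels(binary):
    """Noise reduction: clear ink pixels with no ink among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            has_neighbour = any(
                binary[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# A tiny 5x5 "scan": a dark 2x2 blob plus one stray dark speck.
gray = [
    [250, 250, 250, 250, 250],
    [250,  30,  40, 250, 250],
    [250,  35,  20, 250, 250],
    [250, 250, 250, 250, 250],
    [250, 250, 250, 250,  10],
]

binary = threshold(gray)
cleaned = remove_isolated_pixels(binary)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

The stray speck in the bottom-right corner survives thresholding but is dropped by the noise-reduction pass, while the 2x2 blob is kept because each of its pixels has ink neighbours.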

At the beginning of an OCR project, you’ll scan the physical documents and have the OCR software convert the scans to binary (black-and-white) images. Then, the computer analyzes the scanned images for light and dark areas. It identifies light areas as background and dark areas as written characters that need to be recognized.

Next, the computer processes the dark areas to find alphabetic letters, numeric digits, and symbols. There are various techniques for OCR programs, but most involve targeting one character, word, or block of text at a time.
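One simple way to target one character at a time is to split a line of text wherever a completely blank column appears. The sketch below assumes a binary image stored as rows of 0 (background) and 1 (ink), and that characters do not touch each other. The names `segment_columns` and `crop` are hypothetical helpers written for this example.

```python
def segment_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not binary:
        return []
    width = len(binary[0])
    # True for every column that has at least one ink pixel.
    has_ink = [any(row[x] == 1 for row in binary) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # character touches the right edge
    return segments


def crop(binary, start, end):
    """Cut the columns start..end-1 out of every row."""
    return [row[start:end] for row in binary]


# One line of "text" containing three separate dark shapes.
line = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 1],
]

segments = segment_columns(line)
glyphs = [crop(line, start, end) for start, end in segments]
print(segments)
```

Each cropped glyph can then be passed to a recognizer on its own. Touching or overlapping characters, as in cursive handwriting, need more sophisticated segmentation than this.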

 

How are optical character recognition (OCR) systems trained?

You can train some OCR programs with pattern recognition. These models are trained on examples of text in various fonts and formats, and then compare the characters in a scanned document against those examples to recognize them. Other OCR systems use feature detection, where the OCR program applies rules about the features of a specific letter, number, or symbol to recognize characters in the scanned image. For example, some common features are the number of angled lines, crossed lines, or curves in a written character. Your OCR model might store the capital letter “A” as two diagonal lines that meet at the top, with a horizontal line across the middle.
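The pattern recognition approach can be sketched as simple template matching: store one reference bitmap per character, then pick the stored bitmap that agrees with the scanned glyph on the most pixels. The 3x3 templates and the `similarity` and `recognize` functions below are toy assumptions for illustration; a production system would use many examples per character and a trained classifier.

```python
# Hypothetical 3x3 reference bitmaps; "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def similarity(glyph, template):
    """Fraction of pixels that agree between two equally sized bitmaps."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return (character, score) for the best-matching template."""
    scored = ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda item: item[1])


# A "T" with one extra noisy pixel in the bottom-right corner.
noisy_t = ["111", "010", "011"]
char, score = recognize(noisy_t)
print(char, round(score, 2))
```

Because the match is scored rather than exact, the glyph is still recognized as "T" even though one of its nine pixels is wrong.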

Finally, when your model identifies a written character or number, it can be converted into an ASCII (American Standard Code for Information Interchange) code. ASCII is a long-established encoding for text files on computers and the Internet, in which each character or number is represented with a 7-bit binary number. Modern systems usually store text as Unicode (for example, UTF-8), which includes ASCII as a subset.
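As a small illustration of this last step, the sketch below turns recognized characters into their ASCII code points and 7-bit binary form using Python's built-in `ord` and `format`. The helper name `to_ascii_codes` is an assumption made for this example.

```python
def to_ascii_codes(text):
    """Return (character, code point, 7-bit binary string) for each character."""
    codes = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            # ASCII only covers code points 0 through 127.
            raise ValueError(f"{ch!r} is not an ASCII character")
        codes.append((ch, code, format(code, "07b")))
    return codes


for ch, code, bits in to_ascii_codes("A1"):
    print(ch, code, bits)
# A 65 1000001
# 1 49 0110001
```

Characters outside the ASCII range, such as accented letters, raise an error here; handling them would require a Unicode encoding instead.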

 

What is optical character recognition (OCR) used for?

You can use OCR for a variety of data entry and data categorization tasks. Here are a few examples.

 

Data Entry

OCR can automate data entry tasks for business documents. You can use OCR software to turn hard copies of legal or historical documents into searchable PDF files. This way, you can edit, format, and search the text as if you had created the document with a word processor.

 

Data Categorization

You can use OCR for a wide range of data categorization tasks. For example, you can automate sorting letters for mail delivery, or deposit checks electronically without the need for a bank teller.

Use cases include adding certified legal documents to an electronic database and indexing print material for search engines. You can also use OCR to convert documents into text, which you can then convert to audio for visually impaired users. More examples of OCR-powered technology include translation apps, online databases like Google Books, and security cameras that recognize license plates.

 

Lionbridge AI can help you build a handwriting or OCR model. We have 500,000 contributors who can build and annotate image datasets to train your model. In addition, our project management team will handle the details of your project from start to finish. We’ll take care of sourcing qualified contributors for your project, meeting important deadlines throughout the process, and more.

 

Featured image by Pietro.dipalma via Wikimedia Commons.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.
