In-Depth

Go Beyond OCR with Amazon Textract

The new machine learning service is designed to identify usable data without the need for custom coding.

One of the best things about the Amazon cloud is that Amazon is continually introducing new services and features. One of the more interesting services that Amazon previewed in late November 2018 is Textract. At its simplest, Textract could be thought of as optical character recognition (OCR) software. However, Textract goes far beyond the capabilities that are usually associated with OCR.

OCR is one of those technologies that never really lived up to the hype. Yes, I realize that there are companies that are very successfully using OCR software in production. However, I haven't personally had such great luck with OCR software.

Back in the 1990s, I briefly operated an online store. Rather than manually transcribing my price list into the Web application's database, I thought that I could save myself a lot of time by scanning the price list and using OCR on the resulting file. Although the OCR software did a surprisingly good job, the resulting text was not in a usable format. I had to spend quite a few hours coding an application that would transform my scanned text into a format that my Web application could use. I spent so much time scanning the text, performing OCR, and then transforming the resulting text that I probably could have just transcribed the price list instead.

At the time, I was willing to overlook these difficulties, because OCR technology was still brand new. My problem with OCR, however, is that the software hasn't really evolved very much in the last 20 years. A couple of weeks ago, for example, I was working with a tool that uses OCR to extract text from fax messages. Although this seems like a really straightforward task, the process suffered from misrecognized characters and incorrect formatting. More specifically, some of the paragraph breaks were completely removed, while stray whitespace was inserted into a couple of seemingly random places.

At the time of writing, Amazon's Textract had not yet been released (this week the company announced the general availability of the service to select regions), so I haven't had a chance to try it out for myself yet. Even so, the machine learning service seems to be an attempt to finally fix everything that's wrong with OCR software.

The main thing that differentiates Textract from other OCR applications is that it's specifically designed to identify usable data without the need for custom coding. This is an extraordinarily important consideration, because it means that the form itself can be regarded as irrelevant.

Imagine for a moment that an insurance company is trying to use OCR as a tool for processing thousands of paper enrollment forms. In the past, there would likely have been a back-end application that tells the OCR engine how to read the form. The application might, for example, define the boundaries of the various fields on the page, and also define the data that's expected to exist in each of those fields.
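To make the template-driven approach concrete, here is a minimal sketch of what such a back-end definition might look like. Everything in it is hypothetical: the `FieldZone` class, the pixel coordinates, and the `(text, x, y)` word format are assumptions chosen for illustration, not the output of any particular OCR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldZone:
    """A rectangular region of the page where one field is expected (hypothetical)."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical template for one version of an enrollment form, in pixel coordinates.
ENROLLMENT_TEMPLATE = [
    FieldZone("name", 100, 50, 400, 80),
    FieldZone("ssn", 100, 100, 300, 130),
]


def extract_fields(words, template):
    """Assign each recognized word to the zone that contains its position.

    `words` is assumed to be a list of (text, x, y) tuples, roughly what an
    OCR engine might report for each recognized word.
    """
    fields = {zone.name: [] for zone in template}
    for text, x, y in words:
        for zone in template:
            if zone.left <= x <= zone.right and zone.top <= y <= zone.bottom:
                fields[zone.name].append(text)
                break
    return {name: " ".join(parts) for name, parts in fields.items()}
```

With words positioned where the template expects them, `extract_fields` returns the name and Social Security number. If the same words appear 100 pixels lower on the page, every field comes back empty, because the zones no longer line up with the text.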

Although this approach works, there are two big problems with handling OCR in this way. First, there's a considerable amount of effort involved in writing code that helps an OCR engine to know what types of data to expect at various locations on a page. From a business perspective, there's a cost associated with the development of such code, and the scanning process cannot commence until the code is complete and has been thoroughly tested.

The second problem with this approach is that if someone decides to change the form, then the change will break the code that defines the form.

Amazon's approach is to use machine learning to identify data types on a page. In America, for example, a number in the format of xxx-xx-xxxx is typically going to be a Social Security number. As such, Textract can look for this type of numerical pattern and identify any matches as a Social Security number. Similarly, phone numbers generally adhere to a known format, (xxx) xxx-xxxx, as do addresses, xxx Name of Street, City, State, ZIP code. Names are a bit trickier, but some names are very common and the detection of such names might help Textract to identify a Name field.
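The idea of recognizing data by its shape rather than by its position can be illustrated with simple pattern matching. The sketch below is only an analogy: it uses fixed regular expressions for the two formats mentioned above, whereas Textract is described as using machine learning, and nothing here reflects how the service works internally. The function and pattern names are made up for this example.

```python
import re

# xxx-xx-xxxx, the common U.S. Social Security number format.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# (xxx) xxx-xxxx, one common U.S. phone number format.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def classify_values(text):
    """Return every substring of `text` that matches a known data format."""
    return {
        "ssn": SSN_PATTERN.findall(text),
        "phone": PHONE_PATTERN.findall(text),
    }
```

Because the patterns look at the text itself, the values are found wherever they appear on the page. Names and addresses are much harder to capture with fixed patterns like these, which is where a learned model has the advantage.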

What's really nice is that Amazon allows you to capitalize on this type of form intelligence without having to perform any machine learning coding yourself. You can simply tell Textract what to do with the data, and Textract will worry about locating the data on the page.
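As a rough sketch of what "telling Textract what to do with the data" could look like from Python, the example below calls the service through the AWS SDK (boto3) and then pairs up the form keys and values in the response. It assumes boto3 is installed and AWS credentials are configured, and it relies on the block structure described in the Textract API documentation (`KEY_VALUE_SET` blocks linked by `VALUE` and `CHILD` relationships). The helper names are hypothetical, and the code has not been validated against real documents.

```python
def analyze_form(image_bytes):
    """Send a scanned page to Textract's AnalyzeDocument API.

    Requires boto3 and configured AWS credentials (assumed, not shown).
    """
    import boto3  # imported lazily so the parsing helpers work without it

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS"],
    )


def _block_text(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Map each detected form key to the text of its linked value."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        _block_text(blocks_by_id[value_id], blocks_by_id)
                    )
        pairs[_block_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs
```

Note that nothing in this code describes where a field sits on the page. The caller only decides what to do with the key-value pairs once the service has located them.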

Although Textract is currently available only in select regions, the technology seems really promising. I plan to write a follow-up post once I've actually had the chance to try out Textract for myself. You can read the Amazon Textract FAQ page for more details about the service.

About the Author

Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.
