How to Extract Information from Documents: Template Matching

Template matching
Information Extraction (IE)

What is Template Matching?

Let’s say you have a digital invoice and you would like to automate the extraction of the ‘Invoice Number’, ‘Client Name’, and ‘Total Amount’. You receive the invoice as an image, so to extract its text you need OCR (Optical Character Recognition) software. In case you don’t know, OCR is the conversion of images of text (typed, handwritten, or printed) into machine-encoded text.

iPhone built-in OCR
Extracted Text
Template with zones of interest

How to build your Template Matching solution

The first step in Template Matching (TM) is to create a template, which means capturing a representation of the zones of interest. If we were to write a template for the invoice image above, it would look like this in a JSON file:
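As a rough illustration, a template of this kind could be built and serialized as shown below. The template name, page size, zone names, and pixel coordinates are all assumptions made up for this sketch, so treat it as one possible shape rather than the exact file used for the invoice above.

```python
import json

# Hypothetical template for one invoice layout. Every value here is an
# illustrative assumption: real coordinates come from measuring your document.
TEMPLATE = {
    "name": "invoice_v1",
    "width": 1240,    # page width (pixels) the zones were measured on
    "height": 1754,   # page height (pixels) the zones were measured on
    "zones": [
        {"name": "invoice_number", "x": 880, "y": 120, "width": 280, "height": 50},
        {"name": "client_name", "x": 100, "y": 320, "width": 500, "height": 60},
        {"name": "total_amount", "x": 900, "y": 1500, "width": 260, "height": 60},
    ],
}


def validate_template(template):
    """Raise ValueError if any zone falls outside the template page."""
    for zone in template["zones"]:
        inside = (
            zone["x"] >= 0
            and zone["y"] >= 0
            and zone["x"] + zone["width"] <= template["width"]
            and zone["y"] + zone["height"] <= template["height"]
        )
        if not inside:
            raise ValueError(f"zone {zone['name']!r} is outside the page")
    return True


if __name__ == "__main__":
    validate_template(TEMPLATE)
    # Serialize the template so it can be stored next to the documents.
    print(json.dumps(TEMPLATE, indent=2))
```

Storing the page size alongside the zones matters because the coordinates only make sense relative to the size they were measured on.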

Template representation in JSON
Zone with zoom, regex, correction parameters
Resize the image to template size and crop zones
Crop image with Wand (Python)
Output in JSON
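The steps captioned above (resize the image to the template size, crop the zones, apply each zone's parameters, and output JSON) could be sketched roughly as follows. Several things here are assumptions: the zone keys (`zoom`, `regex`, `correction`), the reading of `zoom` as a scale factor and `correction` as a character substitution map, and the `ocr` argument, which is a hypothetical callable standing in for whatever OCR engine you use. Only the cropping function needs Wand (and ImageMagick) installed.

```python
import json
import re


def crop_zones(image_path, template):
    """Resize the image to the template size and return one PNG blob per zone."""
    from wand.image import Image  # third-party; imported lazily on purpose

    crops = {}
    with Image(filename=image_path) as img:
        # Bring the scan to the size the zone coordinates were measured on.
        img.resize(template["width"], template["height"])
        for zone in template["zones"]:
            with img.clone() as crop:
                crop.crop(left=zone["x"], top=zone["y"],
                          width=zone["width"], height=zone["height"])
                zoom = zone.get("zoom", 1)  # assumed: scale factor before OCR
                if zoom != 1:
                    crop.resize(max(1, round(crop.width * zoom)),
                                max(1, round(crop.height * zoom)))
                crops[zone["name"]] = crop.make_blob("png")
    return crops


def extract_field(raw_text, zone):
    """Clean OCR text with the zone's correction map, then apply its regex."""
    text = raw_text.strip()
    # Assumed format: {"wrong": "right"} substitutions for common OCR confusions.
    for wrong, right in zone.get("correction", {}).items():
        text = text.replace(wrong, right)
    pattern = zone.get("regex")
    if pattern is None:
        return text or None
    match = re.search(pattern, text)
    return match.group(0) if match else None


def extract_document(crops, template, ocr):
    """Run the hypothetical `ocr` callable on each crop and collect the fields."""
    return {
        zone["name"]: extract_field(ocr(crops[zone["name"]]), zone)
        for zone in template["zones"]
    }


def to_json(fields):
    """Serialize the extracted fields as the final JSON output."""
    return json.dumps(fields, indent=2)
```

In use, you would call `crop_zones("invoice.png", template)`, pass the result to `extract_document` together with your OCR function, and hand the returned dictionary to `to_json`. Returning `None` for a field whose regex does not match makes failed extractions easy to spot downstream.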

Limitations of Template Matching

Template Matching is a straightforward solution that is easy to implement. It works especially well for electronically generated documents, where document quality and format are intact.

Aicha Fatrah

Software Engineer | Technical Writer | IT Enthusiast