How to Extract Information from Documents: Template Matching
Key information extraction from documents using the Template Matching (TM) technique
Most business workflows deal with files such as invoices, orders, tax forms, financial reports, mail, and questionnaires. This produces an enormous number of documents, and handling them manually takes a lot of time and labor. Document information extraction (IE) is the process of automatically extracting relevant values from unstructured visual or textual sources and transforming them into structured data that can be stored. It speeds up processing, reduces expenses, and makes the process less prone to human error.
For the past few years, I have worked on digital transformation projects to solve document digitalization and automation problems. Working with clients, I learned that there are many methods and techniques for doing so, and that the right choice depends on aspects such as how the documents are acquired, the nature of the documents, and the overall process of the document management system.
I decided to create a series of articles highlighting the different techniques I have used in IE. In this article, I’ll explain how the Template Matching (TM) technique works and how you can implement a TM solution for your business, and I will also cover the drawbacks and limitations of the method.
What is Template Matching
Let’s say that you have a digital invoice and you would like to automate the extraction of the ‘Invoice Number’, ‘Client Name’, and ‘Total Amount’. You receive the invoice as an image, so to extract text from it you need OCR (Optical Character Recognition) software. In case you are not familiar with it, OCR is the conversion of images of text (typed, handwritten, or printed) into machine-encoded text.
When you run OCR on a document image, the output is the machine-encoded text that the engine was able to recognize. As shown below, this chunk of text doesn’t necessarily preserve the geometric positions of the words, the spacing, or the style, so it is hard to identify and locate the key information we need.
Processing the chunk of text can take a lot of time and require many regex rules or NLP, especially for text-heavy documents like contracts. When the document always follows a template and we know the positions of the zones of interest, one solution is Template Matching.
Template Matching is a technique based on the geometric coordinates of regions of interest (ROI). The one condition is that the zones of interest have static positions. Let’s say we know for sure that the ‘Total Amount’ in the invoice is always located within a zone with coordinates (x, y, width, height). In that case, we can OCR that specific zone directly and get the output text. This not only reduces the processing time compared with reading the entire document, it also produces structured output that we can insert directly into a database or pass to another service that requires the data.
How to build your Template Matching solution
The first step in TM is to create a template, which means capturing a representation of the zones of interest. If we were to describe a template for the invoice image above, it would look like this in a JSON file:
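As a minimal sketch, such a template could be built in Python and written out as JSON. The key names (`zones`, `title`, `x`, `y`, `width`, `height`) and all pixel values are illustrative assumptions, not values taken from a real invoice.

```python
import json

# Hypothetical template for the invoice example. Key names and pixel
# values are illustrative assumptions, not taken from a real document.
template = {
    "width": 1240,   # width of the reference image, in pixels
    "height": 1754,  # height of the reference image, in pixels
    "zones": [
        {"title": "invoice_number", "x": 900, "y": 120, "width": 250, "height": 40},
        {"title": "client_name", "x": 100, "y": 300, "width": 500, "height": 40},
        {"title": "total_amount", "x": 900, "y": 1500, "width": 250, "height": 50},
    ],
}

# Serialize the template so it can be stored next to the documents it describes.
template_json = json.dumps(template, indent=2)
print(template_json)
```

Keeping the template as plain data means a new document layout only needs a new JSON file, not new code.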
As you can see, the template has three zones, and each zone has its coordinates and a title. The template width and height are also important: if the image size differs from the template’s, you need to resize the image to the template size, because the zone positions are relative to the size of the template.
The JSON presented above can have more components, such as a page number if you receive a multi-page PDF and convert it into images. It can also contain some logic related to each zone. I explained in a previous article how you can improve the output of an OCR. Let’s say you need to zoom into the zone image a little to improve readability; then you might add a ‘zoom’ parameter. Or let’s say your OCR reads the letter ‘o’ instead of the digit ‘0’ in a zone that only contains numbers; then you might add corrections for that specific zone. Or you may want to add a regex rule to that zone to check the accuracy of your output. There are a lot of configurations that you can add to your template. A zone with multiple parameters would look like this:
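One possible shape for such a zone, together with a small helper that applies its rules to raw OCR text, is sketched below. The key names `zoom`, `corrections`, and `regex`, and the helper `postprocess`, are assumed names for illustration only.

```python
import re

# Hypothetical zone definition. The extra keys ("zoom", "corrections",
# "regex") are assumed names for the options described above.
total_amount_zone = {
    "title": "total_amount",
    "x": 900, "y": 1500, "width": 250, "height": 50,
    "zoom": 2,                            # enlarge the crop 2x before OCR
    "corrections": {"o": "0", "O": "0"},  # characters the OCR tends to confuse
    "regex": r"\d+(\.\d{2})?",            # expected shape of the value
}


def postprocess(raw_text, zone):
    """Apply the zone's corrections, then check the result against its regex."""
    text = raw_text.strip()
    for wrong, right in zone.get("corrections", {}).items():
        text = text.replace(wrong, right)
    pattern = zone.get("regex")
    # A zone without a regex rule is always considered valid.
    is_valid = pattern is None or re.fullmatch(pattern, text) is not None
    return text, is_valid


print(postprocess(" 1o5.5O\n", total_amount_zone))  # ('105.50', True)
```

Because the corrections are scoped to one zone, replacing ‘o’ with ‘0’ here does not damage text zones such as the client name.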
Now that we have defined our template, we need to extract a separate image for each zone according to its coordinates. This step can be skipped if the OCR engine you’re using accepts ROI parameters. Otherwise, if you’re using an OCR engine like Tesseract, you need to provide the cropped zone images directly to the OCR.
There is a lot of open-source software you can use for cropping images. A well-known library is ImageMagick, and if you are running your code in Python you can use an ImageMagick binding such as Wand.
The last step is to read the cropped images. You can use pytesseract to read each image, and you can structure your output like this:
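The cropping step could be sketched with Wand roughly as follows. The function names `crop_args` and `crop_zones` are hypothetical, the template layout (`width`, `height`, `zones` with `title`, `x`, `y`, `width`, `height`) is an assumed one, and the code expects Wand and ImageMagick to be installed.

```python
from pathlib import Path


def crop_args(zone, template):
    """Return keyword arguments for Wand's Image.crop, checking template bounds."""
    if (zone["x"] < 0 or zone["y"] < 0
            or zone["x"] + zone["width"] > template["width"]
            or zone["y"] + zone["height"] > template["height"]):
        raise ValueError(f"zone {zone['title']!r} falls outside the template")
    return {"left": zone["x"], "top": zone["y"],
            "width": zone["width"], "height": zone["height"]}


def crop_zones(image_path, template, out_dir):
    """Resize the page to the template size and save one image per zone."""
    # Imported here so the helper above works without Wand installed.
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    with Image(filename=str(image_path)) as page:
        # Zone coordinates only make sense at the template's own size.
        if page.size != (template["width"], template["height"]):
            page.resize(template["width"], template["height"])
        for zone in template["zones"]:
            # Crop a copy so the full page stays available for the next zone.
            with page.clone() as crop:
                crop.crop(**crop_args(zone, template))
                out_path = out_dir / f"{zone['title']}.png"
                crop.save(filename=str(out_path))
                paths[zone["title"]] = str(out_path)
    return paths
```

Calling `crop_zones("invoice.png", template, "zones")` would then return a dictionary mapping each zone title to the path of its cropped image; the file and folder names here are placeholders.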
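A minimal sketch of this step, assuming the cropped zone images are already saved on disk: `read_zones` and its `ocr` parameter are hypothetical names, and the sample values come from a stand-in OCR function so the example runs without Tesseract. They are made up, not real OCR results.

```python
import json


def read_zones(cropped_paths, ocr=None):
    """Run OCR on each cropped zone image and return {zone title: text}."""
    if ocr is None:
        # Default engine; requires pytesseract and the Tesseract binary.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {title: ocr(path).strip() for title, path in cropped_paths.items()}


# Stand-in OCR so the sketch runs without Tesseract; the values are made up.
fake_results = {
    "zones/invoice_number.png": "INV-0001\n",
    "zones/client_name.png": "ACME Corp\n",
    "zones/total_amount.png": "105.50\n",
}
cropped = {
    "invoice_number": "zones/invoice_number.png",
    "client_name": "zones/client_name.png",
    "total_amount": "zones/total_amount.png",
}

output = read_zones(cropped, ocr=fake_results.get)
print(json.dumps(output, indent=2))
```

Leaving out the `ocr` argument makes the function fall back to `pytesseract.image_to_string`, which accepts an image file path. Keying the result by zone title gives a flat structure that is easy to insert into a database row or send to another service.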
Limitations of Template Matching
TM is a great solution: it is straightforward and easy to implement, and it works very well, especially for electronically generated documents, where quality and format are intact.
But as you might have figured out by now, TM is limited to documents that follow a template in which the positions of the zones are static. For example, if your documents are slightly rotated or misplaced on a scanner, so that the position of a zone shifts, the output won’t be correct.
Every change to the document, such as a change of style, requires reconfiguring the template. There are also challenges when the positions of certain zones change with the content. For example, more rows in the table of the invoice shown above can push the ‘Total Amount’ zone out of place.
Another drawback of this method is that you need to know what document you are dealing with in order to assign it to its corresponding template. If you receive documents from different organizations, you need a template configuration for each type, and if the type is not sent along with the document, you need a classifier that decides which template to use for which document.
I hope this article was helpful. I’ll be sharing other techniques of IE, so consider following me for more.