Improve the quality of your OCR information extraction

From Image to Text Transformation
  • Control the input
  • Image preprocessing
  • Tesseract options
  • Custom training Tesseract
  • Text postprocessing

1. Control the input

2. Image preprocessing

Rescaling

Tesseract result before rescaling: Cakhestew Umted v Astanwfla
Tesseract result after rescaling: Colchester United v Aston Villa
Resizing the image to twice its size with OpenCV in Python

Image binarization

Applying a binary threshold with OpenCV
Tesseract output on the binarized image vs. the color image

Image noise reduction

Noise removal

Image rotation or deskewing

Using OpenCV to detect the skew angle and then rotate the image
Before and after deskewing

Border and background removal

Cropping the background

Alpha channel or transparency

ImageMagick command to remove the alpha channel
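The caption above refers to a command-line tool; for readers who stay in Python, a rough equivalent is to blend the transparent pixels onto a solid background with NumPy. This is a swapped-in technique rather than the original command, and `flatten_alpha` is a hypothetical helper. It assumes a 4-channel BGRA array, such as the one returned by `cv2.imread(path, cv2.IMREAD_UNCHANGED)` for a PNG with transparency.

```python
import numpy as np


def flatten_alpha(bgra: np.ndarray, background: int = 255) -> np.ndarray:
    """Blend a BGRA image onto a solid background and drop the alpha channel."""
    if bgra.ndim != 3 or bgra.shape[2] != 4:
        return bgra  # no alpha channel, nothing to do
    alpha = bgra[:, :, 3:4].astype(np.float32) / 255.0
    color = bgra[:, :, :3].astype(np.float32)
    # Standard alpha compositing: foreground * a + background * (1 - a).
    blended = color * alpha + background * (1.0 - alpha)
    return blended.round().astype(np.uint8)
```

A white background (255) is the default here on the assumption that the text is dark; for light text a black background (0) would preserve more contrast.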

3. Tesseract options

Page segmentation (psm)

psm options
Two lines we want to read
Tesseract with psm 6
Tesseract output with the segmentation parameter vs. without it
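From Python, the page segmentation mode can be passed through the `pytesseract` wrapper, assuming that package and the Tesseract binary are installed. The helpers `build_config` and `read_text` below are hypothetical names for illustration. Mode 6 tells Tesseract to assume a single uniform block of text.

```python
def build_config(psm: int = 6) -> str:
    """Build the Tesseract option string for a page segmentation mode."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def read_text(image, psm: int = 6) -> str:
    """Run Tesseract on an image (file path, PIL image or NumPy array)."""
    import pytesseract  # third-party wrapper, imported lazily

    return pytesseract.image_to_string(image, config=build_config(psm))
```

Running `tesseract --help-psm` on the command line lists every mode with a short description, which helps when choosing one for a given layout.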

Language (l)

Calling the French language model
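Selecting the language from Python might look like the sketch below. It assumes the `pytesseract` wrapper and that the French trained data (`fra`) is installed alongside Tesseract; the helper names are hypothetical. Several languages can be combined with a `+` sign.

```python
def join_languages(codes) -> str:
    """Combine Tesseract language codes, e.g. ["fra", "eng"] -> "fra+eng"."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def read_text_in(image, codes=("fra",)) -> str:
    """Run Tesseract with the given language models."""
    import pytesseract  # third-party wrapper, imported lazily

    return pytesseract.image_to_string(image, lang=join_languages(codes))
```

The command `tesseract --list-langs` shows which models are actually available on the machine, which is worth checking before blaming the image for poor results.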

4. Custom training Tesseract

Challenging font

5. Text postprocessing
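As one possible illustration of postprocessing, the sketch below cleans up the raw output and snaps near-miss words to a known vocabulary using only the standard library. The vocabulary, the similarity cutoff and the helper name `postprocess` are all assumptions; a real project would tailor them to the expected content.

```python
import difflib
import re


def postprocess(text: str, vocabulary, cutoff: float = 0.8) -> str:
    """Clean OCR output and correct words that nearly match the vocabulary."""
    # Drop control characters (Tesseract often ends a page with a form feed).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    corrected = []
    # split() with no argument also collapses repeated spaces and newlines.
    for token in text.split():
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Usage with a small, hypothetical vocabulary of expected words.
vocabulary = ["Colchester", "United", "Aston", "Villa"]
print(postprocess("Colchestcr   Unlted v  Aston Villa\x0c", vocabulary))
# prints: Colchester United v Aston Villa
```

A high cutoff keeps the correction conservative: words that are far from anything in the vocabulary are left untouched rather than replaced with a wrong guess.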

Software Engineer | Technical Writer | IT Enthusiast

Aicha Fatrah