Tesseract initial setup

Install Tesseract:

sudo apt install tesseract-ocr

Install python3-pil (the Pillow imaging library):

sudo apt-get install -y python3-pil

Install pytesseract:

pip install pytesseract

Testing Tesseract.

To run the Tesseract engine on an image and print the recognized text line by line in the terminal, replace image_path with the path to the image and type the following:

$ tesseract image_path stdout

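To write the output text to a file, give an output base name. Tesseract appends the .txt extension itself, so the following writes text_result.txt:

$ tesseract image_path text_result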

To specify the language model, write the language code after the -l flag. The default is English (eng):

$ tesseract image_path text_result -l eng

By default, Tesseract expects a page of text when it segments an image. If you only want to OCR a small region, try a different page segmentation mode with the --psm argument. There are 14 modes available (0 to 13); run tesseract --help-psm to list them. The default mode (3) performs fully automatic page segmentation but no orientation and script detection. To specify the mode, for example 6 (assume a single uniform block of text), type the following:

$ tesseract image_path text_result -l eng --psm 6

There is one more important argument, the OCR engine mode (--oem). Tesseract 4 has two OCR engines: the legacy Tesseract engine and the LSTM engine. The --oem option selects one of four modes of operation:
0    Legacy engine only.
1    Neural nets LSTM engine only.
2    Legacy + LSTM engines.
3    Default, based on what is available.
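
The command-line options above can also be assembled from a Python script. The following is a minimal sketch using only the standard library; the helper names build_tesseract_command and run_tesseract are hypothetical, and it assumes the tesseract executable is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng", psm=3, oem=3):
    """Return the argument list for one tesseract call (hypothetical helper)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return [
        "tesseract", str(image_path), output_base,
        "-l", lang,
        "--psm", str(psm),
        "--oem", str(oem),
    ]


def run_tesseract(image_path, lang="eng", psm=3, oem=3):
    """Run tesseract on one image and return the text it prints to stdout."""
    command = build_tesseract_command(image_path, "stdout", lang, psm, oem)
    # check=True raises CalledProcessError if tesseract exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run_tesseract("image_path", psm=6, oem=1) corresponds to typing tesseract image_path stdout -l eng --psm 6 --oem 1.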

Adding a language from the Ubuntu packages, for example Estonian (replace [lang] with the language code, est for Estonian):

sudo apt-get install tesseract-ocr-[lang]
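
To check that the language was installed, the list printed by tesseract --list-langs can be read from Python. This is a rough sketch, not an official API: the function names are hypothetical, and it assumes the listing starts with a header line beginning with "List of available" followed by one language code per line.

```python
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line starts with "List of available"; skip it.
    return [line for line in lines if not line.startswith("List of available")]


def installed_languages():
    """Return the language codes known to the local tesseract installation."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Assumption: some releases may print the list on stderr, so read both streams.
    return parse_language_list(result.stdout + "\n" + result.stderr)
```

After installing the Estonian package, "est" in installed_languages() should be true.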

Compiling Tesseract:

To run pytesseract in a Python script, install it if it is not already installed:

pip install pytesseract
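
Once installed, a script can call pytesseract roughly as follows. This is a minimal sketch: build_config and image_to_text are hypothetical helper names, and the language, --psm and --oem values are only examples. pytesseract.image_to_string takes the language in its lang argument and extra command-line options in its config argument.

```python
def build_config(psm=3, oem=3):
    """Build the extra command-line options passed to Tesseract."""
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path, lang="eng", psm=3, oem=3):
    """Open an image with Pillow and return the text Tesseract recognizes."""
    # Imported here so build_config can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm, oem)
        )
```

For example, print(image_to_text("image_path", lang="eng", psm=6)) prints the recognized text, much like the stdout command above.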

Preprocessing for Tesseract

pip install opencv-python

Tesseract's accuracy drops on poorly prepared input, so make sure the image is appropriately preprocessed before OCR.

This includes rescaling, binarization, noise removal, deskewing, etc.

To preprocess an image for OCR, use OpenCV's Python functions or follow the OpenCV documentation.

For more, see "Improving the quality of the output" in the official Tesseract documentation.