Install Tesseract:
sudo apt install tesseract-ocr
Install python3-pil:
sudo apt-get install -y python3-pil
Install pytesseract:
pip install pytesseract
Testing Tesseract.
Run the Tesseract engine on the image at image_path and print the recognized text, line by line, to the terminal:
$ tesseract image_path stdout
To write the output text to a file, pass an output base name instead of stdout. Tesseract appends the .txt extension itself, so the following writes text_result.txt:
$ tesseract image_path text_result
To specify the language model, pass its language code after the -l flag. English (eng) is the default:
$ tesseract image_path text_result -l eng
By default, Tesseract expects a page of text when it segments an image. If you only want to OCR a small region, try a different page segmentation mode with the --psm argument. There are 14 modes available (0 to 13); run tesseract --help-psm to list them. The default mode, 3, fully automates page segmentation but does not perform orientation and script detection. To specify the mode, type the following:
$ tesseract image_path text_result -l eng --psm 6
Another important argument is the OCR engine mode (OEM). Tesseract 4 has two OCR engines: the legacy Tesseract engine and the LSTM engine. The --oem option selects one of four modes of operation:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
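The flags above can also be assembled from a Python script. The following is a minimal sketch, not an official wrapper: build_tesseract_command and run_tesseract are hypothetical helper names, and the sketch assumes the tesseract binary is on the PATH.

```python
import subprocess

PSM_MODES = range(0, 14)  # 14 page segmentation modes, numbered 0 to 13
OEM_MODES = range(0, 4)   # 4 OCR engine modes, numbered 0 to 3


def build_tesseract_command(image_path, output_base="stdout", lang="eng", psm=3, oem=3):
    """Return the tesseract argument list for the given options (hypothetical helper)."""
    if psm not in PSM_MODES:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in OEM_MODES:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--psm", str(psm),
        "--oem", str(oem),
    ]


def run_tesseract(image_path, **options):
    """Run tesseract and return whatever it printed to standard output."""
    command = build_tesseract_command(image_path, **options)
    # check=True raises CalledProcessError if tesseract exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run_tesseract("image_path", psm=6, oem=1) would treat the image as a single block of text and use only the LSTM engine. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.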
Adding a language from the Ubuntu packages, where [lang] is the language code (for example, est for Estonian):
sudo apt-get install tesseract-ocr-[lang]
Compiling Tesseract:
To use pytesseract in a Python script, install it first:
pip install pytesseract
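Once installed, pytesseract can be called from a script roughly as follows. This is a minimal sketch: build_config and image_to_text are hypothetical helper names, and it assumes Pillow and the tesseract binary are also installed.

```python
def build_config(psm=3, oem=3):
    """Build the extra command-line flags that pytesseract passes to tesseract."""
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path, lang="eng", psm=3, oem=3):
    """Return the text that Tesseract recognizes in the image at image_path."""
    # Imported here so build_config can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm, oem)
        )
```

Calling print(image_to_text("image_path", psm=6)) should give roughly the same text as the command-line call with --psm 6.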
Preprocessing for Tesseract
pip install opencv-python
Tesseract's accuracy drops on poor-quality input, so make sure the image is appropriately preprocessed.
This includes rescaling, binarization, noise removal, deskewing, etc.
To preprocess an image for OCR, use Python functions built on OpenCV or follow the OpenCV documentation.
Improve quality of the output (Official doc)