How to Implement Optical Character Recognition in Python

Introduction

Optical Character Recognition (OCR) is an important technique that is widely implemented in the Python programming language. There are a lot of real-world applications built on it. In this tutorial, we will go through a complete overview of Optical Character Recognition.

How to implement Optical Character Recognition in the Python programming language?

Let’s use “pytesseract” to build the OCR engine, which takes in images and scans them for text. The engine lives in a file named “ocr.py”. Its “process_image” function sharpens the image before the text is read. The view function and route handler are added to the “app.py” application.

Let’s check out the route handler code and the OCR engine code below.

Router Handler Code:

# Route handler
@app.route('/v{}/ocr'.format(_VERSION), methods=["POST"])
def ocr():
    try:
        url = request.json['image_url']
        if 'jpg' in url:
            output = process_image(url)
            return jsonify({"output": output})
        else:
            return jsonify({"error": "only .jpg files, please"})
    except Exception:
        return jsonify(
            {"error": "Did you mean to send: {'image_url': 'some_jpeg_url'}"}
        )

OCR Engine Code:

# OCR engine
from io import BytesIO

import pytesseract
import requests
from PIL import Image
from PIL import ImageFilter


def process_image(url):
    image = _get_image(url)
    # filter() returns a new image, so keep the sharpened result
    image = image.filter(ImageFilter.SHARPEN)
    return pytesseract.image_to_string(image)


def _get_image(url):
    # Download the image and open it from the in-memory bytes
    return Image.open(BytesIO(requests.get(url).content))

In app.py, you should update the imports and add an API version number.

import os
import logging
from logging import Formatter, FileHandler
from flask import Flask, request, jsonify
from ocr import process_image
_VERSION = 1  # API version

Here, the route handler calls “process_image()”, the OCR engine function, and returns its result in the JSON response. JSON is used to carry the data going into and out of the API. In the engine, the downloaded response is wrapped in a file-like object so that the “Image” module from PIL can open it easily.
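As a rough sketch of how the JSON going in and out of the API is shaped, the snippet below parses a request body and builds a matching response with the standard json module only. The helper name extract_image_url and the example.com URL are hypothetical and not part of the original app; the “image_url” and “output” keys follow the route handler shown earlier.

```python
import json


def extract_image_url(body):
    """Return the 'image_url' value from a JSON request body.

    Hypothetical helper for illustration. Raises ValueError when the body
    is not a JSON object that contains a string under 'image_url'.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("image_url"), str):
        raise ValueError("expected {'image_url': 'some_jpeg_url'}")
    return payload["image_url"]


# Placeholder request body; the URL is an assumption, not a real image.
request_body = '{"image_url": "https://example.com/sample.jpg"}'
url = extract_image_url(request_body)

# The response mirrors the {"output": ...} shape used by the route handler.
response_body = json.dumps({"output": "text recognized from " + url})
```

Validating the body explicitly like this makes it possible to return a specific error message instead of relying on a broad except clause.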

The above code accepts .jpg images only, because of the check in the route handler. If you use a library that supports more image formats, then every image can be processed in the same way. If you are interested in writing this code on your own, make sure that PIL is installed (it is distributed as the Pillow package).
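One possible way to accept more than .jpg files is to check the file extension of the URL path instead of testing whether 'jpg' appears anywhere in the URL. The sketch below uses only the standard library; the function name and the set of extensions are assumptions chosen for illustration, and which formats can actually be opened depends on your Pillow installation.

```python
import os
from urllib.parse import urlparse

# Assumed set of extensions; adjust it to the formats you want to allow.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def has_supported_extension(url):
    """Check the extension of the URL path, ignoring query strings and case."""
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    return extension in SUPPORTED_EXTENSIONS
```

Unlike the substring test, this rejects a URL such as https://example.com/jpg/readme.txt, where "jpg" appears in the path but the file is not an image.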

Start the application by running “app.py”.

$ cd ../home/flask_server/
$ python app.py

Now open another terminal and run the following:

$ curl -X POST http://localhost:5000/v1/ocr -d '{"image_url": "some_url"}' -H "Content-Type: application/json"

Let’s consider an example:

$ curl -X POST http://localhost:5000/v1/ocr -d '{"image_url": "https://besanttechnologies.com/images/blog_images/ocr/ocr.jpg"}' -H "Content-Type: application/json"
{
  "output": "ABCDE\nFGH I J\nKLMNO\nPQRST"
}
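The same request can also be prepared from Python. This is a minimal sketch using only the standard library; build_ocr_request is a hypothetical helper name, and the host, port and version defaults are assumptions taken from the curl command. The request is only built here, not sent, so the snippet runs without the server.

```python
import json
import urllib.request


def build_ocr_request(image_url, base_url="http://localhost:5000", version=1):
    """Build (but do not send) the POST request that the curl command issues."""
    endpoint = "{}/v{}/ocr".format(base_url, version)
    body = json.dumps({"image_url": image_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ocr_request(
    "https://besanttechnologies.com/images/blog_images/ocr/ocr.jpg"
)

# To actually send it while app.py is running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```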

Let me explain this with the images.

Input Image: The following is the input image that should be converted to digital text.

Input Image

Output Image: The following is the output we receive.

Output Image
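The recognized text comes back as a single string in which the lines of the image are separated by newline characters. A small sketch of turning that string into a list of lines is shown below; the function name is hypothetical and the sample string is illustrative, not real OCR output.

```python
def ocr_text_to_lines(text):
    """Split OCR output into non-empty lines with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Illustrative string with one newline per line of text in the image.
sample_output = "ABCDE\nFGH I J\nKLMNO\nPQRST\n"
lines = ocr_text_to_lines(sample_output)
```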

Applications of Optical Character Recognition

Optical Character Recognition is used in many applications. Here are two examples.

Ticket counters make use of Optical Character Recognition to detect and scan the important data on a ticket, identifying the commuter’s details as well as the route. Another example is converting paper text to digital formats: a camera captures high-resolution images, and Optical Character Recognition then helps to bring them into a PDF or Word format.

OCR in Python is supported by tools such as “Ocrad” and “Tesseract,” which are powerful and versatile. They make the code easier to design, so developers can invest more of their time in other important parts of their projects. Apart from the examples above, there are plenty of other applications that make use of Optical Character Recognition.

Now let’s look at another example of implementing Optical Character Recognition in Python in more depth.

How to read PDF content using OCR in Python

Python provides different libraries to convert PDF to text format. Let’s look at the process in detail. To convert a PDF to text, we first convert the PDF pages to images, then use Optical Character Recognition to read the content of each image, and finally store the result as a file in text format.

We need the following installations.

  • pip3 install Pillow
  • pip3 install pytesseract
  • pip3 install pdf2image
  • sudo apt-get install tesseract-ocr
  • sudo apt-get install poppler-utils (pdf2image relies on poppler)

The program can be split into two processes.

Process 1:

The first part deals with the conversion of the PDF to images. Every PDF page is stored as an image file. The images are named as follows:

PDF page 1 → page_1.jpg

PDF page 2 → page_2.jpg

……

PDF page n → page_n.jpg

Here is the implementation of process 1:

# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
# Path of the pdf
PDF_file = "f.pdf"

Converting PDF to images

pages = convert_from_path(PDF_file, 500)  # convert every PDF page to an image at 500 DPI
image_counter = 1  # counter used to name the image of each PDF page
# Iterate through all the pages stored above
for page in pages:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1  # increment the counter so that the filename is updated
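The naming scheme used in the loop can be factored into a small helper, so that the code that writes the images and the code that later reads them build the filenames in the same way. This is only a sketch; page_filenames is a hypothetical name that does not appear in the original program.

```python
def page_filenames(page_count, prefix="page_", extension=".jpg"):
    """Return the image filenames for pages 1..page_count, in page order."""
    return [prefix + str(number) + extension for number in range(1, page_count + 1)]


# Filenames for a three-page PDF.
names = page_filenames(3)
```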

Process 2:

The second part deals with identifying the text in the converted image files and then storing it as a text file. In this part, we process those images and convert them to text. Once we have the text as a string variable, we can apply different kinds of text processing to it.

For example, suppose you are writing a line and cannot fit the last word on it. In this case, you break the word with a hyphen (-) and continue it on the next line, so that it is still read as one continuous word.

For example:

I am a programmer with a good knowledge of different program-
ming languages. I am ready to face any interview.

For this kind of word, we do some pre-processing: the hyphen and the new line are removed so that the two parts become a complete word again. Once the pre-processing is completed, the text is stored in a separate text file. The input PDF file used in the code is f.pdf.
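The pre-processing step described above can be sketched in isolation. This version uses a regular expression so that Windows-style line endings are handled too; the function name and the sample text are hypothetical and only serve as an illustration.

```python
import re

# A hyphen immediately followed by a line break (Unix or Windows style).
_HYPHEN_BREAK = re.compile(r"-\r?\n")


def join_hyphenated_words(text):
    """Remove hyphen + line-break pairs so that split words become whole again."""
    return _HYPHEN_BREAK.sub("", text)


sample = "different program-\nming languages"
joined = join_hyphenated_words(sample)
```

Note that a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen with this approach, so the result may need a manual check for such words.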

Here is the implementation of the second part.

# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
# Path of the pdf
PDF_file = "f.pdf"

Identifying text from the images using OCR

filelimit = image_counter - 1  # total number of pages (image_counter comes from process 1)
outfile = "out_text.txt"  # text file that receives the output
f = open(outfile, "a")  # open in append mode so that the text of every image goes into the same file
for i in range(1, filelimit + 1):  # iterate over all the pages, starting from one
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"
    text = str(pytesseract.image_to_string(Image.open(filename)))
    text = text.replace('-\n', '')  # join words that were hyphenated across lines
    f.write(text)  # write the processed text to the text file
f.close()  # close the file after all the text is written
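The loop above can also be wrapped in a function that takes the recognition step as a parameter, which makes it possible to try the file handling without Tesseract installed. This is a sketch under that assumption; write_recognized_text and its recognize parameter are hypothetical names, not part of pytesseract.

```python
def write_recognized_text(image_files, recognize, out_path):
    """Run `recognize` on each image file and append the cleaned text to out_path.

    `recognize` is any callable that maps a filename to a string, for example
    lambda name: pytesseract.image_to_string(Image.open(name)).
    """
    # The with statement closes the file even if recognition fails midway.
    with open(out_path, "a", encoding="utf-8") as out_file:
        for name in image_files:
            # Join words that were hyphenated across a line break.
            text = recognize(name).replace("-\n", "")
            out_file.write(text)
```

In the real program, the recognize argument would be the pytesseract call from the listing above, and image_files would be the page images written in process 1.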

Input file:

Input File

Output file:

Output File

Benefits and Drawbacks of OCR Engine:

Handwriting recognition is one of the important applications of OCR in Python. OCR is also used to convert PDF to text and to store the recognized text in variables for further processing. When it comes to the drawbacks, OCR engines do not guarantee 100% accuracy. Even when AI techniques are used, there is a chance of poor results on some images. The results for handwritten images differ based on aspects such as page color, image contrast, writing style, and image resolution.

I hope you are now clear about implementing Optical Character Recognition in Python. If you have any queries, let us know in the comment section below.
