Last modified: Apr 12, 2025 By Alexander Williams
Python OCR: Extract Text from Images Easily
OCR (Optical Character Recognition) converts images with text into machine-readable text. Python makes it easy with powerful libraries.
This guide will show you how to extract text from images using Python. We'll cover installation, basic usage, and practical examples.
Why Use Python for OCR?
Python is well suited to OCR tasks. Its syntax is simple, and libraries such as pytesseract wrap a mature OCR engine, so a working script takes only a few lines.
Common uses include digitizing documents, automating data entry, and processing receipts. It's also great for text extraction from PDFs when combined with conversion tools.
Installing Required Libraries
First, install pytesseract and Pillow. These are the main libraries for OCR in Python.
# Install required packages
pip install pytesseract Pillow
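You'll also need the Tesseract OCR engine itself, because pytesseract is only a wrapper that calls it. Install it from the official GitHub repository for your operating system, and make sure the tesseract executable is on your PATH.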
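Before running any OCR code, it can help to confirm that the engine is actually reachable. The following is a minimal sketch using only the standard library; the helper name find_tesseract is made up for this example, and it assumes the executable is called tesseract on your system.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH.

    The default name "tesseract" is an assumption; adjust it if your
    installation uses a different executable name.
    """
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("Tesseract not found on PATH; install it or tell pytesseract where it is.")
else:
    print(f"Tesseract found at: {path}")
```

If Tesseract is installed somewhere that is not on PATH, pytesseract lets you set pytesseract.pytesseract.tesseract_cmd to the full path of the executable instead.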
Basic Image to Text Conversion
Here's a simple script to extract text from an image. We'll use Pillow to open the image and pytesseract for OCR.
from PIL import Image
import pytesseract
# Open the image file
image = Image.open('sample.jpg')
# Perform OCR
text = pytesseract.image_to_string(image)
print(text)
This is sample text extracted from an image.
Line two of the sample text.
The image_to_string function does all the hard work. It returns the extracted text as a string.
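Because the result is a plain string, ordinary string handling applies to it. OCR output often carries trailing spaces, blank lines, or a form-feed character at the end of a page, so a small clean-up step is a common follow-up. This is only a sketch: clean_ocr_text is a hypothetical helper, and the raw value below is a hand-written stand-in for real OCR output.

```python
def clean_ocr_text(raw):
    """Strip stray whitespace and drop empty lines from OCR output."""
    # Remove form-feed page separators before splitting into lines.
    lines = (line.strip() for line in raw.replace("\x0c", "").splitlines())
    return "\n".join(line for line in lines if line)


# Hand-written stand-in for what image_to_string might return.
raw = "This is sample text extracted from an image. \n\nLine two of the sample text.\n\x0c"
print(clean_ocr_text(raw))
```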
Improving OCR Accuracy
OCR accuracy depends on image quality. Here are ways to improve results:
1. Use high-resolution images
2. Ensure proper lighting
3. Pre-process images with Python image segmentation techniques
Common pre-processing steps include cropping, resizing, converting to grayscale, and enhancing contrast.
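The pre-processing ideas above can be sketched with Pillow alone. The function name preprocess_for_ocr and its default values (a 2x upscale, an optional threshold) are assumptions chosen for illustration, not recommended settings; the right values depend on your images.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=None):
    """Return a grayscale, contrast-stretched, upscaled copy of the image.

    scale and threshold are illustrative defaults, not tuned values.
    """
    gray = ImageOps.grayscale(image)    # drop color information
    gray = ImageOps.autocontrast(gray)  # stretch contrast to the full range
    width, height = gray.size
    gray = gray.resize((width * scale, height * scale), Image.LANCZOS)
    if threshold is not None:
        # Optional binarization: pixels above the threshold become white.
        gray = gray.point(lambda p: 255 if p > threshold else 0)
    return gray
```

The returned image can be passed to pytesseract.image_to_string in place of the original. Comparing the output with and without each step is the most reliable way to see which ones help for a given set of images.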
Advanced OCR Techniques
For more control, specify OCR parameters. You can set language, page segmentation mode, and more.
text = pytesseract.image_to_string(
    image,
    lang='eng',
    config='--psm 6 --oem 3'
)
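PSM (Page Segmentation Mode) tells Tesseract what kind of layout to expect; --psm 6 assumes a single uniform block of text. OEM (OCR Engine Mode) selects the recognition engine; --oem 3 is the default and uses whichever engine is available.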
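When you experiment with several modes, building the config string in one place avoids typos. The helper below is a hypothetical convenience, not part of pytesseract; it assumes the value ranges of recent Tesseract releases (PSM 0 to 13, OEM 0 to 3).

```python
def build_tesseract_config(psm=6, oem=3):
    """Build a config string such as '--psm 6 --oem 3'.

    The accepted ranges are assumptions based on Tesseract 4 and 5.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--psm {psm} --oem {oem}"


# For example, PSM 7 treats the image as a single line of text.
print(build_tesseract_config(psm=7))
```

The resulting string goes into the config argument of image_to_string. Other modes worth trying include PSM 11 for sparse text scattered across the image.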
Handling Multiple Languages
Tesseract supports many languages. Each one needs its own trained data file (for example spa.traineddata for Spanish), so download additional language data files as needed.
# Extract text in Spanish
text = pytesseract.image_to_string(image, lang='spa')
For multilingual documents, join several language codes with a plus sign (for example lang='eng+spa'), and combine the result with other Python text extraction techniques.
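Tesseract accepts several languages at once when their codes are joined with a plus sign. The sketch below builds that string and can optionally check the codes against a list of installed languages. The name build_lang is hypothetical, and the available list is supplied by the caller (pytesseract.get_languages() is one way to obtain it, assuming your pytesseract version provides it).

```python
def build_lang(codes, available=None):
    """Join language codes into a Tesseract lang string such as 'eng+spa'.

    If available is given, raise ValueError for codes that are not in it.
    """
    codes = list(codes)
    if not codes:
        raise ValueError("at least one language code is required")
    if available is not None:
        missing = [code for code in codes if code not in available]
        if missing:
            raise ValueError(f"language data not installed: {', '.join(missing)}")
    return "+".join(codes)


print(build_lang(["eng", "spa"]))
```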
Processing Multiple Images
You can batch process multiple images. This is useful for digitizing documents or receipts.
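import os
from PIL import Image
import pytesseract

# Run OCR on every JPEG or PNG file in the images/ folder
for filename in os.listdir('images/'):
    if filename.lower().endswith(('.jpg', '.png')):
        image = Image.open(os.path.join('images', filename))
        text = pytesseract.image_to_string(image)
        print(f'{filename}:\n{text}\n')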
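For larger batches it is often more useful to collect the results than to print them, and to keep going when one file fails. The following is a sketch under stated assumptions: ocr_folder and IMAGE_SUFFIXES are hypothetical names, and the OCR step is passed in as a function so the loop can be tried out without Tesseract installed.

```python
from pathlib import Path

# File extensions treated as images; this set is an assumption for the example.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def ocr_folder(folder, ocr_func):
    """Apply ocr_func to each image path in folder and return {filename: text}.

    ocr_func receives the file path as a string. Files that raise an
    error are recorded with None so one bad image does not stop the batch.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            try:
                results[path.name] = ocr_func(str(path))
            except Exception as exc:
                print(f"Skipping {path.name}: {exc}")
                results[path.name] = None
    return results


# Possible usage, assuming pytesseract is installed and images/ exists:
# results = ocr_folder("images", pytesseract.image_to_string)
```

Passing the OCR function as an argument also makes it easy to swap in a version that applies pre-processing or custom config options first.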
Conclusion
Python OCR is powerful for extracting text from images. With pytesseract and proper image pre-processing, you can achieve great results.
Remember to check image quality and experiment with settings. For more advanced tasks, combine OCR with other techniques like Python image recognition.
Start with simple images and gradually tackle more complex documents. The possibilities are endless with Python OCR!