Last modified: Apr 12, 2025 By Alexander Williams

Python Text Extraction from Images Guide

Extracting text from images is a common task in data processing. Python makes it easy with OCR tools like Tesseract.

OCR stands for Optical Character Recognition. It converts images that contain text into machine-readable text.

Prerequisites

Before starting, install these Python libraries:


pip install pytesseract pillow

You also need to install the Tesseract OCR engine itself, because pytesseract is only a wrapper around it. On Ubuntu:


sudo apt install tesseract-ocr
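
Before running any OCR code, it can help to confirm that the engine is reachable from Python. The following is a minimal sketch, assuming the binary is named tesseract and is on your PATH. The helper name find_tesseract is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("Tesseract not found. Install it or add it to your PATH.")
else:
    print(f"Tesseract found at {path}")
```

If the binary is installed somewhere that is not on your PATH, you can point pytesseract at it by setting pytesseract.pytesseract.tesseract_cmd to the full path.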

Basic Text Extraction

Here's how to extract text from an image using pytesseract:


from PIL import Image
import pytesseract

# Open image file
image = Image.open('sample.png')

# Extract text
text = pytesseract.image_to_string(image)

print(text)

This code opens an image and extracts all readable text. The image_to_string function runs Tesseract on the image and returns the recognized text as a string.

Improving Accuracy

OCR accuracy depends on image quality. Preprocess images for better results.

First, convert the image to grayscale:


# Convert to 8-bit grayscale ('L' mode), then pass gray_image to image_to_string
gray_image = image.convert('L')

Small text is often recognized poorly, so you might also need to enlarge the image before running OCR. Check our Python Resizing Images Guide for help.
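
The grayscale and resizing steps can be combined into one helper. This is a minimal sketch, not a tuned pipeline: the function name preprocess_for_ocr, the autocontrast step, and the default scale of 2 are assumptions chosen for illustration, and the best settings depend on your images.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2):
    """Return a grayscale, contrast-stretched, enlarged copy of a Pillow image."""
    gray = image.convert("L")
    # Stretch the histogram so the darkest pixels become black and the lightest white.
    gray = ImageOps.autocontrast(gray)
    # Enlarge the image; scale=2 is an arbitrary starting point, not a recommended value.
    new_size = (gray.width * scale, gray.height * scale)
    return gray.resize(new_size, Image.LANCZOS)


# Usage, assuming sample.png exists as in the basic example:
# image = Image.open("sample.png")
# text = pytesseract.image_to_string(preprocess_for_ocr(image))
```

Compare the OCR output with and without preprocessing on a few of your own images, since a step that helps one kind of scan can hurt another.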

Working with PDFs

For PDF files, first convert pages to images. See our Python PDF to Image Conversion Guide.

Then apply text extraction to each image.
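
The two steps above might look like the following sketch. It assumes the third-party pdf2image package (which in turn needs Poppler installed) for the conversion step. That choice is an assumption made for illustration, and the function names are hypothetical.

```python
def extract_text_from_pages(pages, ocr=None):
    """Run OCR on each page image and return one string per page."""
    if ocr is None:
        # Imported here so the helper can be tried with a stand-in OCR function.
        import pytesseract
        ocr = pytesseract.image_to_string
    return [ocr(page) for page in pages]


def extract_text_from_pdf(pdf_path, dpi=300):
    """Convert a PDF to page images, then OCR each page."""
    # Assumption: pdf2image is installed (pip install pdf2image) along with Poppler.
    from pdf2image import convert_from_path
    pages = convert_from_path(pdf_path, dpi=dpi)
    return extract_text_from_pages(pages)


# Usage, assuming a file named document.pdf exists:
# for number, text in enumerate(extract_text_from_pdf("document.pdf"), start=1):
#     print(f"--- Page {number} ---")
#     print(text)
```

Keeping the per-page loop separate from the conversion step means you can swap in a different PDF-to-image method without touching the OCR code.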

Advanced Options

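Tesseract supports many configuration options. To set the recognition language, pass the lang parameter. Join several language codes with a plus sign, for example English and French (the matching language data must be installed):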


text = pytesseract.image_to_string(image, lang='eng+fra')
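
A language code only works if its data files are installed. As a hedged sketch, the helper below compares a lang string against a list of installed codes. The name missing_languages is hypothetical, and the installed list here is example data. In real use it could come from pytesseract.get_languages.

```python
def missing_languages(lang, installed):
    """Return the codes in an 'eng+fra' style string that are not in `installed`."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed]


# In real use, the installed list would come from pytesseract:
#     installed = pytesseract.get_languages(config="")
installed = ["eng", "osd"]  # example values only
print(missing_languages("eng+fra", installed))  # prints ['fra']
```

On Ubuntu, extra languages are packaged separately, for example sudo apt install tesseract-ocr-fra for French.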

You can also get bounding box information for each detected element, along with a confidence score:


data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
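
The returned dictionary holds parallel lists under keys such as text, conf, left, top, width, and height. The sketch below filters it down to confidently recognized words. The helper name confident_words and the threshold of 60 are assumptions for illustration, and the sample dictionary is hand-written rather than real OCR output.

```python
def confident_words(data, min_conf=60):
    """Yield (text, left, top, width, height) for words at or above min_conf."""
    for i, word in enumerate(data["text"]):
        # Rows for blocks and lines have empty text and a confidence of -1.
        if not word.strip():
            continue
        # Confidence may arrive as a string or a number, so normalize it.
        if float(data["conf"][i]) < min_conf:
            continue
        yield (word, data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])


# Hand-written sample in the same shape as the image_to_data output
sample = {
    "text": ["", "Hello", "wrld"],
    "conf": ["-1", "96", "31"],
    "left": [0, 10, 80],
    "top": [0, 12, 12],
    "width": [200, 60, 55],
    "height": [40, 20, 20],
}
print(list(confident_words(sample)))  # prints [('Hello', 10, 12, 60, 20)]
```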

Handling Different Image Formats

The Pillow library, the maintained fork of PIL that you import as PIL, supports many image formats. Learn more in our Python PIL Image Handling Guide.

For JPEG, PNG, BMP, and others, the process is the same.
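
Because the process is the same for each format, a folder of mixed files can be handled in one loop. This is a minimal sketch: the find_images helper, the suffix list, and the scans folder name are all assumptions for illustration.

```python
from pathlib import Path

# Suffixes to treat as images; extend this set as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_images(folder, suffixes=IMAGE_SUFFIXES):
    """Return the image files in a folder, sorted by path, matching suffixes case-insensitively."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Usage, assuming a folder named scans exists:
# for path in find_images("scans"):
#     text = pytesseract.image_to_string(Image.open(path))
#     print(path.name, len(text))
```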

Conclusion

Python makes text extraction from images simple. With Tesseract and some preprocessing, you can get good results.

Remember to clean your images first. Proper sizing and contrast improve accuracy significantly.

For more image processing tasks, explore our other Python guides.