Last modified: Apr 12, 2025 By Alexander Williams
Python Text Extraction from Images Guide
Extracting text from images is a common task in data processing. Python makes it easy with OCR tools like Tesseract.
OCR stands for Optical Character Recognition. It converts images of text into machine-readable text.
Prerequisites
Before starting, install these Python libraries:
pip install pytesseract pillow
You also need to install the Tesseract OCR engine itself, because pytesseract is only a wrapper around it. On Ubuntu:
sudo apt install tesseract-ocr
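Before moving on, it can help to confirm that the engine is actually reachable from Python. The following is a minimal sketch, assuming the executable is named tesseract and sits on your PATH; the helper name tesseract_available is hypothetical and not part of pytesseract.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable with this name is found on PATH.

    The default name "tesseract" is an assumption; adjust it if your
    installation uses a different executable name or location.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; install it before using pytesseract")
```

If the check fails even though Tesseract is installed, the executable is probably in a directory that is not on PATH.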
Basic Text Extraction
Here's how to extract text from an image using pytesseract:
from PIL import Image
import pytesseract
# Open image file
image = Image.open('sample.png')
# Extract text
text = pytesseract.image_to_string(image)
print(text)
This code opens an image and extracts all readable text. The image_to_string function does the OCR work.
Improving Accuracy
OCR accuracy depends on image quality. Preprocess images for better results.
First, convert the image to grayscale:
gray_image = image.convert('L')
You might also need to resize the image. Check our Python Resizing Images Guide for help.
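The grayscale and resizing steps can be combined into one helper. This is only a sketch: the function name preprocess_for_ocr, the 2x scale factor, and the threshold of 128 are illustrative assumptions, and the contrast stretch and black-and-white threshold are extra steps not shown above. Suitable values depend on your images.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=128):
    """Return a grayscale, enlarged, black-and-white copy of the image.

    scale and threshold are illustrative defaults, not recommended values.
    """
    # Convert to 8-bit grayscale, as in the step above
    gray = image.convert("L")
    # Stretch the contrast so dark text stands out from the background
    gray = ImageOps.autocontrast(gray)
    # Enlarge the image; small text is often harder for OCR to read
    new_size = (gray.width * scale, gray.height * scale)
    enlarged = gray.resize(new_size, Image.LANCZOS)
    # Map every pixel to pure black or pure white
    return enlarged.point(lambda p: 255 if p >= threshold else 0)
```

The result can be passed to pytesseract.image_to_string in place of the original image. It is worth comparing the output with and without each step, since aggressive thresholding can also erase faint text.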
Working with PDFs
For PDF files, first convert pages to images. See our Python PDF to Image Conversion Guide.
Then apply text extraction to each image.
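Once the pages are available as images, the per-page loop might look like the sketch below. The helper name extract_text_from_pages and its ocr parameter are hypothetical; by default the helper falls back to pytesseract.image_to_string.

```python
def extract_text_from_pages(pages, ocr=None):
    """Run OCR on each page image and return one string per page.

    pages: an iterable of PIL images, one per PDF page.
    ocr:   optional callable taking an image and returning text
           (hypothetical parameter, handy for testing or swapping engines).
    """
    if ocr is None:
        # Imported lazily so the helper can be used with a custom ocr callable
        import pytesseract
        ocr = pytesseract.image_to_string
    return [ocr(page) for page in pages]
```

One way to produce the pages list, assuming the third-party pdf2image package and its Poppler dependency are installed (this guide does not name a specific converter), is pages = convert_from_path('document.pdf') after from pdf2image import convert_from_path. Joining the returned strings with "\n" gives the text of the whole document.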
Advanced Options
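Tesseract supports many configuration options. For example, pass one or more language codes, joined with +, through the lang parameter (the data for each language must be installed for Tesseract):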
text = pytesseract.image_to_string(image, lang='eng+fra')
You can also get bounding boxes and confidence values for the recognized text:
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
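The returned dictionary holds parallel lists such as text, conf, left, top, width, and height, with one entry per detected element. A minimal sketch for keeping only confidently recognized words follows; the helper name confident_words and the cutoff of 60 are assumptions for illustration.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, (left, top, width, height)) for confident words.

    data is the dictionary returned by image_to_data with Output.DICT.
    min_conf is an illustrative cutoff, not a recommended value.
    """
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        # Non-word entries have empty text and a confidence of -1;
        # conf may be a string or a number depending on the version
        if text.strip() and float(conf) >= min_conf:
            words.append((text, (left, top, width, height)))
    return words
```

Calling confident_words(data) on the dictionary from the line above yields each kept word with its box, which you could use to crop or highlight regions of the image.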
Handling Different Image Formats
Pillow, the maintained fork of PIL that you installed above and import as PIL, supports many image formats. Learn more in our Python PIL Image Handling Guide.
For JPEG, PNG, BMP, and others, the process is the same.
Conclusion
Python makes text extraction from images simple. With Tesseract and some preprocessing, you can get good results.
Remember to clean up your images first: appropriate sizing and contrast tend to improve accuracy.
For more image processing tasks, explore our other Python guides.