Last modified: Jan 10, 2025 By Alexander Williams
Extract Text from PDFs with Python PdfReader
Working with PDFs in Python is common, and extracting text is a key task. The PdfReader.extract_text() method makes it easy.
This article explains how to use PdfReader.extract_text(). It includes examples and tips for beginners.
What is PdfReader.extract_text()?
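The PdfReader.extract_text() method extracts text from PDF pages. It is part of the PyPDF2 library. Strictly speaking, extract_text() belongs to the page objects in PdfReader.pages, so you call it on a page rather than on the reader itself.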
Before using it, ensure PyPDF2 is installed. If it is not, install it with pip install PyPDF2.
How to Use PdfReader.extract_text()
First, import the PyPDF2 library. Then, open the PDF file in binary mode and create a PdfReader from it. Finally, call extract_text() on a page to extract its text.
Here’s an example:
import PyPDF2

# Open the PDF file in binary mode
with open('example.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Extract text from the first page
    text = reader.pages[0].extract_text()

print(text)
This code opens a PDF file. It extracts text from the first page. The output is printed to the console.
Example Output
Here’s what the output might look like:
This is an example PDF document.
It contains text that can be extracted using Python.
The output is the text from the PDF. It can be saved or processed further.
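As a small illustration of processing and saving the extracted text, the sketch below tidies whitespace and defines a helper that writes text to a file. The helper names clean_text and save_text are assumptions made for this example, not part of PyPDF2, and the sample string stands in for the value returned by extract_text().

```python
import re
from pathlib import Path


def clean_text(text):
    """Collapse runs of spaces/tabs and strip each line (hypothetical helper)."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop lines that are empty after stripping
    return "\n".join(line for line in lines if line)


def save_text(text, out_path):
    """Write text to out_path as UTF-8 and return the path (hypothetical helper)."""
    path = Path(out_path)
    path.write_text(text, encoding="utf-8")
    return path


# Stand-in for the string returned by reader.pages[0].extract_text()
sample = "This is an   example PDF document.  \n\nIt contains text that can be extracted using Python."
cleaned = clean_text(sample)
print(cleaned)
```

To keep the result, call save_text(cleaned, "output.txt"), where the file name is only a placeholder.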
Handling Multiple Pages
To extract text from all pages, loop through them. Use len(reader.pages) to count pages. Then, extract text from each page.
Here’s how:
import PyPDF2

# Open the PDF file in binary mode
with open('example.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Loop through all pages
    for page_num in range(len(reader.pages)):
        text = reader.pages[page_num].extract_text()
        print(f"Page {page_num + 1}:\n{text}\n")
This code extracts text from every page. It prints the text with the page number.
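The same loop can be wrapped in a reusable function that returns the text instead of printing it. The following is a minimal sketch: extract_pages and join_pages are hypothetical helper names, and extract_pages accepts any object that exposes a pages sequence whose items have an extract_text() method, which is how a PyPDF2 PdfReader behaves. The fallback to an empty string is a defensive assumption for pages that yield no text.

```python
def extract_pages(reader):
    """Return a list with one text string per page (hypothetical helper).

    `reader` is expected to behave like PyPDF2.PdfReader: it has a `pages`
    sequence, and each page has an `extract_text()` method.
    """
    # Fall back to an empty string if a page yields nothing
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(texts):
    """Join page texts under 'Page N:' headers, numbering from 1."""
    return "\n".join(
        f"Page {n}:\n{text}\n" for n, text in enumerate(texts, start=1)
    )
```

With PyPDF2 installed, call extract_pages(PyPDF2.PdfReader(file)) inside the with block, so the file is still open while the pages are read.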
Common Errors and Fixes
Sometimes, you may encounter errors. One common error is ModuleNotFoundError: No module named 'PyPDF2'. This happens if PyPDF2 is not installed in the Python environment you are running.
To fix it, install PyPDF2 with pip install PyPDF2.
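One way to make the missing-module case easier to diagnose is to check for the package before importing it. This is a sketch using only the standard library; the helper name is_installed is an assumption made for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is available.")
else:
    print("PyPDF2 is missing. Install it with: pip install PyPDF2")
```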
Conclusion
Extracting text from PDFs is easy with Python. The PdfReader.extract_text() method needs only a few lines of code. Use it to process single pages or whole documents.
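For more advanced tasks, explore reader.pages[i] and len(reader.pages). They replace the older PdfReader.getPage() and PdfReader.getNumPages() methods, which were removed in PyPDF2 3.0.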