Last modified: Jan 10, 2025 By Alexander Williams

Extract Text from PDFs with Python PdfReader

Working with PDFs in Python is common, and extracting text is one of the most frequent tasks. PyPDF2's PdfReader class, together with the extract_text() method of its page objects, makes this straightforward.

This article explains how to use PdfReader.extract_text(). It includes examples and tips for beginners.

What is PdfReader.extract_text()?

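The extract_text() method extracts text from PDF pages. It is part of the PyPDF2 library. Strictly speaking, it is a method of the page objects returned by PdfReader.pages, not of PdfReader itself, so you call it as reader.pages[0].extract_text().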

Before using it, ensure PyPDF2 is installed (typically with pip install PyPDF2). If not, follow this step-by-step guide.

How to Use PdfReader.extract_text()

First, import the PyPDF2 library. Then, open the PDF file in binary mode and pass it to PdfReader. Finally, call extract_text() on a page from reader.pages.

Here’s an example:


    import PyPDF2

    # Open the PDF file
    with open('example.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        
        # Extract text from the first page
        text = reader.pages[0].extract_text()
        print(text)
    

This code opens a PDF file. It extracts text from the first page. The output is printed to the console.

Example Output

Here’s what the output might look like:


    This is an example PDF document.
    It contains text that can be extracted using Python.
    

The output is the text from the PDF. It can be saved or processed further.
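
As one way to save the text, the sketch below writes it to a UTF-8 file. The helper names save_text and extract_first_page and the file names are hypothetical choices for this example, not part of PyPDF2.

```python
from pathlib import Path


def save_text(text, out_path):
    """Write extracted text to a UTF-8 file and return the character count."""
    Path(out_path).write_text(text, encoding="utf-8")
    return len(text)


def extract_first_page(pdf_path):
    """Return the text of the first page of the PDF at pdf_path."""
    import PyPDF2  # imported here, so PyPDF2 is only needed when this runs

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        return reader.pages[0].extract_text()


# Example usage (assumes example.pdf exists in the current directory):
# text = extract_first_page("example.pdf")
# save_text(text, "example.txt")
```

Keeping the file writing separate from the extraction makes it easy to swap in other processing, such as counting words or searching for a keyword.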

Handling Multiple Pages

To extract text from all pages, loop over reader.pages. The number of pages is available as len(reader.pages); the older PdfReader.getNumPages() was removed in PyPDF2 3.0. Then, extract text from each page.

Here’s how:


    import PyPDF2

    # Open the PDF file
    with open('example.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        
        # Loop through all pages
        for page_num in range(len(reader.pages)):
            text = reader.pages[page_num].extract_text()
            print(f"Page {page_num + 1}:\n{text}\n")
    

This code extracts text from every page. It prints the text with the page number.
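
If you would rather collect the text than print it, a small helper along these lines may be useful. This is a minimal sketch: extract_all_text is a hypothetical name, and it only assumes the reader exposes a pages sequence whose items have an extract_text() method, as PdfReader does.

```python
def extract_all_text(reader):
    """Join the text of every page in a PdfReader-like object.

    Pages that yield no text contribute an empty string.
    """
    parts = []
    for page in reader.pages:
        # Treat a missing or empty result as empty text
        text = page.extract_text() or ""
        parts.append(text)
    # Separate pages with a newline
    return "\n".join(parts)


# Example usage (assumes PyPDF2 is installed and example.pdf exists):
# import PyPDF2
# with open("example.pdf", "rb") as file:
#     full_text = extract_all_text(PyPDF2.PdfReader(file))
```

Pages that contain only scanned images usually return no text, which is why the helper falls back to an empty string rather than failing.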

Common Errors and Fixes

Sometimes, you may encounter errors. One common error is ModuleNotFoundError: No module named 'PyPDF2'. This happens if PyPDF2 is not installed in the Python environment that runs your script.

To fix it, install PyPDF2. Follow this guide.
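
To check for the problem before running your script, you can ask Python whether the module is available. The sketch below uses the standard library's importlib; is_installed is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("PyPDF2"):
    print("PyPDF2 is not installed. Try: pip install PyPDF2")
```

Run the check with the same interpreter that runs your script, since each Python environment has its own set of installed packages.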

Conclusion

Extracting text from PDFs is straightforward with Python. Open the file with PdfReader, then call extract_text() on each page you need.

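For more advanced tasks, explore reader.pages for per-page access and len(reader.pages) for the page count. These replace the older PdfReader.getPage() and PdfReader.getNumPages(), which were removed in PyPDF2 3.0.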