Last modified: Jan 11, 2025 By Alexander Williams

Python PdfReader.getOutlines: Extract PDF Outlines

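Libraries such as pypdf (the successor to PyPDF2) make it easy to work with PDFs in Python through the PdfReader class. One useful feature is outline extraction, historically exposed as the getOutlines method and available as the outline property in current pypdf releases. It lets you extract the outlines, or bookmarks, from a PDF.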

This article will guide you through using getOutlines. You'll learn how to extract and navigate PDF outlines. We'll also provide example code and outputs.

What is PdfReader.getOutlines?

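PdfReader is a class, not a library of its own: it ships with pypdf and its predecessor PyPDF2. getOutlines is the legacy PyPDF2 name for the outline accessor on the reader; current pypdf releases expose the same data as the PdfReader.outline property. It retrieves the outline structure of a PDF. Outlines are often used as bookmarks or a table of contents.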

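It returns a list of outline objects. Each object carries a title and a reference to its destination page. The page number itself is resolved through the reader (in pypdf, with get_destination_page_number, which returns a zero-based index). Sub-outlines appear as nested lists. This makes it useful for navigating large PDFs programmatically.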
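As a rough illustration of navigating such a structure, the sketch below searches an outline for titles containing a keyword. It assumes sub-outlines arrive as nested lists and that each entry has a title attribute; Entry and find_by_title are hypothetical names used only for this example, with Entry standing in for the objects a real reader would return.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for an outline object; only .title is read."""
    title: str


def find_by_title(outline, keyword):
    """Return every entry whose title contains keyword (case-insensitive)."""
    matches = []
    for entry in outline:
        if isinstance(entry, list):
            # Assumption: sub-outlines arrive as nested lists
            matches.extend(find_by_title(entry, keyword))
        elif keyword.lower() in entry.title.lower():
            matches.append(entry)
    return matches


# Made-up sample data shaped like an outline with one nested level
sample = [Entry("Introduction"), Entry("Chapter 1"), [Entry("Chapter 1 Summary")]]
print([e.title for e in find_by_title(sample, "chapter")])
```

With a real PDF, you would pass the list returned by the reader instead of the sample data.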

How to Use PdfReader.getOutlines

To use getOutlines, you first need to install the library that provides PdfReader. With pip, that is pip install pypdf. If you haven't installed it yet, check out our step-by-step guide.

Once installed, you can start extracting outlines. Below is an example code snippet:


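from pypdf import PdfReader

# Load the PDF file
pdf = PdfReader("example.pdf")

# Get the outlines (a property in current pypdf; getOutlines() is the legacy name)
outlines = pdf.outline

# Print the top-level outlines
for outline in outlines:
    if isinstance(outline, list):
        continue  # nested lists hold sub-outlines; skipped in this simple example
    # get_destination_page_number returns a zero-based index (or None if unresolved)
    page_index = pdf.get_destination_page_number(outline)
    page = page_index + 1 if page_index is not None else "unknown"
    print(f"Title: {outline.title}, Page: {page}")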
    

This code loads a PDF file and extracts its outlines. It then prints the title and page number of each top-level outline.
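The name of the outline accessor has changed across releases. As far as we know, older PyPDF2 versions offered a getOutlines() method, later ones an outlines property, and current pypdf an outline property. The sketch below shows one way to cope with that in code that must run against several versions. get_outline is a hypothetical helper, and the two small classes are stand-ins for real reader objects so the example runs without a PDF file.

```python
_MISSING = object()


def get_outline(reader):
    """Return the outline of a reader-like object, whichever accessor it offers."""
    # Try property-style accessors first (newer naming), reading each only once
    for name in ("outline", "outlines"):
        value = getattr(reader, name, _MISSING)
        if value is not _MISSING:
            return value
    # Fall back to the legacy method-style accessor
    legacy = getattr(reader, "getOutlines", None)
    if callable(legacy):
        return legacy()
    raise AttributeError("reader exposes no known outline accessor")


class NewStyleReader:
    """Stand-in for a reader that exposes an outline property."""
    outline = ["Introduction"]


class LegacyReader:
    """Stand-in for a reader that only has getOutlines()."""
    def getOutlines(self):
        return ["Introduction"]


print(get_outline(NewStyleReader()))
print(get_outline(LegacyReader()))
```

If you control the environment, pinning a single library version and calling its accessor directly is simpler than this kind of fallback.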

Example Output

Here’s what the output might look like for a sample PDF:


Title: Introduction, Page: 1
Title: Chapter 1, Page: 5
Title: Chapter 2, Page: 10
    

This output shows the titles and corresponding page numbers of the outlines. It helps you understand the structure of the PDF.

Why Use PdfReader.getOutlines?

Extracting outlines is useful for many tasks. For example, you can create a table of contents or navigate large documents. It’s also helpful for automating PDF processing tasks.
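To make the table-of-contents idea concrete, here is a minimal sketch that renders outline data as plain text. It assumes you have already collected (level, title, page) tuples from the outline; format_toc is a hypothetical helper, and the sample entries are made up for illustration.

```python
def format_toc(entries, width=40):
    """Render (level, title, page) tuples as an indented table of contents."""
    lines = []
    for level, title, page in entries:
        label = "  " * level + title
        page_text = str(page)
        # Pad with dots so page numbers line up in the right-hand column
        dots = "." * max(2, width - len(label) - len(page_text))
        lines.append(f"{label} {dots} {page_text}")
    return "\n".join(lines)


# Made-up sample entries: (nesting level, title, page number)
entries = [
    (0, "Introduction", 1),
    (0, "Chapter 1", 5),
    (1, "Section 1.1", 6),
    (0, "Chapter 2", 10),
]
print(format_toc(entries))
```

The width parameter controls where the page column sits; very long titles simply push the page number further right.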

If you're working with PDF forms, you might also want to check out how to extract form text fields.

Handling Complex Outlines

Some PDFs have nested outlines, that is, outlines with sub-outlines. The returned structure covers these as well: in pypdf, the children of an outline appear as a nested list placed right after their parent.

Here’s an example of extracting nested outlines:


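from pypdf import PdfReader

def print_outlines(pdf, outlines, level=0):
    for outline in outlines:
        if isinstance(outline, list):
            # A nested list holds the sub-outlines of the preceding item
            print_outlines(pdf, outline, level + 1)
        else:
            # get_destination_page_number returns a zero-based index (or None)
            page_index = pdf.get_destination_page_number(outline)
            page = page_index + 1 if page_index is not None else "unknown"
            print(f"{'  ' * level}Title: {outline.title}, Page: {page}")

# Load the PDF file
pdf = PdfReader("nested_example.pdf")

# Get and print nested outlines
print_outlines(pdf, pdf.outline)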
    

This code recursively prints nested outlines. It indents sub-outlines for better readability.
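If you need the hierarchy as data rather than printed text, for example to export it as JSON, one option is to convert the outline into nested dictionaries. The sketch below assumes the nested-list convention, where a list following an item holds that item's children. Item and outline_to_tree are hypothetical names, and Item stands in for the outline objects a real reader would return.

```python
import json
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for an outline object; only .title is read."""
    title: str


def outline_to_tree(outline):
    """Convert a nested outline list into a list of {'title', 'children'} dicts."""
    tree = []
    for entry in outline:
        if isinstance(entry, list):
            children = outline_to_tree(entry)
            if tree:
                # Attach the nested list to the item that precedes it
                tree[-1]["children"].extend(children)
            else:
                # No preceding item to attach to: keep the entries at this level
                tree.extend(children)
        else:
            tree.append({"title": entry.title, "children": []})
    return tree


# Made-up sample data shaped like an outline with one nested level
sample = [Item("Introduction"), Item("Chapter 1"), [Item("Section 1.1")], Item("Chapter 2")]
print(json.dumps(outline_to_tree(sample), indent=2))
```

A tree like this is easier to pass to other tools than indented text, at the cost of dropping page information unless you add it to each dictionary.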

Conclusion

The getOutlines method is a powerful tool for working with PDFs in Python. It helps you extract and navigate outlines easily. This is especially useful for large documents.

If you're interested in other PDF-related tasks, check out our guides on extracting PDF metadata and counting PDF pages.

Start using getOutlines today to make your PDF processing tasks easier and more efficient!