Last modified: Jan 11, 2025 By Alexander Williams

Python PdfReader.getDocumentInfo: Extract PDF Metadata

Working with PDFs in Python is made easy with libraries like PyPDF2, which provides the PdfReader class. One useful feature of that class is the getDocumentInfo method. It helps you extract metadata from PDF files.

Metadata includes details like the title, author, and creation date. This article will show you how to use getDocumentInfo effectively.

What is PdfReader.getDocumentInfo?

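The getDocumentInfo method is part of the PdfReader class in PyPDF2. It retrieves metadata from a PDF file's document information dictionary and returns it as a dictionary-like object. Note that getDocumentInfo is deprecated since PyPDF2 1.28 and was removed in PyPDF2 3.0, where the metadata property of PdfReader returns the same information.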

Common metadata fields include Title, Author, and CreationDate. These details can be useful for organizing and analyzing PDFs.

How to Use PdfReader.getDocumentInfo

To use getDocumentInfo, you first need to install the PyPDF2 library, which provides the PdfReader class (for example, with pip install PyPDF2). Follow our step-by-step guide for installation instructions.

Once installed, you can start extracting metadata. Below is an example code snippet:


from PyPDF2 import PdfReader

# Load the PDF file
reader = PdfReader("example.pdf")

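# Get document information.
# Note: getDocumentInfo() is deprecated since PyPDF2 1.28 and removed in 3.0;
# on newer versions (and in pypdf) read the reader.metadata property instead.
info = reader.getDocumentInfo()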

# Print metadata
print(info)

This code loads a PDF file and retrieves its metadata. The result is a dictionary-like object containing the document's metadata, or None if the PDF has no document information dictionary.
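Because getDocumentInfo was deprecated and later removed in favor of the metadata property, code that has to run against more than one PyPDF2 version may want a small wrapper. The following is a minimal sketch of that idea. The name read_document_info is a hypothetical helper, not part of PyPDF2, and it assumes only that the reader object exposes one of the two accessors.

```python
def read_document_info(reader):
    """Return a reader's document info using whichever accessor exists.

    Hypothetical helper (not part of PyPDF2). It prefers the newer
    `metadata` property and falls back to the older `getDocumentInfo()`.
    """
    if hasattr(reader, "metadata"):
        return reader.metadata
    if hasattr(reader, "getDocumentInfo"):
        return reader.getDocumentInfo()
    raise AttributeError("reader exposes neither metadata nor getDocumentInfo")
```

With a reader created as in the example above, info = read_document_info(reader) would then replace the direct call to getDocumentInfo.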

Example Output

Here is an example of what the output might look like:


{
    '/Title': 'Sample PDF',
    '/Author': 'John Doe',
    '/CreationDate': 'D:20231010120000',
    '/Producer': 'PDF Producer'
}

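Each key in the dictionary represents a metadata field, and each value holds the corresponding detail. The keys begin with a slash because PDF stores field names as name objects, which are written with a leading "/".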
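If the leading slash is inconvenient, the keys can be normalized before further processing. The sketch below works on any dictionary-like object shaped like the example output above. The function name clean_metadata is an illustrative choice, not a PyPDF2 function.

```python
def clean_metadata(info):
    """Return a plain dict with the leading '/' removed from each key.

    `info` may be None (no metadata) or any dict-like object shaped
    like the example output above.
    """
    if not info:
        return {}
    return {str(key).lstrip("/"): value for key, value in info.items()}


# Sample data copied from the example output
sample = {
    "/Title": "Sample PDF",
    "/Author": "John Doe",
    "/CreationDate": "D:20231010120000",
    "/Producer": "PDF Producer",
}

cleaned = clean_metadata(sample)
print(cleaned["Title"])  # Sample PDF
print(sorted(cleaned))   # ['Author', 'CreationDate', 'Producer', 'Title']
```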

Common Metadata Fields

Here are some common metadata fields you might encounter:

  • Title: The title of the document.
  • Author: The author of the document.
  • CreationDate: The date and time the document was created, stored as a string such as D:20231010120000.
  • Producer: The software used to create the PDF.

These fields can vary depending on the PDF. Some PDFs may have additional or fewer metadata fields.
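The CreationDate value is a plain string in the PDF date format (D: followed by year, month, day, hour, minute, second, and an optional time zone offset), so it usually needs parsing before it can be compared or sorted. The following is a rough sketch using only the standard library. parse_pdf_date is a hypothetical helper name, and the pattern covers common forms of the format rather than every variant found in real files.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset.
# Every component after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime, or return None on failure."""
    match = _PDF_DATE.match(str(value).strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = match.groups()
    try:
        tzinfo = None
        if zulu:
            tzinfo = timezone.utc
        elif sign:
            offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
            tzinfo = timezone(offset if sign == "+" else -offset)
        # Missing month/day default to 1, missing time parts to 0
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range values such as month 13 or an oversized offset
        return None


print(parse_pdf_date("D:20231010120000"))  # 2023-10-10 12:00:00
```

Returning None instead of raising keeps the helper easy to use when scanning many files with inconsistent metadata.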

Handling Missing Metadata

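Not all PDFs have metadata. If a field is missing, the dictionary will not include it, and if the PDF has no document information dictionary at all, the result is None instead of a dictionary. You can check for both cases using conditional statements.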

Here’s how you can handle missing metadata:


if info is not None and '/Title' in info:
    print(f"Title: {info['/Title']}")
else:
    print("Title not available")

This code first checks that metadata was returned at all, then checks whether the Title field exists. If it does, it prints the title. Otherwise, it prints a message indicating the title is missing.
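When several fields need the same treatment, the check can be wrapped in a small function. This is one possible sketch: get_field is a hypothetical helper, not a PyPDF2 function, and treating empty strings as missing is an assumption made for this example.

```python
def get_field(info, name, default="Not available"):
    """Look up a metadata field, tolerating missing fields and info=None."""
    if info is None:
        return default
    # Accept both "Title" and "/Title"
    key = name if name.startswith("/") else "/" + name
    value = info.get(key)
    # Treat absent and empty values alike (an assumption of this sketch)
    return default if value in (None, "") else value


# Stand-in for the object returned by the reader
info = {"/Title": "Sample PDF", "/Author": ""}

for field in ("Title", "Author", "Subject"):
    print(f"{field}: {get_field(info, field)}")
# Title: Sample PDF
# Author: Not available
# Subject: Not available
```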

Conclusion

The getDocumentInfo method gives you direct access to the metadata stored in a PDF, including details like the title, author, and creation date. On PyPDF2 3.0 and later, the metadata property of PdfReader provides the same information.

By following this guide, you can easily retrieve and handle PDF metadata in Python. For more advanced PDF operations, check out our articles on extracting text and counting pages.

Start using getDocumentInfo today to enhance your PDF processing tasks!