Last modified: Jan 11, 2025 By Alexander Williams

Python PdfReader.getXmpMetadata: Extract PDF Metadata

Working with PDFs in Python is made easy with libraries like PdfReader. One powerful feature is the getXmpMetadata method. It helps extract XMP metadata from PDF files.

XMP metadata contains valuable information about the PDF. This includes details like author, title, and creation date. Extracting this data can be crucial for document management and analysis.

What is XMP Metadata?

XMP stands for Extensible Metadata Platform. It is a standard for storing metadata in files. In PDFs, XMP metadata can include information like the document's title, author, and keywords.

This metadata is often embedded in the PDF file. It can be accessed programmatically using Python. The getXmpMetadata method is designed for this purpose.

How to Use PdfReader.getXmpMetadata

To use getXmpMetadata, you first need to install the PdfReader library. If you haven't installed it yet, follow our step-by-step guide.

Once installed, you can start extracting XMP metadata. Below is an example of how to use the getXmpMetadata method.


from PyPDF2 import PdfReader

# Load the PDF file
reader = PdfReader("example.pdf")

# Extract XMP metadata
xmp_metadata = reader.getXmpMetadata()

# Print the metadata
print(xmp_metadata)
    

In this example, we load a PDF file named example.pdf. We then use the getXmpMetadata method to extract the XMP metadata. Finally, we print the metadata to the console.

Example Output

Here is an example of what the output might look like:


{'xmp:CreatorTool': 'Adobe Acrobat Pro DC', 'dc:title': 'Sample Document', 'dc:creator': 'John Doe'}
    

The output is a dictionary containing key-value pairs. Each pair represents a piece of metadata. For example, dc:title is the document's title, and dc:creator is the author.

Common Use Cases

Extracting XMP metadata is useful in many scenarios. For example, you might need to organize a large collection of PDFs. Metadata can help you sort and categorize these documents.

Another use case is document analysis. By extracting metadata, you can gain insights into the document's origin and history. This can be particularly useful in legal or academic settings.

If you're also interested in extracting other types of data from PDFs, check out our guides on extracting document info and extracting text.

Conclusion

The getXmpMetadata method in Python's PdfReader is a powerful tool. It allows you to extract XMP metadata from PDF files with ease. This metadata can be invaluable for document management and analysis.

By following the steps outlined in this article, you can start extracting metadata from your PDFs today. For more advanced PDF manipulation, explore other features like extracting form data or extracting outlines.