Last modified: Feb 07, 2026 By Alexander Williams
Python PDF Libraries Guide | Create & Read PDFs
PDFs are everywhere. They are the standard for documents that need to look the same on any device.
But working with them programmatically can be tricky. Python makes it easier.
This guide explores the best Python libraries for PDF tasks. You will learn how to create, read, and edit PDF files.
Why Use Python for PDF Tasks?
Python is a powerful and simple language. It has a rich ecosystem of libraries for almost any task.
For PDFs, these libraries handle complex formats. They let you automate document workflows.
You can generate reports, extract data, merge files, and add password protection, often with just a few lines of Python code.
Top Python PDF Libraries
Different libraries serve different purposes. Some are great for creating PDFs from scratch.
Others excel at reading and extracting content. Let's look at the most popular ones.
1. pypdf (formerly PyPDF2)
pypdf is a classic. It is a pure-Python library for basic PDF operations.
It is well suited to tasks like merging, splitting, and rotating pages. You can also add watermarks.
It is less suited to creating complex new documents or extracting text with its layout preserved.
Note: PyPDF2 is deprecated. Its development moved back into the pypdf package, which is the maintained successor. PyPDF4 is a separate, older fork, not a replacement for PyPDF2.
# Example: Merging two PDFs with pypdf
from pypdf import PdfWriter
# Create a writer object to collect the pages
merger = PdfWriter()
# Append the PDF files
merger.append('document1.pdf')
merger.append('document2.pdf')
# Write out the merged PDF
merger.write('merged_output.pdf')
merger.close()
print("PDFs merged successfully!")
2. ReportLab
ReportLab is the industry standard for generating PDFs from scratch.
It gives you precise control over layout. You can place text, images, and shapes at exact coordinates, measured in points (1/72 inch) from the bottom-left corner of the page.
It is used for invoices, reports, and certificates. For a deep dive, see our Python PDF Generator Guide.
# Example: Creating a simple PDF with ReportLab
from reportlab.pdfgen import canvas
# Create a canvas object
c = canvas.Canvas("hello_reportlab.pdf")
# Draw a string 100 points from the left edge and 750 points from the bottom
c.drawString(100, 750, "Hello, World!")
c.drawString(100, 730, "Generated with ReportLab.")
# Save the PDF
c.save()
# Output: A PDF file named 'hello_reportlab.pdf' is created.
3. pdfplumber
pdfplumber is excellent for data extraction. It focuses on reading and analyzing PDFs.
It provides detailed information about every character, line, and table on a page.
This makes it superior for extracting text with its position or pulling data from tables. Learn more in our Python PDF Reader Guide.
# Example: Extracting text and table data with pdfplumber
import pdfplumber
with pdfplumber.open('sample_invoice.pdf') as pdf:
    first_page = pdf.pages[0]
    # Extract all text
    text = first_page.extract_text()
    print("Extracted Text:")
    print(text[:300])  # Print first 300 characters
    # Extract tables
    tables = first_page.extract_tables()
    for i, table in enumerate(tables):
        print(f"\nTable {i+1}:")
        for row in table:
            print(row)
4. PDFMiner.six
PDFMiner.six is a tool for extracting text and metadata. It is very powerful for parsing.
It can analyze the layout of text and give you the exact location of elements. It's great for complex document analysis.
For specialized parsing tasks, check out our Python PDF Parser Guide.
Choosing the Right Library
Your choice depends on your goal. Here is a simple guide.
Use ReportLab to create new, styled PDFs from scratch.
Use pypdf (the maintained successor to PyPDF2) for simple editing tasks on existing PDFs.
Use pdfplumber or PDFMiner.six to extract text, tables, and data accurately.
Many projects use a combination. You might generate with ReportLab and then merge with pypdf.
Common Tasks and Examples
Extracting Metadata
PDFs contain hidden information like author and title. This is called metadata.
You can read it with pypdf, the maintained successor to PyPDF2. For XMP metadata specifically, the reader's xmp_metadata property is useful.
# Example: Reading PDF metadata with pypdf
from pypdf import PdfReader
with open('document.pdf', 'rb') as file:
    reader = PdfReader(file)
    info = reader.metadata  # None if the file has no document information
    print("PDF Metadata:")
    if info is not None:
        print(f"Title: {info.title}")
        print(f"Author: {info.author}")
        print(f"Subject: {info.subject}")
    print(f"Number of Pages: {len(reader.pages)}")
Working with Bookmarks and Outlines
Bookmarks help navigate long PDFs. You can read them from existing files.
In pypdf, the reader's outline property returns them. You can also add new ones when creating or editing a PDF.
The add_outline_item() method of a PdfWriter object is used for this. Older PyPDF2 releases named these getOutlines() and add_bookmark().
Handling PDF Forms
Interactive PDFs can have form fields. You can extract data users have entered.
In pypdf, the reader's get_fields() method retrieves all form field data at once (older PyPDF2 releases called it getFields()). This is crucial for processing surveys or applications.
Best Practices and Tips
Always open PDF files in binary mode ('rb' for read, 'wb' for write).
Close file objects or use the `with` statement so that file handles are not left open.
PDF is a complex format. Be prepared for edge cases and malformed files.
For advanced tasks like converting pages to images, specialized libraries exist. Our Python PDF to Image Conversion Guide covers this.
Conclusion
Python offers a powerful toolkit for PDF manipulation. You can automate document creation, data extraction, and file management.
Start with pypdf (the maintained successor to PyPDF2) for basic editing. Use ReportLab for professional-grade PDF generation.
Choose pdfplumber for reliable text and table extraction. Combine libraries to build robust document pipelines.
The key is to pick the right tool for your specific task. With these libraries, you can handle almost any PDF challenge.