Last modified: Jan 28, 2026 By Alexander Williams
Convert PDF to HTML with Python | Easy Guide
Need to turn a PDF into a webpage? Python can help. This guide shows you how. We will use powerful libraries. You will get clean, usable HTML.
Converting PDFs is a common task. It makes documents web-friendly. Python offers several tools for this job. We will explore the best options.
Why Convert PDF to HTML?
PDFs are great for printing. They keep formatting fixed. But they are not ideal for the web. HTML is flexible and interactive.
Converting to HTML makes content searchable. It improves accessibility. It allows content to fit any screen. This is crucial for mobile users.
You might need this for a content management system. Or to display reports online. Python automates the process easily.
Top Python Libraries for PDF to HTML
Two main tools stand out: pdf2htmlEX and PyMuPDF (imported as fitz). Each has different strengths.
pdf2htmlEX is a command-line tool. It creates very accurate HTML. It preserves layout and fonts beautifully. You call it from Python using the subprocess module.
PyMuPDF is a Python binding for the MuPDF C library. You install it with pip. It is great for extracting text and images. It gives you more control over the output. You can build custom HTML.
Method 1: Using pdf2htmlEX
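First, install pdf2htmlEX on your system. It is not a Python package, so pip will not install it. How you install it depends on your platform and release. Some systems offer it through a package manager such as apt-get or Homebrew, and the project also publishes prebuilt packages. Check the project's installation instructions, then confirm that pdf2htmlEX runs from your terminal.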
Then, use Python to run it. The subprocess.run() function is key. Here is a simple script.
import os
import subprocess


def convert_pdf_to_html_pdf2html(pdf_path, output_dir="output"):
    """
    Converts a PDF file to HTML using the pdf2htmlEX command-line tool.
    """
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # pdf2htmlEX writes the output file inside --dest-dir, so pass a bare
    # file name to the tool and keep the full path only for reporting
    output_name = "converted.html"
    output_file = os.path.join(output_dir, output_name)

    # Build the command
    # '--zoom 1.5' adjusts scaling, '--dest-dir' sets output folder
    command = [
        "pdf2htmlEX",
        "--zoom", "1.5",
        "--dest-dir", output_dir,
        pdf_path,
        output_name,
    ]

    try:
        # Execute the command; check=True raises on a non-zero exit code
        subprocess.run(command, check=True, capture_output=True, text=True)
        print("Conversion successful!")
        print(f"HTML saved to: {output_file}")
        return output_file
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed: {e.stderr}")
        return None
    except FileNotFoundError:
        print("Error: pdf2htmlEX is not installed or not in your system's PATH.")
        return None


# Example usage
if __name__ == "__main__":
    convert_pdf_to_html_pdf2html("sample_report.pdf")
This script runs the external tool. It saves an HTML file. The output keeps the PDF's look. It uses CSS for positioning.
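Before running a conversion, it can help to confirm that the tool is available and to inspect the command you are about to run. The sketch below is one way to do that with the standard library. The helper names find_pdf2htmlex and build_command are hypothetical, and the options are the --zoom and --dest-dir flags already used in this guide.

```python
import shutil


def find_pdf2htmlex(executable="pdf2htmlEX"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def build_command(pdf_path, output_dir, output_name="converted.html", zoom=1.5):
    """Build the argument list for subprocess.run() without running anything."""
    return [
        "pdf2htmlEX",
        "--zoom", str(zoom),
        "--dest-dir", output_dir,
        pdf_path,
        output_name,  # bare file name; the tool writes it inside output_dir
    ]


if __name__ == "__main__":
    if find_pdf2htmlex() is None:
        print("pdf2htmlEX was not found on PATH; install it first.")
    else:
        print(" ".join(build_command("sample_report.pdf", "output")))
```

Keeping the command in its own function makes it easy to print or log before passing it to subprocess.run().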
Method 2: Using PyMuPDF (fitz)
PyMuPDF gives you data from the PDF. You then write the HTML yourself. This is good for text-heavy documents.
First, install the library. Use pip.
pip install PyMuPDF
Now, let's write a converter. It extracts text and basic formatting.
import fitz  # PyMuPDF


def convert_pdf_to_html_pymupdf(pdf_path, output_path="output_pymupdf.html"):
    """
    Extracts text from a PDF and creates a simple HTML file.
    """
    # Open the PDF document
    doc = fitz.open(pdf_path)
    html_content = ["<!DOCTYPE html>", "<html>", "<head>", "<title>Converted PDF</title>", "</head>", "<body>"]

    # Iterate through each page
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Extract text as HTML (preserves some formatting)
        text = page.get_text("html")
        html_content.append(f"<div class='page' id='page-{page_num+1}'>")
        html_content.append(text)
        html_content.append("</div>")
        html_content.append("<hr>")  # Add a separator between pages

    html_content.append("</body>")
    html_content.append("</html>")

    # Write the HTML to a file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(html_content))

    print(f"HTML file created: {output_path}")
    doc.close()
    return output_path


# Example usage
if __name__ == "__main__":
    convert_pdf_to_html_pymupdf("sample_report.pdf")
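The get_text("html") method is powerful. It returns each page as an HTML fragment. Text arrives in <p> and <span> elements that carry inline styles for position and font. Bold text is wrapped in <b> tags. Images on the page are embedded in the markup too.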
The output is a basic HTML file. You can style it with CSS. For more complex needs, extract images and links too.
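One way to style the result is to put a small stylesheet in the <head> when you assemble the document. The sketch below assumes you already have one HTML fragment per page, for example from page.get_text("html"). The function name wrap_pages and the CSS rules are illustrative choices, not part of PyMuPDF.

```python
import html

# Example rules only; replace them with your own site styles
PAGE_CSS = """
body { font-family: Arial, Helvetica, sans-serif; background: #f0f0f0; }
.page { margin: 1em auto; padding: 1em; background: #fff; border: 1px solid #ccc; }
"""


def wrap_pages(page_fragments, title="Converted PDF"):
    """Join per-page HTML fragments into one document with an embedded stylesheet."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{html.escape(title)}</title>",
        f"<style>{PAGE_CSS}</style>",
        "</head>",
        "<body>",
    ]
    for number, fragment in enumerate(page_fragments, start=1):
        # Fragments are inserted as-is because they are already HTML
        parts.append(f"<div class='page' id='page-{number}'>{fragment}</div>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(wrap_pages(["<p>First page</p>", "<p>Second page</p>"]))
```

Only the title is escaped here. The page fragments are trusted to be valid HTML already, so escape them yourself if they come from plain text.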
Handling Complex PDFs with Tables
PDFs often contain tables. Converting them to HTML tables is tricky. PyMuPDF can help identify table structures with page.find_tables(), which is available from version 1.23.0.
You can then build proper HTML table tags. This ensures data is structured correctly. For advanced styling, CSS can make the resulting HTML tables look like Excel spreadsheets.
Here is a basic example of extracting table data.
import html

import fitz  # PyMuPDF


def extract_tables_to_html(pdf_path):
    doc = fitz.open(pdf_path)
    all_tables_html = []
    for page in doc:
        # Find tables on the page (this is a simplified approach)
        tabs = page.find_tables()
        if tabs.tables:
            for table in tabs.tables:
                html_table = ["<table border='1'>"]
                for row in table.extract():
                    html_table.append("<tr>")
                    for cell in row:
                        # Empty cells come back as None; escape the rest so
                        # characters like < and & do not break the markup
                        cell_text = html.escape(cell) if cell else ""
                        html_table.append(f"<td>{cell_text}</td>")
                    html_table.append("</tr>")
                html_table.append("</table>")
                all_tables_html.append("\n".join(html_table))
    doc.close()
    return "\n<br>\n".join(all_tables_html)


# Get the HTML for tables
if __name__ == "__main__":
    table_html = extract_tables_to_html("sample_with_tables.pdf")
    print(table_html)
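Many tables have a header row. The sketch below shows one way to render extracted rows with <th> cells for that row. It assumes rows in the shape handled above: a list of rows, each a list of strings, with None for empty cells. Treating the first row as the header is an assumption that will not hold for every PDF, and rows_to_html_table is a hypothetical helper name.

```python
import html


def rows_to_html_table(rows, header=True):
    """Render a list of rows (lists of str or None) as an HTML table."""
    lines = ["<table border='1'>"]
    for index, row in enumerate(rows):
        # Use <th> for the first row when header=True, <td> otherwise
        tag = "th" if header and index == 0 else "td"
        cells = "".join(
            f"<{tag}>{html.escape(cell) if cell else ''}</{tag}>" for cell in row
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [["Item", "Price"], ["Tea & milk", "3"], ["Bread", None]]
    print(rows_to_html_table(sample))
```

Because the function takes plain lists, you can test it without a PDF and pass it the result of table.extract() later.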
Choosing the Right Method
Which method should you use? It depends on your goal.
Use pdf2htmlEX for pixel-perfect conversion. It is best for flyers or complex layouts. The output is a single, large HTML file with embedded CSS.
Use PyMuPDF for data extraction. It is best for text, tables, and images. You get clean data to put into your own HTML template. This is great for web apps.
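To illustrate the template idea, here is a minimal sketch that drops extracted plain text into an HTML template using only the standard library. The template, the render_text name, and the rule that blank lines separate paragraphs are all assumptions made for the example. Text extracted from a real PDF may need different splitting.

```python
import html
from string import Template

# A deliberately small template; a web app would usually load its own
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n<head><title>$title</title></head>\n"
    "<body>\n<h1>$title</h1>\n$body\n</body>\n</html>"
)


def render_text(title, text):
    """Turn plain text into escaped <p> elements and fill the template."""
    # Treat blank lines as paragraph breaks and drop empty chunks
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return PAGE_TEMPLATE.substitute(title=html.escape(title), body=body)


if __name__ == "__main__":
    print(render_text("Sample Report", "First paragraph.\n\nSecond paragraph."))
```

Escaping both the title and the paragraphs keeps characters such as < and & from breaking the page.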
Common Challenges and Solutions
You might face some issues. Here are common ones.
Missing Fonts: The HTML might look wrong. pdf2htmlEX embeds fonts. PyMuPDF does not. For PyMuPDF, use web-safe fonts in your CSS.
Complex Layouts: Columns and sidebars can break. pdf2htmlEX handles them well. With PyMuPDF, you may need to process the text blocks manually.
Large Files: Conversion can be slow. Split the PDF into pages. Convert one page at a time. This helps with memory.
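For large files, one possible approach is to convert a few pages per run. The sketch below computes page ranges and builds one pdf2htmlEX command per range with its --first-page and --last-page options. The helper names and the output file naming are hypothetical, and the total page count has to come from elsewhere, for example len(doc) in PyMuPDF.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (first, last) page numbers, 1-based and inclusive."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def chunk_command(pdf_path, output_dir, first, last):
    """Build a pdf2htmlEX command that converts only pages first..last."""
    return [
        "pdf2htmlEX",
        "--first-page", str(first),
        "--last-page", str(last),
        "--dest-dir", output_dir,
        pdf_path,
        f"pages_{first}-{last}.html",  # hypothetical naming scheme
    ]


if __name__ == "__main__":
    for first, last in page_ranges(total_pages=10, chunk_size=4):
        print(" ".join(chunk_command("big_report.pdf", "output", first, last)))
```

Each command can then be passed to subprocess.run() in turn, so only one chunk is processed at a time.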
Conclusion
Converting PDF to HTML with Python is straightforward. You have two excellent paths.
For visual fidelity, use pdf2htmlEX via subprocess. For data control, use PyMuPDF to extract and build HTML.
Start with a simple PDF. Try both methods. See which output fits your project. Automating this task saves time and makes your content web-ready.
Remember, the goal is usable web content. Whether it's a report or an article, HTML makes it accessible to all. Happy coding!