Last modified: Feb 07, 2026 By Alexander Williams

Convert PDF to HTML with Python | Guide

Need to display PDF content on a website? Converting PDF to HTML is the solution. Python makes this task efficient and programmable.

This guide covers the best tools and methods. You will learn to extract text, images, and preserve formatting for the web.

Why Convert PDF to HTML?

PDFs are great for printing. They are not ideal for the web. HTML is the native language of browsers.

Converting to HTML improves accessibility. It makes content searchable and responsive. It also allows for easier interaction.

Common use cases include publishing reports, archiving documents, and creating web-friendly content from existing PDFs.

Core Python Libraries for PDF to HTML

Several Python libraries can handle PDF to HTML conversion. Each has different strengths.

PyMuPDF (fitz) is powerful for rendering and text extraction. pdf2htmlEX is a dedicated tool for high-fidelity conversion. pdfminer.six excels at detailed text and layout analysis.

Your choice depends on your needs. Do you need perfect visual layout or just the textual content?

Method 1: Using pdf2htmlEX via Python

pdf2htmlEX is a standalone tool. It creates HTML that looks very similar to the original PDF. You can call it from Python using the subprocess module.

First, ensure pdf2htmlEX is installed on your system. Then, you can run it from a Python script.


import subprocess
import os

def convert_pdf2htmlex(pdf_path, output_dir="output"):
    """
    Converts a PDF to HTML using the pdf2htmlEX command-line tool.
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Define the output HTML file path
    html_filename = os.path.splitext(os.path.basename(pdf_path))[0] + ".html"
    html_path = os.path.join(output_dir, html_filename)

    # Build the command
    # --zoom 1.5 adjusts the scale, --dest-dir sets output folder
    cmd = [
        "pdf2htmlEX",
        "--zoom", "1.5",
        "--dest-dir", output_dir,
        pdf_path,
        html_filename
    ]

    try:
        # Execute the command
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"Success! HTML saved to: {html_path}")
        print(result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed with error: {e.stderr}")
    except FileNotFoundError:
        print("Error: pdf2htmlEX not found. Please install it.")

# Example usage
if __name__ == "__main__":
    convert_pdf2htmlex("sample_report.pdf")

This method is excellent for visual fidelity. The resulting HTML uses CSS and embedded images to mimic the PDF page.

Method 2: Extracting Text and Structure with PyMuPDF

PyMuPDF lets you extract text blocks and images. You can then structure this data into your own HTML. This offers more control over the final output.

You can use it to build a simple HTML page from the PDF's content. For more advanced PDF reading, see our Python PDF Reader Guide.


import fitz  # PyMuPDF

def convert_pymupdf_to_html(pdf_path, output_html="output_pymupdf.html"):
    """
    Extracts text from a PDF and wraps it in a basic HTML structure.
    """
    doc = fitz.open(pdf_path)
    html_content = ["
", "", "Converted PDF", ""]

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text("html")  # Get text in a simple HTML format

        html_content.append(f"Page {page_num + 1}")
        html_content.append("")
        html_content.append(text)
        html_content.append("")
        html_content.append("")

    html_content.append("")

    # Write the HTML to a file
    with open(output_html, "w", encoding="utf-8") as f:
        f.write("\n".join(html_content))

    print(f"Basic HTML file created: {output_html}")
    doc.close()

# Example usage
if __name__ == "__main__":
    convert_pymupdf_to_html("sample_document.pdf")

The page.get_text("html") method returns text with basic paragraph tags. This is a quick way to get readable content.

Method 3: Advanced Layout with pdfminer.six

pdfminer.six provides deep access to PDF layout objects. You can use it to create a more semantically accurate HTML file.

It can distinguish between paragraphs, headings, and tables. This is useful for data extraction, similar to techniques in our Python PDF Parser Guide.


from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure
import html

def convert_pdfminer_to_html(pdf_path, output_html="output_pdfminer.html"):
    """
    Uses pdfminer.six to extract text and layout, generating structured HTML.
    """
    html_lines = [
        '
',
        'PDF Extract',
        ''
    ]

    for page_layout in extract_pages(pdf_path):
        html_lines.append('')
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # Escape HTML special characters in the text
                text = html.escape(element.get_text().strip())
                if text:  # Only add non-empty text blocks
                    # Simple heuristic: treat short lines as potential headings
                    if len(text) < 100 and text.isupper():
                        html_lines.append(f'{text}')
                    else:
                        html_lines.append(f'{text}')
            # LTFigure objects could be images (handling omitted for brevity)
        html_lines.append('
')

    html_lines.append('')

    with open(output_html, 'w', encoding='utf-8') as f:
        f.write('\n'.join(html_lines))

    print(f"Structured HTML file created: {output_html}")

# Example usage
if __name__ == "__main__":
    convert_pdfminer_to_html("data_report.pdf")

This script creates a more nuanced HTML structure. It attempts to identify headings based on text length and case.

Handling Images and Complex Layouts

PDFs often contain images and multi-column layouts. These are challenging to convert.

For image extraction, PyMuPDF is very effective. You can save images and then reference them in your HTML with <img> tags.

Complex layouts may require a hybrid approach. Use pdf2htmlEX for visual accuracy or a custom script with PyMuPDF for control.

If you need to work with images directly, our guide on Python PDF to Image Conversion can be a useful first step.

Best Practices for Conversion

Follow these tips for better results. Always check the source PDF quality. A scanned PDF (image-based) requires OCR before text extraction.

Choose the right tool for the job. Use pdf2htmlEX for web publishing. Use PyMuPDF or pdfminer.six for data extraction and custom pipelines.

Clean the extracted text. Remove extra whitespace and non-printable characters. Validate your output HTML to ensure it's well-formed.

Consider performance for batch processing. PyMuPDF is generally faster than pdfminer.six for large documents.

Common Challenges and Solutions

You might encounter these issues. Missing text often means the PDF is scanned. Use an OCR library like Tesseract alongside PyMuPDF.

Poor layout preservation is common. pdf2htmlEX handles this best. For other libraries, you may need to write complex CSS based on extracted coordinates.

Large file size can happen if images are embedded at high resolution. Compress images after extraction before adding them to HTML.

Conclusion

Converting PDF to HTML with Python opens many doors. It makes static documents web-friendly and interactive.

The best method depends on your goal. Use pdf2htmlEX for visual fidelity. Use PyMuPDF for programmatic control and text extraction. Use pdfminer.six for deep layout analysis.

Start with a simple PDF. Experiment with the code examples above. You will soon be able to integrate PDF conversion into your own applications.

Remember, this process is often the first step in a larger workflow, such as content migration or data analysis.