Last modified: Nov 10, 2025 By Alexander Williams

Convert HTML to DOCX with Python

HTML to DOCX conversion is a common need. Python makes this process efficient. The python-docx library is perfect for this task.

This tutorial covers the complete workflow. You will learn to parse HTML content. Then convert it to formatted Word documents.

Setting Up Your Environment

First, install the required libraries. You need python-docx and BeautifulSoup. These handle DOCX creation and HTML parsing.


pip install python-docx beautifulsoup4

The installation is straightforward. These libraries have no major dependencies. They work on all major operating systems.

Basic HTML to DOCX Conversion

Let's start with a simple conversion. We'll extract text from HTML. Then add it to a Word document.

 
from docx import Document
from bs4 import BeautifulSoup

def html_to_docx_basic(html_content, output_path):
    # Create a new Document
    doc = Document()
    
    # Parse HTML content
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Extract text from paragraphs
    for p in soup.find_all('p'):
        doc.add_paragraph(p.get_text())
    
    # Save the document
    doc.save(output_path)

# Example usage
html_sample = """

This is the first paragraph.

This is the second paragraph with bold text.

""" html_to_docx_basic(html_sample, 'basic_output.docx')

This basic function extracts paragraph text. It ignores formatting for now. The output is a simple Word document.

Handling Text Formatting

HTML contains rich formatting. We need to preserve bold, italic, and underline. Python-docx supports inline formatting.

Learn more about Python docx Inline Formatting: Bold, Italic, Underline for detailed techniques.

 
def add_formatted_text(paragraph, element):
    for content in element.contents:
        if content.name is None:
            # Regular text
            paragraph.add_run(content)
        elif content.name == 'strong':
            # Bold text
            run = paragraph.add_run(content.get_text())
            run.bold = True
        elif content.name == 'em':
            # Italic text
            run = paragraph.add_run(content.get_text())
            run.italic = True
        elif content.name == 'u':
            # Underline text
            run = paragraph.add_run(content.get_text())
            run.underline = True

def html_to_docx_formatted(html_content, output_path):
    doc = Document()
    soup = BeautifulSoup(html_content, 'html.parser')
    
    for p in soup.find_all('p'):
        paragraph = doc.add_paragraph()
        add_formatted_text(paragraph, p)
    
    doc.save(output_path)

# Test with formatted HTML
formatted_html = """

Normal text with bold, italic, and underlined words.

""" html_to_docx_formatted(formatted_html, 'formatted_output.docx')

The add_formatted_text function handles different HTML tags. It creates appropriate runs in the document.

Working with Headings and Structure

HTML documents have hierarchical structure. Headings are particularly important. Python-docx supports heading levels.

 
def html_to_docx_structured(html_content, output_path):
    doc = Document()
    soup = BeautifulSoup(html_content, 'html.parser')
    
    for element in soup.find_all(['h1', 'h2', 'h3', 'p']):
        if element.name == 'h1':
            doc.add_heading(element.get_text(), level=1)
        elif element.name == 'h2':
            doc.add_heading(element.get_text(), level=2)
        elif element.name == 'h3':
            doc.add_heading(element.get_text(), level=3)
        elif element.name == 'p':
            paragraph = doc.add_paragraph()
            add_formatted_text(paragraph, element)
    
    doc.save(output_path)

# Example with headings
structured_html = """

Main Title

Section Heading

Regular paragraph text.

Subsection Heading

Another paragraph with important content.

""" html_to_docx_structured(structured_html, 'structured_output.docx')

The add_heading method creates proper headings. This maintains document structure and readability.

Adding Lists and Tables

HTML lists and tables need special handling. Python-docx provides methods for these elements. Let's implement list conversion.

 
def process_lists(doc, soup):
    for ul in soup.find_all('ul'):
        for li in ul.find_all('li'):
            doc.add_paragraph(li.get_text(), style='List Bullet')
    
    for ol in soup.find_all('ol'):
        for li in ol.find_all('li'):
            doc.add_paragraph(li.get_text(), style='List Number')

def html_to_docx_with_lists(html_content, output_path):
    doc = Document()
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Process regular content
    html_to_docx_structured(html_content, output_path)
    
    # Process lists
    process_lists(doc, soup)
    
    doc.save(output_path)

# Example with lists
list_html = """

Project Requirements

  • User authentication system
  • Data export functionality
  • Report generation
  1. Design phase
  2. Development phase
  3. Testing phase
""" html_to_docx_with_lists(list_html, 'list_output.docx')

The process_lists function handles both ordered and unordered lists. It applies appropriate Word styles.

Advanced Features and Hyperlinks

Modern documents often contain hyperlinks. Python-docx can handle these effectively. Links maintain their functionality in Word.

For comprehensive link handling, see Add Hyperlinks to docx in Python Tutorial.

 
def process_hyperlinks(paragraph, element):
    for content in element.contents:
        if content.name == 'a':
            # Add hyperlink
            run = paragraph.add_run(content.get_text())
            run.style = 'Hyperlink'
            # Note: Actual hyperlink addition requires additional handling
        else:
            if hasattr(content, 'name') and content.name is None:
                paragraph.add_run(content)

def html_to_docx_advanced(html_content, output_path):
    doc = Document()
    soup = BeautifulSoup(html_content, 'html.parser')
    
    for p in soup.find_all('p'):
        paragraph = doc.add_paragraph()
        process_hyperlinks(paragraph, p)
    
    doc.save(output_path)

Hyperlinks require special attention. The above code demonstrates basic link processing.

Error Handling and Best Practices

Robust conversion requires proper error handling. Common issues include malformed HTML and encoding problems.

Learn about Fix Common python-docx Errors to troubleshoot effectively.

 
def safe_html_to_docx(html_content, output_path):
    try:
        doc = Document()
        soup = BeautifulSoup(html_content, 'html.parser')
        
        if not soup.find():
            doc.add_paragraph("No valid HTML content found")
        else:
            html_to_docx_structured(html_content, output_path)
        
        doc.save(output_path)
        return True
    except Exception as e:
        print(f"Conversion error: {e}")
        return False

# Safe conversion example
result = safe_html_to_docx(html_sample, 'safe_output.docx')
if result:
    print("Conversion successful")
else:
    print("Conversion failed")

The safe_html_to_docx function includes error handling. It gracefully manages conversion failures.

Performance Considerations

Large HTML documents can impact performance. Optimize your conversion code for better speed. Use efficient parsing strategies.

Batch processing helps with multiple documents. Consider caching for repeated conversions.

Conclusion

HTML to DOCX conversion with Python is powerful. The python-docx library provides excellent tools. You can create professional documents automatically.

Start with basic text extraction. Then add formatting and structural elements. Finally, implement error handling and optimization.

This workflow saves time and ensures consistency. It's perfect for reports, documentation, and automated content generation.

Remember to test with various HTML inputs. Handle edge cases for robust production use.