Last modified: Nov 10, 2025 By Alexander Williams
Convert HTML to DOCX with Python
HTML to DOCX conversion is a common need. Python makes this process efficient. The python-docx library is perfect for this task.
This tutorial covers the complete workflow. You will learn to parse HTML content. Then convert it to formatted Word documents.
Setting Up Your Environment
First, install the required libraries. You need python-docx and BeautifulSoup. These handle DOCX creation and HTML parsing.
pip install python-docx beautifulsoup4
The installation is straightforward. These libraries have no major dependencies. They work on all major operating systems.
Basic HTML to DOCX Conversion
Let's start with a simple conversion. We'll extract text from HTML. Then add it to a Word document.
from docx import Document
from bs4 import BeautifulSoup
def html_to_docx_basic(html_content, output_path):
# Create a new Document
doc = Document()
# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Extract text from paragraphs
for p in soup.find_all('p'):
doc.add_paragraph(p.get_text())
# Save the document
doc.save(output_path)
# Example usage
html_sample = """
This is the first paragraph.
This is the second paragraph with bold text.
"""
html_to_docx_basic(html_sample, 'basic_output.docx')
This basic function extracts paragraph text. It ignores formatting for now. The output is a simple Word document.
Handling Text Formatting
HTML contains rich formatting. We need to preserve bold, italic, and underline. Python-docx supports inline formatting.
Learn more about Python docx Inline Formatting: Bold, Italic, Underline for detailed techniques.
def add_formatted_text(paragraph, element):
for content in element.contents:
if content.name is None:
# Regular text
paragraph.add_run(content)
elif content.name == 'strong':
# Bold text
run = paragraph.add_run(content.get_text())
run.bold = True
elif content.name == 'em':
# Italic text
run = paragraph.add_run(content.get_text())
run.italic = True
elif content.name == 'u':
# Underline text
run = paragraph.add_run(content.get_text())
run.underline = True
def html_to_docx_formatted(html_content, output_path):
doc = Document()
soup = BeautifulSoup(html_content, 'html.parser')
for p in soup.find_all('p'):
paragraph = doc.add_paragraph()
add_formatted_text(paragraph, p)
doc.save(output_path)
# Test with formatted HTML
formatted_html = """
Normal text with bold, italic, and underlined words.
"""
html_to_docx_formatted(formatted_html, 'formatted_output.docx')
The add_formatted_text function handles different HTML tags. It creates appropriate runs in the document.
Working with Headings and Structure
HTML documents have hierarchical structure. Headings are particularly important. Python-docx supports heading levels.
def html_to_docx_structured(html_content, output_path):
doc = Document()
soup = BeautifulSoup(html_content, 'html.parser')
for element in soup.find_all(['h1', 'h2', 'h3', 'p']):
if element.name == 'h1':
doc.add_heading(element.get_text(), level=1)
elif element.name == 'h2':
doc.add_heading(element.get_text(), level=2)
elif element.name == 'h3':
doc.add_heading(element.get_text(), level=3)
elif element.name == 'p':
paragraph = doc.add_paragraph()
add_formatted_text(paragraph, element)
doc.save(output_path)
# Example with headings
structured_html = """
Main Title
Section Heading
Regular paragraph text.
Subsection Heading
Another paragraph with important content.
"""
html_to_docx_structured(structured_html, 'structured_output.docx')
The add_heading method creates proper headings. This maintains document structure and readability.
Adding Lists and Tables
HTML lists and tables need special handling. Python-docx provides methods for these elements. Let's implement list conversion.
def process_lists(doc, soup):
for ul in soup.find_all('ul'):
for li in ul.find_all('li'):
doc.add_paragraph(li.get_text(), style='List Bullet')
for ol in soup.find_all('ol'):
for li in ol.find_all('li'):
doc.add_paragraph(li.get_text(), style='List Number')
def html_to_docx_with_lists(html_content, output_path):
doc = Document()
soup = BeautifulSoup(html_content, 'html.parser')
# Process regular content
html_to_docx_structured(html_content, output_path)
# Process lists
process_lists(doc, soup)
doc.save(output_path)
# Example with lists
list_html = """
Project Requirements
- User authentication system
- Data export functionality
- Report generation
- Design phase
- Development phase
- Testing phase
"""
html_to_docx_with_lists(list_html, 'list_output.docx')
The process_lists function handles both ordered and unordered lists. It applies appropriate Word styles.
Advanced Features and Hyperlinks
Modern documents often contain hyperlinks. Python-docx can handle these effectively. Links maintain their functionality in Word.
For comprehensive link handling, see Add Hyperlinks to docx in Python Tutorial.
def process_hyperlinks(paragraph, element):
for content in element.contents:
if content.name == 'a':
# Add hyperlink
run = paragraph.add_run(content.get_text())
run.style = 'Hyperlink'
# Note: Actual hyperlink addition requires additional handling
else:
if hasattr(content, 'name') and content.name is None:
paragraph.add_run(content)
def html_to_docx_advanced(html_content, output_path):
doc = Document()
soup = BeautifulSoup(html_content, 'html.parser')
for p in soup.find_all('p'):
paragraph = doc.add_paragraph()
process_hyperlinks(paragraph, p)
doc.save(output_path)
Hyperlinks require special attention. The above code demonstrates basic link processing.
Error Handling and Best Practices
Robust conversion requires proper error handling. Common issues include malformed HTML and encoding problems.
Learn about Fix Common python-docx Errors to troubleshoot effectively.
def safe_html_to_docx(html_content, output_path):
try:
doc = Document()
soup = BeautifulSoup(html_content, 'html.parser')
if not soup.find():
doc.add_paragraph("No valid HTML content found")
else:
html_to_docx_structured(html_content, output_path)
doc.save(output_path)
return True
except Exception as e:
print(f"Conversion error: {e}")
return False
# Safe conversion example
result = safe_html_to_docx(html_sample, 'safe_output.docx')
if result:
print("Conversion successful")
else:
print("Conversion failed")
The safe_html_to_docx function includes error handling. It gracefully manages conversion failures.
Performance Considerations
Large HTML documents can impact performance. Optimize your conversion code for better speed. Use efficient parsing strategies.
Batch processing helps with multiple documents. Consider caching for repeated conversions.
Conclusion
HTML to DOCX conversion with Python is powerful. The python-docx library provides excellent tools. You can create professional documents automatically.
Start with basic text extraction. Then add formatting and structural elements. Finally, implement error handling and optimization.
This workflow saves time and ensures consistency. It's perfect for reports, documentation, and automated content generation.
Remember to test with various HTML inputs. Handle edge cases for robust production use.