Last modified: Nov 08, 2025 By Alexander Williams

Python-docx Tutorial: Read and Parse docx Content

Working with Word documents is common in business and data processing. Python's python-docx library makes it easy to read and parse docx files programmatically.

You will learn to extract text, tables, paragraphs, and formatting from Word documents. This skill is valuable for data extraction, document analysis, and automation tasks.

Installing python-docx Library

First, you need to install the python-docx library. Use pip, Python's package manager. The installation is straightforward and quick.


pip install python-docx

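This command downloads and installs the latest version from PyPI. The package is installed as python-docx but imported as docx. The minimum supported Python version depends on the release, so check the project's PyPI page if you are on an older interpreter. The library supports both reading and creating Word documents.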
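To confirm that the install succeeded, you can ask Python which version it sees. The snippet below is a minimal sketch using the standard library's importlib.metadata (Python 3.8 and later). The helper name installed_version is our own and is not part of python-docx.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="python-docx"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if python-docx is installed, otherwise None
print(installed_version())
```

If this prints None, pip probably installed the package into a different Python environment than the one running the script.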

Loading a Document

To start working with a Word document, you need to load it. Use the Document class from the python-docx module. Pass the file path to the constructor.


from docx import Document

# Load an existing document
doc = Document('example.docx')

This code creates a Document object. It represents the entire Word document. You can now access its content and structure.

If the path does not point to a valid .docx package, python-docx raises docx.opc.exceptions.PackageNotFoundError, so always make sure the file path is correct. You can also create a new, empty document by calling Document() without arguments.

Reading Paragraphs

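Paragraphs are the basic text containers in Word documents. The paragraphs property returns the paragraphs of the document body, in order, as a list of Paragraph objects. Text inside tables, headers, and footers is not included and has to be read separately.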


# Read all paragraphs
for paragraph in doc.paragraphs:
    print(paragraph.text)

Welcome to our document
This is the first paragraph.
This is the second paragraph.
Document content ends here.

Each paragraph object has a text attribute. This contains the paragraph's text content. Empty paragraphs return empty strings.

You can access specific paragraphs by index. For example, doc.paragraphs[0] gets the first paragraph. Remember that indexing starts at zero.

Extracting Text Formatting

Beyond plain text, you can access formatting information. Each paragraph contains runs. Runs are text segments with consistent formatting.


# Access runs and their formatting
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        print(f"Text: {run.text}")
        print(f"Bold: {run.bold}")
        print(f"Italic: {run.italic}")
        print(f"Font: {run.font.name}")

Text: Welcome to our document
Bold: True
Italic: False
Font: Calibri
Text: This is normal text.
Bold: None
Italic: None
Font: None

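Runs help you understand how text is styled. You can check for bold, italic, underline, and font properties. A value of None, as in the second run above, means the property is not set directly on the run and is inherited from the paragraph or document style. This is useful for processing formatted documents.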

Reading Tables from Documents

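Tables are common in Word documents. The tables property gives access to the top-level tables in the document body. A table nested inside a cell is reached through that cell's own tables property. Each table has rows and cells.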


# Read table data
for i, table in enumerate(doc.tables):
    print(f"Table {i+1}:")
    for row in table.rows:
        row_data = [cell.text for cell in row.cells]
        print(row_data)

Table 1:
['Name', 'Age', 'City']
['John', '25', 'New York']
['Sarah', '30', 'London']
['Mike', '35', 'Tokyo']

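This code extracts all table data into a structured format. You can process this data further with pandas or other libraries. Tables maintain their row and cell structure, but when cells are merged, python-docx returns the same cell once for each grid position it spans, so its text can appear more than once in a row.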

Accessing Document Properties

Word documents contain metadata called core properties. These include title, author, and creation date. You can access these using the core_properties attribute.


# Access document properties
props = doc.core_properties
print(f"Title: {props.title}")
print(f"Author: {props.author}")
print(f"Created: {props.created}")
print(f"Modified: {props.modified}")

Title: Sample Document
Author: John Smith
Created: 2023-10-15 09:30:00
Modified: 2023-10-16 14:20:00

Document properties help with organization and tracking. They're useful for document management systems. You can also modify these properties programmatically.

Working with Sections and Page Layout

Documents are divided into sections. Each section can have different page setup. This includes margins, orientation, and page size.

You can access section information through the sections property. This is useful for understanding document structure. For detailed page setup, check our Python-docx Page Setup: Margins, Orientation, Layout guide.

Reading Headers and Footers

Headers and footers contain repeated content on each page. Python-docx provides access to these elements. You can extract text from different header and footer types.


# Read header content
for section in doc.sections:
    header = section.header
    for paragraph in header.paragraphs:
        print(f"Header: {paragraph.text}")

Headers and footers can vary by section. Some documents have different first page headers. Our Python-docx Headers Footers Guide covers this in detail.

Handling Images and Other Objects

Word documents can contain images, charts, and other objects. While python-docx is primarily text-focused, you can detect inline shapes through the document's inline_shapes property. However, extracting image data requires additional processing.

For adding images to documents, see our guide on Add Images to docx Using Python-docx. This covers the creation side of working with visual elements.
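One way to do that additional processing uses only the standard library. A .docx file is a ZIP archive, and Word typically stores embedded pictures under word/media/ inside it. That layout is a convention assumed here, not something python-docx guarantees, and extract_embedded_media is a hypothetical helper. The demo builds a stand-in archive in memory so the snippet runs without a real document.

```python
import io
import zipfile


def extract_embedded_media(docx_file):
    """Return a dict mapping each file under word/media/ to its raw bytes.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("word/media/")
        }


# Stand-in archive that mimics the assumed layout of a .docx file
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", "<placeholder/>")
    archive.writestr("word/media/image1.png", b"fake-image-bytes")
demo.seek(0)

media = extract_embedded_media(demo)
for name, data in media.items():
    print(name, len(data), "bytes")
```

With a real document, pass its path instead, for example extract_embedded_media('example.docx'), and write each value to disk to save the images.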

Practical Example: Document Analysis

Let's combine these concepts into a practical example. We'll analyze a document and extract key information.


from docx import Document


def analyze_document(file_path):
    doc = Document(file_path)
    
    # Basic statistics
    stats = {
        'paragraph_count': len(doc.paragraphs),
        'table_count': len(doc.tables),
        'section_count': len(doc.sections),
        'bold_text': [],
        'table_data': []
    }
    
    # Find runs with bold applied directly (run.bold is None when bold is inherited)
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            if run.bold and run.text.strip():
                stats['bold_text'].append(run.text)
    
    # Extract table data
    for table in doc.tables:
        table_content = []
        for row in table.rows:
            table_content.append([cell.text for cell in row.cells])
        stats['table_data'].append(table_content)
    
    return stats

# Usage
results = analyze_document('sample.docx')
print(f"Paragraphs: {results['paragraph_count']}")
print(f"Tables: {results['table_count']}")
print(f"Bold sections: {results['bold_text']}")

Paragraphs: 15
Tables: 2
Bold sections: ['Important Note', 'Summary', 'Conclusion']

This example shows how to extract meaningful information from documents. You can adapt this for specific use cases like report generation or data extraction.

Error Handling and Best Practices

Always include error handling when working with files. Documents might be corrupted or have unexpected structure. Use try-except blocks to handle potential issues.


try:
    doc = Document('example.docx')
    # Process document
except Exception as e:
    print(f"Error reading document: {e}")

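This prevents your program from crashing on invalid files. Catching Exception is a broad safety net; where you can, catch specific errors such as docx.opc.exceptions.PackageNotFoundError so that unrelated bugs are not hidden. You can also check if files exist before processing. Use the os.path module for file validation.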

Conclusion

Python-docx is a powerful library for reading and parsing Word documents. You can extract text, tables, formatting, and metadata. This enables document automation and data extraction tasks.

The library is well-documented and actively maintained. It handles most common Word document features. For advanced formatting, check our Python-docx Text Styling Guide.

Start by experimenting with simple documents. Gradually incorporate more complex features. Soon you'll be efficiently processing Word documents with Python.