Last modified: Nov 09, 2025 By Alexander Williams
Extract Text from docx in Python
Working with Word documents is common. Python offers several methods to extract text from DOCX files. This guide explores the best approaches.
We will cover three main libraries. These are python-docx, docx2txt, and python-docx2. Each has unique strengths for different use cases.
Why Extract Text from DOCX?
Text extraction is vital for data processing. You might need to analyze document content or automate report generation workflows.
Extracted text can feed into other systems. It enables content analysis and data migration. Understanding the options helps choose the right tool.
Method 1: Using python-docx Library
The python-docx library is popular. It allows both reading and writing Word documents. It provides fine-grained control over document elements.
First, install the library using pip. Run the command below in your terminal.
pip install python-docx
Here is a basic example to extract text. This code opens a DOCX file and reads all paragraphs.
from docx import Document
# Load the document
doc = Document('sample.docx')
# Extract text from all paragraphs
full_text = []
for paragraph in doc.paragraphs:
    full_text.append(paragraph.text)
# Join all paragraphs into a single string
text_content = '\n'.join(full_text)
print(text_content)
The output will be the plain text from your document. It preserves paragraph breaks with newline characters.
This is the first paragraph.
This is the second paragraph with some bold text.
This is the final paragraph in the document.
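Word documents often contain empty paragraphs used for spacing, and these show up as blank lines in the joined output. A minimal sketch of filtering them out follows. The helper names `non_empty_paragraphs` and `extract_clean_text` are hypothetical, and the filter only assumes that each item has a `.text` attribute, as python-docx paragraphs do.

```python
def non_empty_paragraphs(paragraphs):
    """Return the stripped text of each paragraph that has visible text."""
    texts = []
    for paragraph in paragraphs:
        text = paragraph.text.strip()
        if text:  # skip paragraphs that are empty or whitespace-only
            texts.append(text)
    return texts


def extract_clean_text(path):
    """Load a DOCX file and join its non-empty paragraphs with newlines."""
    from docx import Document  # requires python-docx to be installed

    doc = Document(path)
    return "\n".join(non_empty_paragraphs(doc.paragraphs))
```

Calling `extract_clean_text('sample.docx')` would return the same text as the example above, minus the blank lines.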
For more advanced document manipulation, see our Python-docx Text Styling Guide.
Extracting Text from Tables
DOCX files often contain tables. python-docx can extract text from table cells. You need to iterate through tables, rows, and cells.
from docx import Document
doc = Document('document_with_tables.docx')
table_text = []
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            table_text.append(cell.text)
table_content = '\n'.join(table_text)
print(table_content)
This approach collects the text of every table cell, one cell per line. For complex table operations, see our Python-docx Table Creation Best Practices Guide.
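Appending every cell to one flat list loses the row structure. The sketch below keeps one line per row, with cells separated by tabs. `table_to_lines` and `extract_tables` are hypothetical helper names, and the code relies only on the `rows`, `cells`, and `text` attributes used above. Note that python-docx returns a merged cell once for each grid position it spans, so merged cells may appear repeated in a row.

```python
def table_to_lines(table):
    """Return one tab-separated string per table row."""
    lines = []
    for row in table.rows:
        # Replace newlines inside a cell so each row stays on one line
        cells = [cell.text.replace("\n", " ").strip() for cell in row.cells]
        lines.append("\t".join(cells))
    return lines


def extract_tables(path):
    """Return a list of tables, each a list of tab-separated row strings."""
    from docx import Document  # requires python-docx to be installed

    doc = Document(path)
    return [table_to_lines(table) for table in doc.tables]
```

Tab-separated rows can be pasted into a spreadsheet or split again with `line.split('\t')`.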
Method 2: Using docx2txt Library
docx2txt is simpler than python-docx. It focuses solely on text extraction and returns the document text as a single string. It can also save embedded images to a directory you specify.
Install docx2txt using pip.
pip install docx2txt
Here is how to use docx2txt for basic extraction.
import docx2txt
# Extract text from DOCX file
text = docx2txt.process('sample.docx')
print(text)
The process function returns the entire document text. It includes content from paragraphs, tables, headers, and footers.
Document Title
This is the first paragraph.
Table Content:
Cell 1 Cell 2
Cell 3 Cell 4
Footer text appears here.
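A DOCX file is a ZIP archive whose main body is stored as XML in `word/document.xml`. To illustrate what text-only extraction involves, here is a rough standard-library sketch that reads paragraph text directly from that XML. It is not docx2txt's implementation, `read_body_text` is a hypothetical name, and it ignores headers, footers, tabs, and line breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_body_text(path_or_file):
    """Return body text from a DOCX, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    lines = []
    # Each w:p element is a paragraph (including those inside table cells);
    # its visible text is stored in w:t elements
    for paragraph in root.iter(W_NS + "p"):
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

For real documents a dedicated library remains the safer choice, since it handles the many edge cases this sketch skips.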
Method 3: Using python-docx2 Library
python-docx2 is a community-maintained fork. It offers similar functionality to python-docx. It has better support for newer Word features.
Installation is straightforward with pip.
pip install python-docx2
The usage mirrors python-docx, apart from the import name. This makes migration between libraries easy.
from docx2 import Document
doc = Document('sample.docx')
text_content = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
print(text_content)
Comparing the Three Methods
Each library has advantages. Choose based on your specific needs.
python-docx is best for detailed document manipulation. It allows access to styles, formatting, and document structure.
docx2txt is perfect for simple text extraction. It requires less code and handles complex documents well.
python-docx2 offers modern features. It is good for documents using newer Word formats.
For batch processing multiple files, see Batch Generate docx Files in Python.
Handling Complex Documents
Real-world documents have headers, footers, and footnotes. python-docx exposes headers and footers through each document section. Footnotes are not covered by this example.
from docx import Document
doc = Document('complex_document.docx')
# Extract from main document
main_text = [p.text for p in doc.paragraphs]
# Extract from headers
header_text = []
for section in doc.sections:
    header = section.header
    for p in header.paragraphs:
        header_text.append(p.text)
# Extract from footers
footer_text = []
for section in doc.sections:
    footer = section.footer
    for p in footer.paragraphs:
        footer_text.append(p.text)
print("Main content:", '\n'.join(main_text))
print("Headers:", '\n'.join(header_text))
print("Footers:", '\n'.join(footer_text))
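In multi-section documents, a section's header is often linked to the previous section's header, so looping over every section can collect the same text more than once. One possible refinement, assuming python-docx's `is_linked_to_previous` property on header and footer objects, is to skip the linked ones. `unique_header_text` is a hypothetical helper name.

```python
def unique_header_text(sections):
    """Collect header text, skipping headers inherited from the previous section."""
    lines = []
    for section in sections:
        header = section.header
        if header.is_linked_to_previous:
            continue  # inherited content (or no header at all); avoid duplicates
        # Keep only paragraphs that contain visible text
        lines.extend(p.text for p in header.paragraphs if p.text.strip())
    return lines
```

Pass `doc.sections` to the function. The same pattern should work for footers by reading `section.footer` instead of `section.header`.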
Error Handling and Best Practices
Always include error handling. Files might be missing, corrupted, or in the wrong format.
from docx import Document
from docx.opc.exceptions import PackageNotFoundError
try:
    doc = Document('nonexistent.docx')
    text = '\n'.join(p.text for p in doc.paragraphs)
    print(text)
except PackageNotFoundError:
    print("Error: File not found or invalid DOCX format")
except Exception as e:
    print(f"An error occurred: {e}")
This prevents crashes from invalid files. It provides helpful error messages for debugging.
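Because `PackageNotFoundError` covers both missing files and files that are not valid packages, it can help to check the path before opening it. The sketch below uses only the standard library. `check_docx_path` is a hypothetical helper, and passing its checks does not guarantee that the file is a well-formed Word document.

```python
import zipfile
from pathlib import Path


def check_docx_path(path):
    """Return None if the path looks like a DOCX file, else a reason string."""
    file_path = Path(path)
    if not file_path.is_file():
        return "file not found"
    if file_path.suffix.lower() != ".docx":
        return "not a .docx file"
    # DOCX files are ZIP archives, so a non-ZIP file cannot be a valid DOCX
    if not zipfile.is_zipfile(file_path):
        return "not a valid ZIP archive, so not a DOCX package"
    return None
```

A caller could print the returned reason and skip the file, keeping the `try`/`except` block as a second line of defense.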
Conclusion
Extracting text from DOCX files is essential. Python offers multiple libraries for this task.
python-docx provides the most control. docx2txt is simplest for basic needs. python-docx2 offers modern compatibility.
Choose based on your project requirements. Consider document complexity and needed features.
All methods work well for most use cases. Start with docx2txt for simple extraction. Use python-docx for advanced document processing.