Last modified: Nov 20, 2025 by Alexander Williams
Clean Excel Data with Python xlrd
Excel files often contain messy data. Python xlrd helps clean this data efficiently. This guide covers essential cleaning techniques.
Install xlrd and Import Libraries
First, install the xlrd library with pip, then import the modules your script needs. Note that xlrd 2.0 and later reads only the legacy .xls format; for .xlsx files, use a library such as openpyxl.
# Install xlrd using pip
# pip install xlrd
import xlrd
Load Excel File and Validate
Always validate your Excel file before processing. Check if the file exists and is readable. This prevents runtime errors.
def load_excel_file(file_path):
    try:
        workbook = xlrd.open_workbook(file_path)
        return workbook
    except FileNotFoundError:
        print("File not found")
        return None
    except xlrd.XLRDError:
        print("Cannot read Excel file")
        return None

# Load the workbook (xlrd 2.x reads only the legacy .xls format)
workbook = load_excel_file("data.xls")
Proper file validation ensures your script handles a missing or unreadable file gracefully instead of crashing, making the rest of the pipeline more robust.
Handle Missing and Empty Cells
Missing data is common in Excel files. Use xlrd to detect and handle empty cells appropriately in your data processing.
def find_empty_cells(sheet):
    empty_cells = []
    for row_idx in range(sheet.nrows):
        for col_idx in range(sheet.ncols):
            cell_value = sheet.cell_value(row_idx, col_idx)
            if cell_value == "":
                empty_cells.append((row_idx, col_idx))
    return empty_cells

# Get first sheet
sheet = workbook.sheet_by_index(0)
empty_cells = find_empty_cells(sheet)
print(f"Found {len(empty_cells)} empty cells")
Found 15 empty cells
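Detection alone doesn't fix the gaps. One approach, once the rows have been read into a plain list of lists, is to substitute a default value for every empty cell. This is a minimal sketch with a hypothetical `fill_empty` helper and made-up sample data:

```python
def fill_empty(rows, default=None):
    # Replace empty-string cells with a default value (None by default),
    # leaving all other values untouched
    return [[default if cell == "" else cell for cell in row]
            for row in rows]

rows = [["name", "score"],
        ["alice", 91.0],
        ["bob", ""]]  # missing score

filled = fill_empty(rows)
print(filled[2])  # ['bob', None]
```

Whether None, zero, or a sentinel string is the right default depends on how the downstream analysis treats missing values.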
Learn more about detecting empty cells in our guide on Detect Empty Cells in Excel with Python xlrd.
Clean Data Types and Convert Values
Excel cells can have mixed data types. Convert them to consistent Python types for better processing and analysis.
def clean_cell_value(cell, cell_type):
    if cell_type == xlrd.XL_CELL_EMPTY:
        return None
    elif cell_type == xlrd.XL_CELL_TEXT:
        return str(cell).strip()
    elif cell_type == xlrd.XL_CELL_NUMBER:
        return float(cell)
    elif cell_type == xlrd.XL_CELL_DATE:
        # Uses the module-level workbook's datemode to interpret the serial date
        return xlrd.xldate_as_datetime(cell, workbook.datemode)
    else:
        return str(cell)

# Process each cell with proper typing
cleaned_data = []
for row_idx in range(sheet.nrows):
    row_data = []
    for col_idx in range(sheet.ncols):
        cell = sheet.cell_value(row_idx, col_idx)
        cell_type = sheet.cell_type(row_idx, col_idx)
        cleaned_value = clean_cell_value(cell, cell_type)
        row_data.append(cleaned_value)
    cleaned_data.append(row_data)
Proper data type handling ensures accurate calculations and analysis. It prevents type-related errors in your code.
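The date branch above delegates to xlrd's converter, but the underlying arithmetic is worth understanding: an Excel serial date is a day count from a fixed epoch, with the fractional part encoding the time of day. A stdlib-only sketch, assuming the 1900 date system (datemode 0) and serials after Excel's 1900 leap-year quirk:

```python
from datetime import datetime, timedelta

def serial_to_datetime(serial):
    # In Excel's 1900 date system, day 1 is 1900-01-01, so the effective
    # epoch is 1899-12-30 (the offset also absorbs Excel's fictitious
    # 1900-02-29); the fractional part is the time of day
    epoch = datetime(1899, 12, 30)
    return epoch + timedelta(days=serial)

print(serial_to_datetime(43831.0))  # 2020-01-01 00:00:00
print(serial_to_datetime(44196.5))  # 2020-12-31 12:00:00
```

In real code, prefer xlrd's `xldate_as_datetime` with the workbook's actual `datemode`, since it also handles the 1904 date system used by some Mac-created files.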
Remove Duplicate Rows
Duplicate rows can skew your analysis. Identify and remove them using Python sets and xlrd data extraction.
def remove_duplicate_rows(sheet):
    unique_rows = []
    seen_rows = set()
    for row_idx in range(sheet.nrows):
        row_data = tuple(sheet.row_values(row_idx))
        if row_data not in seen_rows:
            seen_rows.add(row_data)
            unique_rows.append(row_data)
    return unique_rows

# Remove duplicates
unique_data = remove_duplicate_rows(sheet)
print(f"Original rows: {sheet.nrows}, After cleaning: {len(unique_data)}")
Original rows: 100, After cleaning: 95
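One refinement worth considering: if row 0 is a header, you usually want it kept in place and excluded from the comparison. A variant of the same set-based idea, as a hypothetical helper over a list of rows (sample data made up for illustration):

```python
def remove_duplicates_keep_header(rows):
    # Keep the header row untouched; dedupe only the data rows,
    # preserving their first-seen order
    header, data = rows[0], rows[1:]
    seen = set()
    unique = [header]
    for row in data:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique

rows = [["id", "city"],
        [1, "Paris"],
        [1, "Paris"],
        [2, "Lyon"]]
print(remove_duplicates_keep_header(rows))
# [['id', 'city'], [1, 'Paris'], [2, 'Lyon']]
```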
Standardize Text Data
Text data often has inconsistent formatting. Apply standardization rules for case, spacing, and special characters.
def standardize_text(text):
    if not isinstance(text, str):
        return text
    # Remove extra whitespace
    text = ' '.join(text.split())
    # Convert to title case for consistency
    text = text.title()
    # Remove special characters (keep alphanumeric and spaces)
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    return text

# Apply text standardization
for row in cleaned_data:
    for i, cell in enumerate(row):
        if isinstance(cell, str):
            row[i] = standardize_text(cell)
Text standardization improves data quality. It makes string matching and searching more reliable.
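To see the three rules combined, here is the same logic restated compactly (so the snippet runs on its own) with a couple of made-up sample inputs:

```python
def standardize_text(text):
    if not isinstance(text, str):
        return text
    text = ' '.join(text.split())  # collapse runs of whitespace
    text = text.title()            # normalize case
    # Drop anything that is not alphanumeric or a space
    return ''.join(c for c in text if c.isalnum() or c.isspace())

print(standardize_text("  new   YORK  city!! "))  # New York City
print(standardize_text(42))                       # 42 (non-strings pass through)
```

Be aware that `str.title()` lowercases acronyms ("USA" becomes "Usa"), so adjust the case rule if your data contains abbreviations that must be preserved.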
Handle Multiple Sheets Efficiently
Excel workbooks often contain multiple sheets. Process them systematically to clean data across all sheets.
def clean_all_sheets(workbook):
    cleaned_sheets = {}
    for sheet_name in workbook.sheet_names():
        sheet = workbook.sheet_by_name(sheet_name)
        sheet_data = []
        for row_idx in range(sheet.nrows):
            row_data = []
            for col_idx in range(sheet.ncols):
                cell = sheet.cell_value(row_idx, col_idx)
                cell_type = sheet.cell_type(row_idx, col_idx)
                cleaned_value = clean_cell_value(cell, cell_type)
                row_data.append(cleaned_value)
            sheet_data.append(row_data)
        cleaned_sheets[sheet_name] = sheet_data
    return cleaned_sheets

# Clean all sheets
all_cleaned_data = clean_all_sheets(workbook)
For advanced multi-sheet handling, see Work with Multiple Excel Sheets in Python xlrd.
Validate Data Quality
After cleaning, validate your data quality. Check for consistency, completeness, and business rule compliance.
def validate_cleaned_data(cleaned_data):
    validation_report = {
        'total_rows': len(cleaned_data),
        'empty_cells': 0,
        'invalid_numbers': 0,
        'text_fields': 0
    }
    for row in cleaned_data:
        for cell in row:
            if cell is None:
                validation_report['empty_cells'] += 1
            elif isinstance(cell, str):
                validation_report['text_fields'] += 1
            elif isinstance(cell, (int, float)):
                if cell < 0:  # Example business rule
                    validation_report['invalid_numbers'] += 1
    return validation_report

# Generate validation report
report = validate_cleaned_data(cleaned_data)
print("Validation Report:", report)
Validation Report: {'total_rows': 95, 'empty_cells': 8, 'invalid_numbers': 2, 'text_fields': 285}
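A report only counts problems; acting on them is a separate step. As one hedged option for the example business rule above (negative numbers are invalid), a hypothetical helper can drop any offending row from the list-of-lists structure:

```python
def drop_invalid_rows(rows):
    # Remove rows containing any negative numeric value; booleans are
    # excluded from the check because bool is a subclass of int
    def is_valid(row):
        return all(
            not (isinstance(c, (int, float))
                 and not isinstance(c, bool)
                 and c < 0)
            for c in row
        )
    return [row for row in rows if is_valid(row)]

rows = [["alice", 91.0], ["bob", -5.0], ["cara", 78.0]]
print(drop_invalid_rows(rows))  # [['alice', 91.0], ['cara', 78.0]]
```

Depending on your rules, flagging rows for manual review may be safer than silently dropping them.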
Export Cleaned Data
Save your cleaned data for further analysis. Export to CSV or other formats for use in different tools.
import csv

def export_to_csv(cleaned_data, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerows(cleaned_data)
    print(f"Data exported to {output_file}")

# Export cleaned data
export_to_csv(cleaned_data, "cleaned_data.csv")
For more export options, check Export Excel to CSV with Python xlrd.
Conclusion
Cleaning Excel data with xlrd ensures data quality. Follow these best practices for reliable results.
Always validate files first. Handle missing values properly. Standardize text formatting. Remove duplicates systematically.
These techniques create clean, analysis-ready datasets. They save time and prevent errors in downstream processes.
Remember to test your cleaning pipeline thoroughly. Document your cleaning rules for reproducibility.