Last modified: Nov 20, 2025 By Alexander Williams

Clean Excel Data with Python xlrd

Excel files often contain messy data: blank cells, mixed types, duplicate rows, and inconsistent text. The Python xlrd library helps clean this data efficiently. This guide covers the essential cleaning techniques.

Install xlrd and Import Libraries

First, install the xlrd library with pip, then import the modules you need in your script. Note that xlrd 2.0 and later reads only legacy .xls workbooks; for .xlsx files you would need a different library such as openpyxl.

 
# Install xlrd using pip
# pip install xlrd

import xlrd
from collections import defaultdict  # used later to group rows by a cleaned key

Load Excel File and Validate

Always validate your Excel file before processing. Check if the file exists and is readable. This prevents runtime errors.

 
def load_excel_file(file_path):
    try:
        workbook = xlrd.open_workbook(file_path)
        return workbook
    except FileNotFoundError:
        print("File not found")
        return None
    except xlrd.XLRDError:
        print("Cannot read Excel file")
        return None

# Load the workbook (xlrd 2.x reads only legacy .xls files)
workbook = load_excel_file("data.xls")

Wrapping the load in try/except lets the script handle a missing or unreadable file gracefully instead of crashing; the caller only needs to check for None.
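If you want to fail even earlier, a small pre-check on the path catches obvious problems before xlrd opens the file. This is a minimal sketch; validate_excel_path is a hypothetical helper, and the .xls check reflects the fact that xlrd 2.x reads only legacy workbooks.

import os

def validate_excel_path(file_path):
    # Hypothetical pre-check: confirm the path exists and looks like a legacy .xls file
    if not os.path.isfile(file_path):
        print(f"{file_path} does not exist")
        return False
    if not file_path.lower().endswith(".xls"):
        print(f"{file_path} is not a legacy .xls workbook")
        return False
    return True

# Only attempt to open files that pass the pre-check
if validate_excel_path("data.xls"):
    workbook = load_excel_file("data.xls")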

Handle Missing and Empty Cells

Missing data is common in Excel files. Use xlrd to locate empty cells first, then decide how to handle them: drop the row, fill a default, or flag them for review.

 
def find_empty_cells(sheet):
    empty_cells = []
    for row_idx in range(sheet.nrows):
        for col_idx in range(sheet.ncols):
            cell_value = sheet.cell_value(row_idx, col_idx)
            if cell_value == "":
                empty_cells.append((row_idx, col_idx))
    return empty_cells

# Get first sheet
sheet = workbook.sheet_by_index(0)
empty_cells = find_empty_cells(sheet)
print(f"Found {len(empty_cells)} empty cells")

Found 15 empty cells
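Detection is only half the job. One simple way to handle the gaps is to fill them with a sentinel value while reading the sheet. This is a sketch under the assumption that a single placeholder works for every column; real data may need column-specific defaults.

def fill_empty_cells(sheet, placeholder=None):
    # Read the sheet row by row, replacing empty cells with a placeholder value
    filled_rows = []
    for row_idx in range(sheet.nrows):
        row = []
        for col_idx in range(sheet.ncols):
            value = sheet.cell_value(row_idx, col_idx)
            row.append(placeholder if value == "" else value)
        filled_rows.append(row)
    return filled_rows

# Replace empty cells with an explicit marker
filled_data = fill_empty_cells(sheet, placeholder="N/A")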

Learn more about detecting empty cells in our guide on Detect Empty Cells in Excel with Python xlrd.

Clean Data Types and Convert Values

Excel cells can have mixed data types. Convert them to consistent Python types for better processing and analysis.

 
def clean_cell_value(cell, cell_type):
    # Empty and blank cells both become None
    if cell_type in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK):
        return None
    elif cell_type == xlrd.XL_CELL_TEXT:
        return str(cell).strip()
    elif cell_type == xlrd.XL_CELL_NUMBER:
        return float(cell)
    elif cell_type == xlrd.XL_CELL_DATE:
        # Dates need the workbook's datemode to convert correctly
        return xlrd.xldate_as_datetime(cell, workbook.datemode)
    elif cell_type == xlrd.XL_CELL_BOOLEAN:
        return bool(cell)
    else:
        return str(cell)

# Process each cell with proper typing
cleaned_data = []
for row_idx in range(sheet.nrows):
    row_data = []
    for col_idx in range(sheet.ncols):
        cell = sheet.cell_value(row_idx, col_idx)
        cell_type = sheet.cell_type(row_idx, col_idx)
        cleaned_value = clean_cell_value(cell, cell_type)
        row_data.append(cleaned_value)
    cleaned_data.append(row_data)

Consistent types make downstream calculations reliable. Keep in mind that xlrd reports every numeric cell as a float, so integer-looking values come back as 42.0 rather than 42.
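If whole numbers matter for your analysis, a small post-processing pass can convert them back. This is a sketch applied to the cleaned_data list built above.

def normalize_numbers(rows):
    # Convert floats with no fractional part back to ints; leave everything else alone
    for row in rows:
        for i, cell in enumerate(row):
            if isinstance(cell, float) and cell.is_integer():
                row[i] = int(cell)
    return rows

cleaned_data = normalize_numbers(cleaned_data)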

Remove Duplicate Rows

Duplicate rows can skew your analysis. Identify and remove them using Python sets and xlrd data extraction.

 
def remove_duplicate_rows(sheet):
    unique_rows = []
    seen_rows = set()
    
    for row_idx in range(sheet.nrows):
        row_data = tuple(sheet.row_values(row_idx))
        if row_data not in seen_rows:
            seen_rows.add(row_data)
            unique_rows.append(row_data)
    
    return unique_rows

# Remove duplicates
unique_data = remove_duplicate_rows(sheet)
print(f"Original rows: {sheet.nrows}, After cleaning: {len(unique_data)}")

Original rows: 100, After cleaning: 95
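remove_duplicate_rows compares raw cell values straight from the sheet, so rows that differ only in stray whitespace still look distinct. A variant that works on cleaned_data, whose text cells are already stripped, also collapses those near-duplicates; this is a sketch, not part of xlrd itself.

def remove_duplicate_cleaned_rows(rows):
    # Deduplicate rows after cleaning, so stripped text values compare equal
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

cleaned_data = remove_duplicate_cleaned_rows(cleaned_data)
print(f"Cleaned rows after deduplication: {len(cleaned_data)}")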

Standardize Text Data

Text data often has inconsistent formatting. Apply standardization rules for case, spacing, and special characters.

 
def standardize_text(text):
    if not isinstance(text, str):
        return text
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Convert to title case for consistency
    text = text.title()
    
    # Remove special characters (keep alphanumeric and spaces)
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    
    return text

# Apply text standardization
for row in cleaned_data:
    for i, cell in enumerate(row):
        if isinstance(cell, str):
            row[i] = standardize_text(cell)

Text standardization improves data quality: once formatting is consistent, exact string matching and grouping become reliable, as the grouping sketch below shows.
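The sketch below groups cleaned rows by the value in their first column using the defaultdict imported earlier. Treating column 0 as the grouping key is just an assumption for illustration; a header row, if present, would be grouped like any other row.

# Group cleaned rows by the standardized value in the first column
groups = defaultdict(list)
for row in cleaned_data:
    if row and isinstance(row[0], str):
        groups[row[0]].append(row)

# Summarize how many rows share each key
for key, rows in groups.items():
    print(f"{key}: {len(rows)} row(s)")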

Handle Multiple Sheets Efficiently

Excel workbooks often contain multiple sheets. Process them systematically to clean data across all sheets.

 
def clean_all_sheets(workbook):
    cleaned_sheets = {}
    
    for sheet_name in workbook.sheet_names():
        sheet = workbook.sheet_by_name(sheet_name)
        sheet_data = []
        
        for row_idx in range(sheet.nrows):
            row_data = []
            for col_idx in range(sheet.ncols):
                cell = sheet.cell_value(row_idx, col_idx)
                cell_type = sheet.cell_type(row_idx, col_idx)
                cleaned_value = clean_cell_value(cell, cell_type)
                row_data.append(cleaned_value)
            sheet_data.append(row_data)
        
        cleaned_sheets[sheet_name] = sheet_data
    
    return cleaned_sheets

# Clean all sheets
all_cleaned_data = clean_all_sheets(workbook)

For advanced multi-sheet handling, see Work with Multiple Excel Sheets in Python xlrd.

Validate Data Quality

After cleaning, validate your data quality. Check for consistency, completeness, and business rule compliance.

 
def validate_cleaned_data(cleaned_data):
    validation_report = {
        'total_rows': len(cleaned_data),
        'empty_cells': 0,
        'invalid_numbers': 0,
        'text_fields': 0
    }
    
    for row in cleaned_data:
        for cell in row:
            if cell is None:
                validation_report['empty_cells'] += 1
            elif isinstance(cell, str):
                validation_report['text_fields'] += 1
            elif isinstance(cell, (int, float)):
                if cell < 0:  # Example business rule
                    validation_report['invalid_numbers'] += 1
    
    return validation_report

# Generate validation report
report = validate_cleaned_data(cleaned_data)
print("Validation Report:", report)

Validation Report: {'total_rows': 95, 'empty_cells': 8, 'invalid_numbers': 2, 'text_fields': 285}
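The report can also drive a simple pass/fail check. The sketch below computes a completeness ratio from cleaned_data and warns when it drops below a threshold; the 95% cutoff is an arbitrary example, not a rule from xlrd.

# Flag the dataset if too many cells are empty (threshold is an example value)
total_cells = sum(len(row) for row in cleaned_data)
completeness = 1 - report['empty_cells'] / total_cells if total_cells else 0
if completeness < 0.95:
    print(f"Warning: only {completeness:.1%} of cells are populated")
else:
    print(f"Data completeness: {completeness:.1%}")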

Export Cleaned Data

Save your cleaned data for further analysis. Export to CSV or other formats for use in different tools.

 
import csv

def export_to_csv(cleaned_data, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerows(cleaned_data)
    
    print(f"Data exported to {output_file}")

# Export cleaned data
export_to_csv(cleaned_data, "cleaned_data.csv")
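If you cleaned every sheet with clean_all_sheets, the same writer can export each one to its own file. This is a short sketch; the cleaned_<sheet name>.csv pattern is an assumption, and sheet names containing characters that are invalid in filenames would need extra handling.

# Write each cleaned sheet to its own CSV file
for sheet_name, rows in all_cleaned_data.items():
    export_to_csv(rows, f"cleaned_{sheet_name}.csv")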

For more export options, check Export Excel to CSV with Python xlrd.

Conclusion

Cleaning Excel data with xlrd ensures data quality. Follow these best practices for reliable results.

Always validate files first. Handle missing values properly. Standardize text formatting. Remove duplicates systematically.

These techniques create clean, analysis-ready datasets. They save time and prevent errors in downstream processes.

Remember to test your cleaning pipeline thoroughly. Document your cleaning rules for reproducibility.