Last modified: Nov 25, 2025 by Alexander Williams

Create Tidy Data Tables from Excel Using Python pyexcel

Excel files often contain messy data. Python pyexcel helps clean them.

This guide shows how to transform spreadsheets into tidy data tables.

What is Tidy Data?

Tidy data follows specific rules. Each variable forms a column.

Each observation forms a row. Each cell contains one value.

Messy data violates these principles. It makes analysis difficult.
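
Here is a quick illustration, written as Python lists of lists. The messy table mirrors the file used in this guide; the tidy version is what we want to end up with.


# A messy slice of the example file: blank header, missing cells, stray column
messy = [
    ['Name', 'Age', 'City', None],
    ['John', 25, 'New York', 'Extra'],
    ['Jane', 30, None, None],
]

# The tidy equivalent: one variable per column, one observation per row,
# exactly one value per cell
tidy = [
    ['Name', 'Age', 'City'],
    ['John', 25, 'New York'],
    ['Jane', 30, 'Unknown'],
]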

Installing pyexcel

First, install pyexcel and the required plugins. Use the pip package manager.


pip install pyexcel pyexcel-xlsx pyexcel-xls

These packages handle Excel formats. They provide reading and writing support for .xlsx and .xls files.
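
To confirm the install, you can print the installed versions. This quick check only uses importlib.metadata from the standard library.


from importlib.metadata import version

# Print the installed version of each package as a sanity check
for package in ("pyexcel", "pyexcel-xlsx", "pyexcel-xls"):
    print(package, version(package))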

Loading Excel Files

Start by loading your Excel file. Use the get_array function.


import pyexcel as pe

# Load Excel file into array
data_array = pe.get_array(file_name="messy_data.xlsx")
print("Original data:")
print(data_array)

Original data:
[['Name', 'Age', 'City', None], ['John', 25, 'New York', 'Extra'], ['Jane', 30, None, None], [None, None, None, None], ['Bob', 'unknown', 'Chicago', 'Data']]

The output shows common data issues: missing values, an empty row, a stray extra column, and inconsistent formatting.
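
For a quicker visual check, the same file can be loaded as a Sheet object. Printing a Sheet shows a small formatted table instead of nested lists.


import pyexcel as pe

# A Sheet object prints as a readable table, which makes problems easy to spot
sheet = pe.get_sheet(file_name="messy_data.xlsx")
print(sheet)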

Identifying Data Problems

Real-world Excel files have various issues. Missing values appear as None or empty cells.

Inconsistent data types cause errors. Extra columns add noise.

Empty rows break data structure. Data validation is crucial.
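
Before cleaning, it helps to measure the damage. The helper below is a small sketch of our own (profile_columns is not a pyexcel function); it counts missing cells per column in the loaded array.


def profile_columns(data):
    """Count missing cells (None or empty string) per column."""
    headers = data[0]
    counts = {str(header): 0 for header in headers}
    for row in data[1:]:
        for i, header in enumerate(headers):
            cell = row[i] if i < len(row) else None
            if cell is None or cell == '':
                counts[str(header)] += 1
    return counts

print(profile_columns(data_array))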

Cleaning Data Step by Step

Remove empty rows first. They serve no purpose.


def remove_empty_rows(data):
    """Remove completely empty rows from dataset"""
    return [row for row in data if any(cell is not None and cell != '' for cell in row)]

cleaned_data = remove_empty_rows(data_array)
print("After removing empty rows:")
print(cleaned_data)

After removing empty rows:
[['Name', 'Age', 'City', None], ['John', 25, 'New York', 'Extra'], ['Jane', 30, None, None], ['Bob', 'unknown', 'Chicago', 'Data']]

Empty rows disappear. Data becomes more compact.
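
One caveat: this check treats whitespace-only strings as real values. If your files contain such rows, a slightly stricter variant (our own addition, not used in the outputs above) handles them too.


def remove_blank_rows(data):
    """Remove rows whose cells are all None, empty, or whitespace-only."""
    def is_blank(cell):
        return cell is None or (isinstance(cell, str) and cell.strip() == '')
    return [row for row in data if not all(is_blank(cell) for cell in row)]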

Handling Missing Values

Missing values disrupt analysis. Replace them appropriately.


def handle_missing_values(data):
    """Replace None values with appropriate defaults"""
    cleaned = []
    headers = data[0]
    
    for row in data[1:]:
        cleaned_row = []
        for i, cell in enumerate(row):
            if cell is None or cell == '':
                # Use appropriate default based on header
                header = str(headers[i]).lower() if i < len(headers) and headers[i] is not None else 'unknown'
                if 'age' in header:
                    cleaned_row.append(0)  # Default age
                else:
                    cleaned_row.append('Unknown')  # Default text
            else:
                cleaned_row.append(cell)
        cleaned.append(cleaned_row)
    
    return [headers] + cleaned

filled_data = handle_missing_values(cleaned_data)
print("After handling missing values:")
print(filled_data)

After handling missing values:
[['Name', 'Age', 'City', None], ['John', 25, 'New York', 'Unknown'], ['Jane', 30, 'Unknown', 'Unknown'], ['Bob', 'unknown', 'Chicago', 'Unknown']]

Missing values get sensible defaults. Data completeness improves.
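
Hard-coding defaults inside the function works for this small example. A per-column mapping scales better; the sketch below is one possible variation, and the column_defaults values are only examples.


def fill_missing(data, column_defaults, fallback='Unknown'):
    """Replace None or empty cells using a per-column default mapping."""
    headers = data[0]
    filled = [headers]
    for row in data[1:]:
        new_row = []
        for i, cell in enumerate(row):
            if cell is None or cell == '':
                header = str(headers[i]) if i < len(headers) else ''
                new_row.append(column_defaults.get(header, fallback))
            else:
                new_row.append(cell)
        filled.append(new_row)
    return filled

# Example defaults for the columns used in this guide
print(fill_missing(cleaned_data, {'Age': 0, 'City': 'Unknown'}))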

Standardizing Data Types

Inconsistent types cause errors. Convert them properly.


def standardize_data_types(data):
    """Convert data to appropriate types"""
    headers = data[0]
    standardized = [headers]
    
    for row in data[1:]:
        standardized_row = []
        for i, cell in enumerate(row):
            header = str(headers[i]).lower() if i < len(headers) and headers[i] is not None else 'unknown'
            
            if 'age' in header:
                try:
                    standardized_row.append(int(cell))
                except (ValueError, TypeError):
                    standardized_row.append(0)  # Default for invalid ages
            else:
                standardized_row.append(str(cell))
        
        standardized.append(standardized_row)
    
    return standardized

standardized_data = standardize_data_types(filled_data)
print("After standardizing data types:")
print(standardized_data)

After standardizing data types:
[['Name', 'Age', 'City', None], ['John', 25, 'New York', 'Unknown'], ['Jane', 30, 'Unknown', 'Unknown'], ['Bob', 0, 'Chicago', 'Unknown']]

All ages become integers. Text fields remain strings.
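
The same idea generalizes to other columns. One possible sketch maps each header to a converter function and keeps the original value when conversion fails; the converter choices here are just examples.


def convert_types(data, converters):
    """Apply a per-column converter; keep the original value on failure."""
    headers = data[0]
    converted = [headers]
    for row in data[1:]:
        new_row = []
        for i, cell in enumerate(row):
            header = str(headers[i]) if i < len(headers) else ''
            convert = converters.get(header, str)
            try:
                new_row.append(convert(cell))
            except (ValueError, TypeError):
                new_row.append(cell)  # leave unconvertible values untouched
        converted.append(new_row)
    return converted

# Age as int, every other column coerced to text
print(convert_types(filled_data, {'Age': int}))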

Removing Extra Columns

Unnecessary columns add noise. Identify and remove them.


def remove_extra_columns(data):
    """Remove columns that are mostly empty or redundant"""
    if not data:
        return data
    
    headers = data[0]
    # Count non-empty cells per column
    column_stats = []
    for col_index in range(len(headers)):
        non_empty_count = sum(1 for row in data[1:] 
                            if col_index < len(row) and 
                            row[col_index] not in [None, 'Unknown', ''])
        column_stats.append((headers[col_index], non_empty_count))
    
    # Keep columns where more than ~30% of data rows hold real values,
    # plus any column whose header names a known key field
    data_row_count = len(data) - 1
    keep_columns = [i for i, (header, count) in enumerate(column_stats)
                    if count > data_row_count * 0.3 or header in ['Name', 'Age', 'City']]
    
    cleaned_data = []
    for row in data:
        cleaned_row = [row[i] for i in keep_columns if i < len(row)]
        cleaned_data.append(cleaned_row)
    
    return cleaned_data

final_data = remove_extra_columns(standardized_data)
print("Final tidy data:")
print(final_data)

Final tidy data:
[['Name', 'Age', 'City'], ['John', 25, 'New York'], ['Jane', 30, 'Unknown'], ['Bob', 0, 'Chicago']]

Extra columns disappear. Only relevant data remains.
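
If you already know which columns you need, selecting them by name is simpler than scoring them. The keep list below is just an example.


def keep_named_columns(data, keep):
    """Keep only the columns whose header appears in the keep list."""
    headers = data[0]
    indexes = [i for i, header in enumerate(headers) if header in keep]
    return [[row[i] for i in indexes if i < len(row)] for row in data]

print(keep_named_columns(standardized_data, ['Name', 'Age', 'City']))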

Saving Clean Data

Save the cleaned data back to Excel. Use the save_as function.


# Save cleaned data to new Excel file
pe.save_as(array=final_data, dest_file_name="tidy_data.xlsx")
print("Clean data saved to tidy_data.xlsx")

The clean dataset saves successfully. It's ready for analysis.
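
The destination extension controls the output format. In a typical pyexcel install, CSV output needs no extra plugin, so the same array can also be written as CSV.


# Save the same cleaned array as CSV; the extension picks the format
pe.save_as(array=final_data, dest_file_name="tidy_data.csv")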

Complete Cleaning Function

Combine all steps into one function. It handles the entire process.


def excel_to_tidy_table(file_path, output_path):
    """
    Convert messy Excel file to tidy data table
    """
    # Load data
    raw_data = pe.get_array(file_name=file_path)
    
    # Remove empty rows
    cleaned = remove_empty_rows(raw_data)
    
    # Handle missing values
    filled = handle_missing_values(cleaned)
    
    # Standardize types
    standardized = standardize_data_types(filled)
    
    # Remove extra columns
    final = remove_extra_columns(standardized)
    
    # Save result
    pe.save_as(array=final, dest_file_name=output_path)
    return final

# Use the complete function
tidy_data = excel_to_tidy_table("messy_data.xlsx", "clean_data.xlsx")
print("Complete cleaning process finished")
print(tidy_data)

This function automates the entire cleaning workflow.

Advanced pyexcel Features

pyexcel offers more advanced capabilities. You can batch process Excel files efficiently.
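
As a sketch, the loop below applies the excel_to_tidy_table function from the previous section to every .xlsx file in a folder. The folder names are placeholders.


import glob
import os

input_dir = "raw_files"      # placeholder folder with messy workbooks
output_dir = "clean_files"   # placeholder folder for tidy output
os.makedirs(output_dir, exist_ok=True)

# Clean every .xlsx file and write the tidy copy under the same name
for path in glob.glob(os.path.join(input_dir, "*.xlsx")):
    name = os.path.basename(path)
    excel_to_tidy_table(path, os.path.join(output_dir, name))
    print("Cleaned", name)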

For complex data operations, consider loading Excel files into pandas.
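
One straightforward hand-off, assuming pandas is installed, is to build a DataFrame from the tidy array: the header row becomes the column names.


import pandas as pd

# First row becomes the column names, remaining rows become the data
df = pd.DataFrame(final_data[1:], columns=final_data[0])
print(df.describe(include="all"))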

Data validation is also important. Learn to validate spreadsheet structure and data types before processing, as sketched below.
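
A minimal structural check might confirm that the header row matches what downstream code expects. The expected list here is only an example.


def validate_headers(data, expected):
    """Return True if the first row matches the expected header list."""
    return bool(data) and data[0] == expected

if not validate_headers(final_data, ['Name', 'Age', 'City']):
    raise ValueError("Unexpected spreadsheet structure")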

Best Practices

Always back up original files. Cleaning can't be undone.

Test cleaning steps on copies. Verify results carefully.

Document all transformations. Others need to understand changes.

Consistent data structure enables reliable analysis.

Conclusion

Python pyexcel transforms messy Excel data into tidy tables.

The process involves loading, cleaning, and saving data.

Remove empty rows and handle missing values properly.

Standardize data types and remove unnecessary columns.

Tidy data enables better analysis and decision making.

Start cleaning your Excel files with pyexcel today.