Last modified: Nov 25, 2025, by Alexander Williams
Validate Spreadsheet Structure and Data Types with Python pyexcel
Spreadsheet validation is crucial for data integrity. It prevents errors from propagating through data processing pipelines. The Python library pyexcel makes this task simple and efficient.
This guide teaches you to validate spreadsheet structure and data types. You will learn practical techniques for ensuring data quality.
Why Validate Spreadsheets?
Data validation ensures your spreadsheets meet expected formats. It catches errors before they cause problems downstream.
Invalid data can break applications and lead to incorrect analysis results. Proper validation saves time and prevents costly mistakes.
When you batch process Excel files, validation gives you confidence that each file is safe to handle.
Installing pyexcel and Required Libraries
First, install pyexcel and the format plugins you need. Use pip for installation.
# Install pyexcel and xlsx support
pip install pyexcel pyexcel-xlsx
Successfully installed pyexcel-0.7.0 pyexcel-xlsx-0.6.0
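The multi-format example later in this guide also reads CSV and ODS files. CSV support ships with pyexcel's core I/O layer, but ODS needs its own plugin; installing pyexcel-ods3 covers it (the exact version pip resolves may differ):
# Optional: add ODS support for the multi-format example later in this guide
pip install pyexcel-ods3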
Basic Spreadsheet Structure Validation
Start by validating the basic structure. Check if required columns exist and are in the right order.
Use get_array() to read the spreadsheet into a list of rows. Then validate the header row.
import pyexcel as pe

def validate_structure(file_path, expected_headers):
    # Read the spreadsheet as a list of rows
    sheet = pe.get_array(file_name=file_path)

    # Get actual headers from the first row
    actual_headers = sheet[0]

    # Validate header structure
    if actual_headers != expected_headers:
        return False, f"Headers don't match. Expected: {expected_headers}, Got: {actual_headers}"

    return True, "Structure validation passed"

# Test the function
file_path = "sample_data.xlsx"
expected_headers = ["Name", "Age", "Email", "Salary"]

is_valid, message = validate_structure(file_path, expected_headers)
print(f"Validation result: {is_valid}")
print(f"Message: {message}")
Validation result: True
Message: Structure validation passed
Validating Data Types
Data type validation ensures values match expected formats. The example below checks integers, floats, and well-formed email addresses.
This prevents type errors during data processing. It also maintains data consistency.
def validate_data_types(file_path, column_types):
    sheet = pe.get_array(file_name=file_path)
    headers = sheet[0]
    errors = []

    for row_num, row in enumerate(sheet[1:], start=2):  # data rows start at row 2
        for col_index, (value, expected_type) in enumerate(zip(row, column_types)):
            header_name = headers[col_index]

            # Skip empty cells
            if not value:
                continue

            # Type validation logic
            if expected_type == "int":
                try:
                    int(value)
                except ValueError:
                    errors.append(f"Row {row_num}, Column '{header_name}': '{value}' is not an integer")
            elif expected_type == "float":
                try:
                    float(value)
                except ValueError:
                    errors.append(f"Row {row_num}, Column '{header_name}': '{value}' is not a float")
            elif expected_type == "email":
                # Basic sanity check; see the regex variant below for something stricter
                if "@" not in str(value) or "." not in str(value):
                    errors.append(f"Row {row_num}, Column '{header_name}': '{value}' is not a valid email")

    return len(errors) == 0, errors

# Define expected column types
column_types = ["str", "int", "email", "float"]

# Run validation
is_valid, error_messages = validate_data_types(file_path, column_types)
print(f"Data type validation passed: {is_valid}")
if not is_valid:
    for error in error_messages:
        print(error)
Data type validation passed: False
Row 3, Column 'Email': 'invalid-email' is not a valid email
Row 5, Column 'Age': 'twenty' is not an integer
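The substring test above only catches obviously malformed addresses. If you need something stricter, one option is a regular expression. This is a minimal sketch, not a full RFC 5322 validator:

import re

# Pragmatic pattern: something@something.tld (not full RFC 5322)
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value):
    """Return True if the value loosely resembles an email address."""
    return bool(EMAIL_PATTERN.match(str(value)))

print(is_valid_email("alice@example.com"))  # True
print(is_valid_email("invalid-email"))      # False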
Checking for Required Fields
Some fields cannot be empty. Validate that required columns contain data.
This ensures completeness of your dataset. Missing data can skew analysis results.
def validate_required_fields(file_path, required_columns):
    sheet = pe.get_array(file_name=file_path)
    headers = sheet[0]
    errors = []

    # Convert column names to indices (columns missing from the header
    # row are already caught by the structure check, so skip them here)
    required_indices = [headers.index(col) for col in required_columns if col in headers]

    for row_num, row in enumerate(sheet[1:], start=2):
        for col_index in required_indices:
            if not row[col_index]:  # empty-cell check; see the caveat below
                header_name = headers[col_index]
                errors.append(f"Row {row_num}: Required field '{header_name}' is empty")

    return len(errors) == 0, errors

# Validate required fields
required_columns = ["Name", "Email"]
is_valid, errors = validate_required_fields(file_path, required_columns)
print(f"Required fields validation: {is_valid}")
if errors:
    for error in errors:
        print(error)
Required fields validation: False
Row 4: Required field 'Email' is empty
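One caveat: the truthiness check above treats a cell containing only spaces as filled, and a numeric 0 as empty. If either case matters for your data, a small helper makes the rule explicit. This is_blank helper is a sketch, not part of pyexcel:

def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return False  # numbers, including 0, count as filled

# Usage inside the loop: replace `if not row[col_index]:`
# with `if is_blank(row[col_index]):`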
Validating Data Ranges and Constraints
Some data must fall within specific ranges. Validate numerical constraints and business rules.
This ensures data meets business requirements. It prevents unrealistic or impossible values.
def validate_data_ranges(file_path, range_rules):
    sheet = pe.get_array(file_name=file_path)
    headers = sheet[0]
    errors = []

    for row_num, row in enumerate(sheet[1:], start=2):
        for rule in range_rules:
            col_name = rule["column"]
            # Assumes the rule column exists; see the guard sketch below
            col_index = headers.index(col_name)
            value = row[col_index]

            if not value:
                continue

            # Convert to the appropriate type for comparison
            try:
                if rule["type"] == "int":
                    numeric_value = int(value)
                else:  # float
                    numeric_value = float(value)

                # Check minimum value
                if "min" in rule and numeric_value < rule["min"]:
                    errors.append(f"Row {row_num}, {col_name}: {value} is below minimum {rule['min']}")

                # Check maximum value
                if "max" in rule and numeric_value > rule["max"]:
                    errors.append(f"Row {row_num}, {col_name}: {value} is above maximum {rule['max']}")
            except ValueError:
                errors.append(f"Row {row_num}, {col_name}: '{value}' cannot be converted to {rule['type']}")

    return len(errors) == 0, errors

# Define range validation rules
range_rules = [
    {"column": "Age", "type": "int", "min": 18, "max": 65},
    {"column": "Salary", "type": "float", "min": 0},
]

is_valid, errors = validate_data_ranges(file_path, range_rules)
print(f"Range validation passed: {is_valid}")
if errors:
    for error in errors:
        print(error)
Range validation passed: False
Row 3, Age: 16 is below minimum 18
Row 6, Salary: -500 is below minimum 0
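Note that headers.index() raises ValueError if a rule names a column that is not in the sheet. A small guard at the top of the function avoids that; this sketch assumes you want missing columns reported rather than an exception:

def check_rule_columns(headers, range_rules):
    """Report any rule columns that are missing from the header row."""
    missing = [rule["column"] for rule in range_rules if rule["column"] not in headers]
    return [f"Range rule column '{col}' not found in headers" for col in missing]

# Usage: call at the start of validate_data_ranges and return early, e.g.
#   errors = check_rule_columns(headers, range_rules)
#   if errors:
#       return False, errors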
Comprehensive Validation Function
Combine all validation checks into one comprehensive function. This provides complete data quality assurance.
You can use this approach to clean and normalize spreadsheet data as part of your ETL process.
def comprehensive_validation(file_path, validation_config):
    """
    Perform comprehensive spreadsheet validation.

    validation_config should contain:
    - expected_headers: list of header names
    - column_types: list of expected types
    - required_columns: list of required column names
    - range_rules: list of range validation rules
    """
    all_errors = []

    # Structure validation
    is_valid, message = validate_structure(file_path, validation_config["expected_headers"])
    if not is_valid:
        all_errors.append(f"Structure Error: {message}")

    # Data type validation
    is_valid, errors = validate_data_types(file_path, validation_config["column_types"])
    all_errors.extend(errors)

    # Required fields validation
    is_valid, errors = validate_required_fields(file_path, validation_config["required_columns"])
    all_errors.extend(errors)

    # Range validation
    is_valid, errors = validate_data_ranges(file_path, validation_config["range_rules"])
    all_errors.extend(errors)

    return len(all_errors) == 0, all_errors

# Define the complete validation configuration
validation_config = {
    "expected_headers": ["Name", "Age", "Email", "Salary"],
    "column_types": ["str", "int", "email", "float"],
    "required_columns": ["Name", "Email"],
    "range_rules": [
        {"column": "Age", "type": "int", "min": 18, "max": 65},
        {"column": "Salary", "type": "float", "min": 0},
    ],
}

# Run comprehensive validation
is_valid, all_errors = comprehensive_validation(file_path, validation_config)
print(f"Comprehensive validation passed: {is_valid}")
if not is_valid:
    print("Validation errors found:")
    for error in all_errors:
        print(f"- {error}")
Comprehensive validation passed: False
Validation errors found:
- Row 3, Column 'Email': 'invalid-email' is not a valid email
- Row 3, Age: 16 is below minimum 18
- Row 4: Required field 'Email' is empty
- Row 5, Column 'Age': 'twenty' is not an integer
- Row 6, Salary: -500 is below minimum 0
Handling Multiple File Formats
Pyexcel supports various spreadsheet formats. Use the same validation logic across different file types.
You can apply the same validation to CSV, XLS, XLSX, and ODS files. Formats beyond CSV each need their matching pyexcel plugin installed (pyexcel-xls, pyexcel-xlsx, pyexcel-ods3).
def validate_multiple_formats(file_paths, validation_config):
    """
    Validate multiple spreadsheet files with different formats.
    """
    results = {}

    for file_path in file_paths:
        print(f"Validating {file_path}...")
        is_valid, errors = comprehensive_validation(file_path, validation_config)
        results[file_path] = {
            "is_valid": is_valid,
            "error_count": len(errors),
            "errors": errors,
        }

    return results

# Validate multiple file formats
files_to_validate = ["data.csv", "data.xlsx", "data.ods"]
results = validate_multiple_formats(files_to_validate, validation_config)

for file_path, result in results.items():
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{file_path}: {status} ({result['error_count']} errors)")
Validating data.csv...
Validating data.xlsx...
Validating data.ods...
data.csv: PASS (0 errors)
data.xlsx: FAIL (3 errors)
data.ods: PASS (0 errors)
Best Practices for Spreadsheet Validation
Validate early and often. Check data as soon as it enters your system. This prevents corruption of your data pipeline.
Provide clear error messages. Help users understand what needs fixing. Specific error messages save debugging time.
Log validation results. Track data quality over time. This helps identify recurring issues.
Consider implementing automated validation in your data workflows. This ensures consistent data quality standards.
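As a concrete starting point, here is a minimal sketch of such an automated workflow. It runs the comprehensive check over every XLSX file in a folder and logs the results; the folder name and log format are placeholders:

import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def validate_folder(folder, validation_config):
    """Run comprehensive_validation on every spreadsheet in a folder and log the outcome."""
    for path in sorted(Path(folder).glob("*.xlsx")):
        is_valid, errors = comprehensive_validation(str(path), validation_config)
        if is_valid:
            logging.info("%s: passed validation", path.name)
        else:
            logging.warning("%s: %d validation errors", path.name, len(errors))
            for error in errors:
                logging.warning("  %s", error)

# Example: validate_folder("incoming_data", validation_config)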
Conclusion
Spreadsheet validation with pyexcel ensures data quality and reliability. It prevents errors in downstream processing and analysis.
The techniques covered include structure validation, data type checking, required field verification, and range validation. These form a comprehensive data quality framework.
Implementing proper validation saves time and prevents costly data errors. It builds confidence in your data processing workflows.
Start validating your spreadsheets today. Ensure your data meets quality standards from the very beginning.