Last modified: Nov 25, 2025 By Alexander Williams
Build Simple ETL Pipelines in Python with pyexcel
ETL pipelines are essential for data processing. They extract, transform, and load data. Python makes ETL simple and powerful.
Pyexcel is a perfect tool for spreadsheet ETL. It handles Excel, CSV, and other formats. This guide shows you how to build pipelines.
What is ETL?
ETL stands for Extract, Transform, Load. It is a fundamental data process. Businesses use ETL daily for reporting.
Extraction gets data from sources. Transformation cleans and modifies data. Loading puts data into target systems.
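The three stages compose naturally as plain functions. Here is a minimal sketch of that shape; all names and data in it are illustrative stand-ins, not part of any library:

```python
# Minimal ETL skeleton: each stage is a plain function, composed at the end.
def extract():
    # Stand-in for reading a spreadsheet; returns rows as lists
    return [['Product', 'Sales'], ['Widget A', 1500]]

def transform(rows):
    # Stand-in transformation: uppercase the product names
    return [rows[0]] + [[product.upper(), sales] for product, sales in rows[1:]]

def load(rows):
    # Stand-in for writing to a target system
    for row in rows:
        print(row)

load(transform(extract()))
```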
Pyexcel simplifies working with spreadsheets. It provides a unified API for many formats. Your ETL code becomes format-agnostic.
Setting Up Pyexcel
First, install pyexcel and its plugins. Use pip for installation. Choose plugins for your file formats.
# Install pyexcel and Excel support
pip install pyexcel pyexcel-xlsx
# For other formats
pip install pyexcel-ods pyexcel-xls
Import pyexcel in your Python script, and you are ready to build ETL pipelines.
Extracting Data with Pyexcel
Extraction is the first ETL step. Pyexcel can read from files, streams, or memory. The get_array function loads spreadsheet data.
import pyexcel as pe
# Extract data from Excel file
data_array = pe.get_array(file_name="sales_data.xlsx")
print("Extracted data:")
for row in data_array:
    print(row)
Extracted data:
['Date', 'Product', 'Sales', 'Region']
['2023-01-01', 'Widget A', 1500, 'North']
['2023-01-01', 'Widget B', 800, 'South']
['2023-01-02', 'Widget A', 1200, 'East']
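If you prefer rows keyed by column name, pyexcel also provides get_records, which yields one dict per data row. The snippet below sketches the same idea in plain Python, using the sample values from the output above so each step is visible:

```python
# Turn the extracted array into dicts keyed by the header row.
# pyexcel's pe.get_records(file_name=...) does this directly;
# this plain-Python version shows the mechanics.
data_array = [
    ['Date', 'Product', 'Sales', 'Region'],
    ['2023-01-01', 'Widget A', 1500, 'North'],
    ['2023-01-01', 'Widget B', 800, 'South'],
    ['2023-01-02', 'Widget A', 1200, 'East'],
]

headers, *rows = data_array
records = [dict(zip(headers, row)) for row in rows]

print(records[0])
# {'Date': '2023-01-01', 'Product': 'Widget A', 'Sales': 1500, 'Region': 'North'}
```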
For more complex extraction scenarios with multiple data sources, you may need to batch process Excel files with Python pyexcel.
Transforming Data
Transformation cleans and prepares data. Common tasks include filtering, calculating, and restructuring. Pyexcel data is easy to manipulate.
def transform_sales_data(raw_data):
    """Transform raw sales data."""
    # Skip header row
    data_rows = raw_data[1:]
    transformed = []
    for row in data_rows:
        date, product, sales, region = row
        # Calculate sales tax (10%)
        sales_tax = sales * 0.10
        total_amount = sales + sales_tax
        # Create new row with calculations
        new_row = [date, product, sales, sales_tax, total_amount, region]
        transformed.append(new_row)
    # Add new headers
    headers = ['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region']
    return [headers] + transformed

# Apply transformation
transformed_data = transform_sales_data(data_array)
print("Transformed data:")
for row in transformed_data:
    print(row)
Transformed data:
['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region']
['2023-01-01', 'Widget A', 1500, 150.0, 1650.0, 'North']
['2023-01-01', 'Widget B', 800, 80.0, 880.0, 'South']
['2023-01-02', 'Widget A', 1200, 120.0, 1320.0, 'East']
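Transformation is not limited to per-row calculations. A common next step is aggregation, for example summing the Total column per region. This sketch works on the transformed rows shown above:

```python
# Sum the Total column per region (rows copied from the output above).
transformed_data = [
    ['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region'],
    ['2023-01-01', 'Widget A', 1500, 150.0, 1650.0, 'North'],
    ['2023-01-01', 'Widget B', 800, 80.0, 880.0, 'South'],
    ['2023-01-02', 'Widget A', 1200, 120.0, 1320.0, 'East'],
]

totals_by_region = {}
for row in transformed_data[1:]:  # skip the header row
    region, total = row[5], row[4]
    totals_by_region[region] = totals_by_region.get(region, 0) + total

print(totals_by_region)
# {'North': 1650.0, 'South': 880.0, 'East': 1320.0}
```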
Transformation often involves creating tidy data tables from Excel using Python pyexcel, which helps ensure data quality and consistency.
Loading Data
Loading saves processed data. Pyexcel can write to various formats. The save_as function handles output.
# Load transformed data to new Excel file
pe.save_as(array=transformed_data, dest_file_name="processed_sales.xlsx")
# Or save as CSV
pe.save_as(array=transformed_data, dest_file_name="processed_sales.csv")
print("Data loaded successfully")
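The save_as function infers the output format from the file extension, which is what keeps the pipeline format-agnostic. If you only need CSV output and pyexcel is unavailable, the standard library's csv module is a workable fallback; the file path below is just an example:

```python
import csv
import os
import tempfile

transformed_data = [
    ['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region'],
    ['2023-01-01', 'Widget A', 1500, 150.0, 1650.0, 'North'],
]

# Write the rows with the stdlib csv module instead of pe.save_as
out_path = os.path.join(tempfile.gettempdir(), "processed_sales.csv")
with open(out_path, "w", newline="") as f:
    csv.writer(f).writerows(transformed_data)

with open(out_path) as f:
    print(f.readline().strip())
# Date,Product,Sales,Tax,Total,Region
```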
For database integration, you can export database query results to Excel with Python pyexcel as part of your loading process.
Complete ETL Pipeline Example
Here is a complete ETL pipeline. It combines extraction, transformation, and loading. This example processes sales data.
import pyexcel as pe
def etl_pipeline(input_file, output_file):
    """Complete ETL pipeline for sales data."""
    # EXTRACT
    print("Extracting data...")
    raw_data = pe.get_array(file_name=input_file)

    # TRANSFORM
    print("Transforming data...")
    # Remove empty rows and clean data
    cleaned_data = [row for row in raw_data if any(cell for cell in row)]
    # Add performance metrics
    headers = cleaned_data[0] + ['Performance']
    transformed_rows = [headers]
    for row in cleaned_data[1:]:
        if len(row) >= 3:  # Ensure valid data
            sales = row[2] if isinstance(row[2], (int, float)) else 0
            performance = 'High' if sales > 1000 else 'Standard'
            transformed_rows.append(row + [performance])

    # LOAD
    print("Loading data...")
    pe.save_as(array=transformed_rows, dest_file_name=output_file)
    print(f"ETL pipeline complete. Output: {output_file}")
    return transformed_rows

# Run the pipeline
result = etl_pipeline("input_sales.xlsx", "output_sales.xlsx")

# Display sample of final data
print("Final data sample:")
for row in result[:3]:
    print(row)
Extracting data...
Transforming data...
Loading data...
ETL pipeline complete. Output: output_sales.xlsx
Final data sample:
['Date', 'Product', 'Sales', 'Region', 'Performance']
['2023-01-01', 'Widget A', 1500, 'North', 'High']
['2023-01-01', 'Widget B', 800, 'South', 'Standard']
Error Handling and Validation
Robust ETL pipelines need error handling. Validate data quality and handle exceptions. This prevents pipeline failures.
def safe_etl_pipeline(input_file, output_file):
    """ETL pipeline with error handling."""
    try:
        # Extract with validation
        if not input_file.endswith(('.xlsx', '.xls', '.csv')):
            raise ValueError("Unsupported file format")
        raw_data = pe.get_array(file_name=input_file)
        if not raw_data or len(raw_data) < 2:
            raise ValueError("No data found in file")

        # Transform with data validation
        transformed_data = []
        for i, row in enumerate(raw_data):
            if i == 0:  # Header row
                transformed_data.append(row + ['Status'])
            else:
                # Validate sales data
                try:
                    sales = float(row[2]) if len(row) > 2 else 0
                    status = 'Valid' if sales >= 0 else 'Invalid'
                    transformed_data.append(row + [status])
                except (ValueError, TypeError):
                    transformed_data.append(row + ['Data Error'])

        # Load data
        pe.save_as(array=transformed_data, dest_file_name=output_file)
        print("ETL completed successfully")
    except Exception as e:
        print(f"ETL pipeline failed: {e}")
        return None

# Test the robust pipeline
safe_etl_pipeline("sales_data.xlsx", "validated_sales.xlsx")
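The row-validation logic above is a good candidate for extraction into a small pure function, which can then be unit-tested without touching any files. A sketch, mirroring the same rules:

```python
def validate_row(row, sales_index=2):
    """Return the row with a Status column appended.

    Mirrors the validation rules in safe_etl_pipeline: non-negative
    numeric sales are 'Valid', negative sales are 'Invalid', and
    unparseable values are 'Data Error'.
    """
    try:
        sales = float(row[sales_index]) if len(row) > sales_index else 0
        status = 'Valid' if sales >= 0 else 'Invalid'
    except (ValueError, TypeError):
        status = 'Data Error'
    return row + [status]

print(validate_row(['2023-01-01', 'Widget A', 1500, 'North']))
# ['2023-01-01', 'Widget A', 1500, 'North', 'Valid']
print(validate_row(['2023-01-01', 'Widget B', 'oops', 'South']))
# ['2023-01-01', 'Widget B', 'oops', 'South', 'Data Error']
```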
Best Practices for Pyexcel ETL
Follow these best practices. They ensure reliable and maintainable pipelines. Your ETL processes will be production-ready.
Use meaningful variable names. Code becomes self-documenting. Other developers can understand your logic.
Implement proper error handling. ETL pipelines fail without validation. Catch exceptions and log errors.
Test with sample data first. Verify transformations work correctly. Then scale to production data.
Document data transformations. Keep records of business rules. This helps with debugging and audits.
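Testing with sample data can be as simple as asserting on a tiny in-memory array before pointing the pipeline at real files. A sketch, reusing the tax-and-total rules from the transformation step earlier:

```python
def transform_sales_data(raw_data):
    """Add a 10% tax and a total column (same rules as the transform step above)."""
    headers = ['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region']
    transformed = [headers]
    for date, product, sales, region in raw_data[1:]:
        tax = sales * 0.10
        transformed.append([date, product, sales, tax, sales + tax, region])
    return transformed

# A tiny sample keeps the test fast and the expected values obvious
sample = [
    ['Date', 'Product', 'Sales', 'Region'],
    ['2023-01-01', 'Widget A', 1000, 'North'],
]
result = transform_sales_data(sample)
assert result[1][3] == 100.0   # tax
assert result[1][4] == 1100.0  # total
print("transform checks passed")
```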
Conclusion
Pyexcel makes ETL pipeline development accessible. You can extract, transform, and load spreadsheet data efficiently. The library handles multiple formats seamlessly.
Start with simple pipelines. Add complexity as needed. Remember to validate data and handle errors.
Python and pyexcel provide a powerful combination. You can build robust data processing systems. Your business intelligence will improve significantly.
Explore more pyexcel features as you advance. The library offers many capabilities for data manipulation. Your ETL pipelines will become more sophisticated over time.