Last modified: Nov 25, 2025 by Alexander Williams

Build Simple ETL Pipelines in Python with pyexcel

ETL pipelines are essential for data processing. They extract, transform, and load data. Python makes ETL simple and powerful.

Pyexcel is a good fit for spreadsheet ETL. It handles Excel, CSV, and several other formats through plugins. This guide shows you how to build simple pipelines with it.

What is ETL?

ETL stands for Extract, Transform, Load. It is a fundamental data process. Businesses use ETL daily for reporting.

Extraction gets data from sources. Transformation cleans and modifies data. Loading puts data into target systems.

Pyexcel simplifies working with spreadsheets. It provides a unified API for many formats. Your ETL code becomes format-agnostic.

Setting Up Pyexcel

First, install pyexcel and its plugins. Use pip for installation. Choose plugins for your file formats.


# Install pyexcel and .xlsx support
pip install pyexcel pyexcel-xlsx

# Optional plugins: OpenDocument (.ods) and legacy .xls
pip install pyexcel-ods pyexcel-xls

Then import pyexcel in your Python script, and you are ready to build ETL pipelines.
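
A quick way to confirm the setup is a small round trip: write a tiny sheet and read it back. The file name demo.csv here is just a throwaway example.


import pyexcel as pe

# Write a two-row sheet, then read it back to verify the install
pe.save_as(array=[["a", "b"], [1, 2]], dest_file_name="demo.csv")
print(pe.get_array(file_name="demo.csv"))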

Extracting Data with Pyexcel

Extraction is the first ETL step. Pyexcel can read from files, streams, or memory. The get_array function loads spreadsheet data.


import pyexcel as pe

# Extract data from Excel file
data_array = pe.get_array(file_name="sales_data.xlsx")

print("Extracted data:")
for row in data_array:
    print(row)

Extracted data:
['Date', 'Product', 'Sales', 'Region']
['2023-01-01', 'Widget A', 1500, 'North']
['2023-01-01', 'Widget B', 800, 'South']
['2023-01-02', 'Widget A', 1200, 'East']
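
Extraction does not have to start from a file on disk. The get_array function also accepts in-memory content via the file_content and file_type keywords, which is useful when data arrives from an API response or a file upload. A minimal sketch with inline CSV content:


import pyexcel as pe

# Parse CSV text that is already in memory (no file on disk)
csv_content = "Date,Product,Sales,Region\n2023-01-03,Widget C,950,West"
rows = pe.get_array(file_type="csv", file_content=csv_content)
print(rows)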

For more complex extraction scenarios, you might need to batch process Excel files with Python pyexcel when dealing with multiple data sources.
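
A simple version of that is to glob a folder and extract each workbook in a loop. This sketch assumes the source files live in a hypothetical data/ folder and share the same header row:


import glob
import pyexcel as pe

files = glob.glob("data/*.xlsx")  # hypothetical source folder
all_rows = []
for path in files:
    rows = pe.get_array(file_name=path)
    all_rows.extend(rows[1:])  # drop each file's header, keep data rows

print(f"Extracted {len(all_rows)} rows from {len(files)} files")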

Transforming Data

Transformation cleans and prepares data. Common tasks include filtering, calculating, and restructuring. Pyexcel data is easy to manipulate.


def transform_sales_data(raw_data):
    """Transform raw sales data"""
    # Skip header row
    data_rows = raw_data[1:]
    
    transformed = []
    for row in data_rows:
        date, product, sales, region = row
        
        # Calculate sales tax (10%)
        sales_tax = sales * 0.10
        total_amount = sales + sales_tax
        
        # Create new row with calculations
        new_row = [date, product, sales, sales_tax, total_amount, region]
        transformed.append(new_row)
    
    # Add new headers
    headers = ['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region']
    return [headers] + transformed

# Apply transformation
transformed_data = transform_sales_data(data_array)

print("Transformed data:")
for row in transformed_data:
    print(row)

Transformed data:
['Date', 'Product', 'Sales', 'Tax', 'Total', 'Region']
['2023-01-01', 'Widget A', 1500, 150.0, 1650.0, 'North']
['2023-01-01', 'Widget B', 800, 80.0, 880.0, 'South']
['2023-01-02', 'Widget A', 1200, 120.0, 1320.0, 'East']

Transformation often requires creating tidy data tables from Excel using Python pyexcel to ensure data quality and consistency.
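
One convenient route to tidy rows is pyexcel's get_records, which returns each row as a dict keyed by the header. That makes filtering and per-column access read naturally. A sketch against the sample file above:


import pyexcel as pe

# Each record is a dict keyed by the header row
records = pe.get_records(file_name="sales_data.xlsx")

north = [r for r in records if r["Region"] == "North"]
for r in north:
    print(r["Product"], r["Sales"])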

Loading Data

Loading saves processed data. Pyexcel can write to various formats. The save_as function handles output.


# Load transformed data to new Excel file
pe.save_as(array=transformed_data, dest_file_name="processed_sales.xlsx")

# Or save as CSV
pe.save_as(array=transformed_data, dest_file_name="processed_sales.csv")

print("Data loaded successfully")

For database integration, you can export database query results to Excel with Python pyexcel as part of your loading process.
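
As a rough sketch of that idea, you can pair pyexcel with the standard-library sqlite3 module: run a query, prepend a header row, and hand the array to save_as. The database file and table names here are hypothetical.


import sqlite3
import pyexcel as pe

conn = sqlite3.connect("sales.db")  # hypothetical database file
cursor = conn.execute("SELECT date, product, sales FROM sales")  # hypothetical table
rows = [["Date", "Product", "Sales"]] + [list(r) for r in cursor]
conn.close()

pe.save_as(array=rows, dest_file_name="db_export.xlsx")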

Complete ETL Pipeline Example

Here is a complete ETL pipeline. It combines extraction, transformation, and loading. This example processes sales data.


import pyexcel as pe

def etl_pipeline(input_file, output_file):
    """Complete ETL pipeline for sales data"""
    
    # EXTRACT
    print("Extracting data...")
    raw_data = pe.get_array(file_name=input_file)
    
    # TRANSFORM
    print("Transforming data...")
    
    # Remove empty rows and clean data
    cleaned_data = [row for row in raw_data if any(cell for cell in row)]
    
    # Add performance metrics
    headers = cleaned_data[0] + ['Performance']
    transformed_rows = [headers]
    
    for row in cleaned_data[1:]:
        if len(row) >= 3:  # Ensure valid data
            sales = row[2] if isinstance(row[2], (int, float)) else 0
            performance = 'High' if sales > 1000 else 'Standard'
            transformed_rows.append(row + [performance])
    
    # LOAD
    print("Loading data...")
    pe.save_as(array=transformed_rows, dest_file_name=output_file)
    
    print(f"ETL pipeline complete. Output: {output_file}")
    return transformed_rows

# Run the pipeline
result = etl_pipeline("input_sales.xlsx", "output_sales.xlsx")

# Display sample of final data
print("Final data sample:")
for i, row in enumerate(result[:3]):
    print(row)

Extracting data...
Transforming data...
Loading data...
ETL pipeline complete. Output: output_sales.xlsx
Final data sample:
['Date', 'Product', 'Sales', 'Region', 'Performance']
['2023-01-01', 'Widget A', 1500, 'North', 'High']
['2023-01-01', 'Widget B', 800, 'South', 'Standard']

Error Handling and Validation

Robust ETL pipelines need error handling. Validate data quality and handle exceptions. This prevents pipeline failures.


def safe_etl_pipeline(input_file, output_file):
    """ETL pipeline with error handling"""
    try:
        # Extract with validation
        if not input_file.endswith(('.xlsx', '.xls', '.csv')):
            raise ValueError("Unsupported file format")
        
        raw_data = pe.get_array(file_name=input_file)
        
        if not raw_data or len(raw_data) < 2:
            raise ValueError("No data found in file")
        
        # Transform with data validation
        transformed_data = []
        for i, row in enumerate(raw_data):
            if i == 0:  # Header row
                transformed_data.append(row + ['Status'])
            else:
                # Validate sales data
                try:
                    sales = float(row[2]) if len(row) > 2 else 0
                    status = 'Valid' if sales >= 0 else 'Invalid'
                    transformed_data.append(row + [status])
                except (ValueError, TypeError):
                    transformed_data.append(row + ['Data Error'])
        
        # Load data
        pe.save_as(array=transformed_data, dest_file_name=output_file)
        print("ETL completed successfully")
        return transformed_data
        
    except Exception as e:
        print(f"ETL pipeline failed: {str(e)}")
        return None

# Test the robust pipeline
safe_etl_pipeline("sales_data.xlsx", "validated_sales.xlsx")

Best Practices for Pyexcel ETL

Follow these best practices. They help keep pipelines reliable and maintainable. Your ETL processes will be much closer to production-ready.

Use meaningful variable names. Code becomes self-documenting. Other developers can understand your logic.

Implement proper error handling. ETL pipelines fail without validation. Catch exceptions and log errors (see the sketch below).

Test with sample data first. Verify transformations work correctly. Then scale to production data.

Document data transformations. Keep records of business rules. This helps with debugging and audits.
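
For the logging advice above, the standard-library logging module is usually enough. A minimal sketch that records extraction outcomes to a log file instead of printing them (etl.log is just an example path):


import logging
import pyexcel as pe

logging.basicConfig(filename="etl.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def logged_extract(input_file):
    """Extract step that logs outcomes instead of printing."""
    try:
        data = pe.get_array(file_name=input_file)
        logging.info("Extracted %d rows from %s", len(data), input_file)
        return data
    except Exception:
        logging.exception("Extraction failed for %s", input_file)
        return None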

Conclusion

Pyexcel makes ETL pipeline development accessible. You can extract, transform, and load spreadsheet data efficiently. The library handles multiple formats seamlessly.

Start with simple pipelines. Add complexity as needed. Remember to validate data and handle errors.

Python and pyexcel provide a powerful combination. You can build robust data processing around spreadsheet data. Reliable pipelines give your reporting and business intelligence a solid foundation.

Explore more pyexcel features as you advance. The library offers many capabilities for data manipulation. Your ETL pipelines will become more sophisticated over time.