
Build a Data Pipeline with Polars

Data pipelines are the backbone of modern analytics. They move data from source to destination. They clean and transform it along the way.

Polars is a fast DataFrame library for Python, built in Rust. It handles large datasets with ease, which makes it a great fit for ETL pipelines.

This article shows a complete ETL example. We will extract, transform, and load data using Polars. The code is simple and readable. You can adapt it for your own projects.

If you are new to Polars, check out our Polars vs Pandas: Real Benchmarks guide to see why it is so fast.

What is an ETL Pipeline?

ETL stands for Extract, Transform, Load. It is a three-step process.

  • Extract: Read data from sources like CSV, JSON, or databases.
  • Transform: Clean, filter, and reshape the data.
  • Load: Write the final data to a destination like a file or database.

Polars makes each step fast and memory-efficient.
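
The whole flow fits in a few lines. Here is a minimal sketch of the three steps; the file names and the "amount" column are placeholders, and the full worked example follows below.

import polars as pl

# Extract: read raw data (placeholder file name)
df = pl.read_csv("input.csv")

# Transform: add a derived column (assumes a numeric "amount" column)
df = df.with_columns((pl.col("amount") * 2).alias("amount_doubled"))

# Load: write the result to columnar storage
df.write_parquet("output.parquet")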

Setting Up Your Environment

First, install Polars. Use pip.


# Install Polars
pip install polars

We also need sample data. Create a CSV file called sales.csv.


# sales.csv content
date,product,quantity,price
2024-01-01,Widget A,10,5.99
2024-01-01,Widget B,5,12.49
2024-01-02,Widget A,8,5.99
2024-01-02,Widget C,3,8.99
2024-01-03,Widget B,12,12.49
2024-01-03,Widget A,15,5.99
2024-01-04,Widget C,7,8.99
2024-01-04,Widget B,9,12.49

Step 1: Extract Data with Polars

We use pl.read_csv() to extract data. It is fast, and it reads the whole file eagerly into memory.


import polars as pl

# Extract data from CSV
df = pl.read_csv("sales.csv")
print(df)

shape: (8, 4)
┌────────────┬───────────┬──────────┬───────┐
│ date       ┆ product   ┆ quantity ┆ price │
│ ---        ┆ ---       ┆ ---      ┆ ---   │
│ str        ┆ str       ┆ i64      ┆ f64   │
╞════════════╪═══════════╪══════════╪═══════╡
│ 2024-01-01 ┆ Widget A  ┆ 10       ┆ 5.99  │
│ 2024-01-01 ┆ Widget B  ┆ 5        ┆ 12.49 │
│ 2024-01-02 ┆ Widget A  ┆ 8        ┆ 5.99  │
│ 2024-01-02 ┆ Widget C  ┆ 3        ┆ 8.99  │
│ 2024-01-03 ┆ Widget B  ┆ 12       ┆ 12.49 │
│ 2024-01-03 ┆ Widget A  ┆ 15       ┆ 5.99  │
│ 2024-01-04 ┆ Widget C  ┆ 7        ┆ 8.99  │
│ 2024-01-04 ┆ Widget B  ┆ 9        ┆ 12.49 │
└────────────┴───────────┴──────────┴───────┘

For large files, use pl.scan_csv() instead. It reads lazily. See our guide on Scan Large Files with Polars Without Memory Load for details.
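
Here is a minimal sketch of the lazy variant. Nothing is read from disk until .collect() runs, so Polars can push filters down into the scan; the quantity filter is just an illustration.

import polars as pl

# Build a query plan instead of reading the file immediately
lf = pl.scan_csv("sales.csv")

# Transformations are recorded, not executed
lf = lf.filter(pl.col("quantity") > 5)

# collect() runs the optimized plan and returns a DataFrame
df_lazy = lf.collect()
print(df_lazy)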

Step 2: Transform Data

Transformation is where Polars shines. We chain expressions for clean code.

2.1 Parse Dates

The date column is a string. Convert it to a date type.


df = df.with_columns(
    pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")
)
print(df)

shape: (8, 4)
┌────────────┬───────────┬──────────┬───────┐
│ date       ┆ product   ┆ quantity ┆ price │
│ ---        ┆ ---       ┆ ---      ┆ ---   │
│ date       ┆ str       ┆ i64      ┆ f64   │
╞════════════╪═══════════╪══════════╪═══════╡
│ 2024-01-01 ┆ Widget A  ┆ 10       ┆ 5.99  │
│ 2024-01-01 ┆ Widget B  ┆ 5        ┆ 12.49 │
...
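
Recent Polars releases also provide the shorthand .str.to_date(), which does the same thing in one call:

# Equivalent shorthand on recent Polars versions
df = df.with_columns(pl.col("date").str.to_date("%Y-%m-%d"))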

2.2 Add Calculated Columns

Add a total sales column. Multiply quantity by price.


df = df.with_columns(
    (pl.col("quantity") * pl.col("price")).alias("total_sales")
)
print(df)

shape: (8, 5)
┌────────────┬───────────┬──────────┬───────┬─────────────┐
│ date       ┆ product   ┆ quantity ┆ price ┆ total_sales │
│ ---        ┆ ---       ┆ ---      ┆ ---   ┆ ---         │
│ date       ┆ str       ┆ i64      ┆ f64   ┆ f64         │
╞════════════╪═══════════╪══════════╪═══════╪═════════════╡
│ 2024-01-01 ┆ Widget A  ┆ 10       ┆ 5.99  ┆ 59.9        │
│ 2024-01-01 ┆ Widget B  ┆ 5        ┆ 12.49 ┆ 62.45       │
...

2.3 Filter and Aggregate

Keep only sales above $50. Then group by product.


df_filtered = df.filter(pl.col("total_sales") > 50.0)

df_aggregated = df_filtered.group_by("product").agg([
    pl.sum("quantity").alias("total_quantity"),
    pl.sum("total_sales").alias("revenue")
])

print(df_aggregated)

shape: (3, 3)
┌───────────┬────────────────┬─────────┐
│ product   ┆ total_quantity ┆ revenue │
│ ---       ┆ ---            ┆ ---     │
│ str       ┆ i64            ┆ f64     │
╞═══════════╪════════════════╪═════════╡
│ Widget B  ┆ 26             ┆ 324.74  │
│ Widget A  ┆ 25             ┆ 149.75  │
│ Widget C  ┆ 7              ┆ 62.93   │
└───────────┴────────────────┴─────────┘

This is a simple example. For more complex logic, use map_elements or map_batches. Read our guide on Polars Custom Functions with map_elements & map_batches.
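
As a quick, hedged illustration, map_elements() runs a plain Python function on every value. It is slower than native expressions, so save it for logic you cannot express otherwise; the product-code rule here is invented for the example.

# Hypothetical rule: derive a short code from the product name
df_labeled = df.with_columns(
    pl.col("product")
    .map_elements(lambda name: name.split()[-1], return_dtype=pl.String)
    .alias("product_code")
)
print(df_labeled)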

Step 3: Load Data

Loading is the final step. Write the transformed data to a new file.


# Write to Parquet (fast, compressed)
df_aggregated.write_parquet("sales_summary.parquet")

# Also write to CSV for inspection
df_aggregated.write_csv("sales_summary.csv")
print("Data loaded successfully!")

Data loaded successfully!

Parquet is great for storage. It is columnar and fast to read. CSV is good for sharing with others.
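
Reading the summary back is a one-liner, and pl.scan_parquet() lets you keep working lazily:

# Read the Parquet summary back into memory
summary = pl.read_parquet("sales_summary.parquet")

# Or scan it lazily, e.g. to find the top product by revenue
top_product = (
    pl.scan_parquet("sales_summary.parquet")
    .sort("revenue", descending=True)
    .head(1)
    .collect()
)
print(top_product)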

Full ETL Pipeline Script

Here is the complete script. It combines all steps.


import polars as pl

def run_etl():
    # Extract
    df = pl.read_csv("sales.csv")
    
    # Transform
    df = df.with_columns(
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")
    )
    df = df.with_columns(
        (pl.col("quantity") * pl.col("price")).alias("total_sales")
    )
    df_filtered = df.filter(pl.col("total_sales") > 50.0)
    df_result = df_filtered.group_by("product").agg([
        pl.sum("quantity").alias("total_quantity"),
        pl.sum("total_sales").alias("revenue")
    ])
    
    # Load
    df_result.write_parquet("sales_summary.parquet")
    df_result.write_csv("sales_summary.csv")
    
    return df_result

if __name__ == "__main__":
    final_data = run_etl()
    print(final_data)

Performance Tips for Your Pipeline

Polars is fast by default. But you can make it even faster.

  • Use lazy evaluation with pl.scan_csv() for large files. Polars optimizes the query plan (see the sketch after this list).
  • Chain expressions instead of creating intermediate DataFrames. This reduces memory use.
  • Use Parquet for output. It is faster than CSV for later reads.
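
Putting the first two tips together, here is a hedged sketch of the same pipeline as one lazy query. Polars sees the whole plan at once and can reorder and fuse the steps before touching the file.

import polars as pl

# One lazy query: scan, transform, filter, aggregate, then collect
result = (
    pl.scan_csv("sales.csv")
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        (pl.col("quantity") * pl.col("price")).alias("total_sales"),
    )
    .filter(pl.col("total_sales") > 50.0)
    .group_by("product")
    .agg([
        pl.sum("quantity").alias("total_quantity"),
        pl.sum("total_sales").alias("revenue"),
    ])
    .collect()
)
result.write_parquet("sales_summary.parquet")

For inputs too large to materialize, LazyFrame.sink_parquet() can stream the result straight to disk instead of collect().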

For deeper tuning, see our guide on Polars Multi-threading & Performance Tuning.

Conclusion

Building a data pipeline with Polars is simple and powerful. The code is clean and readable. The performance is excellent.

We extracted data from CSV. We transformed it with date parsing, calculations, and aggregation. We loaded it to Parquet and CSV.

Polars handles the heavy lifting. You focus on the logic. Start building your own ETL pipeline today.

For more advanced topics, explore our Polars Chaining Expressions Guide.