Last modified: Dec 22, 2025 by Alexander Williams

Clean Data with Python: A Step-by-Step Guide

Data is the fuel for modern analysis, but raw data is often messy. Cleaning it is a crucial first step. This guide shows you how to do it with Python.

We will use the powerful pandas library. It provides tools to handle common data issues. Let's start by importing it.


# Import the pandas library
import pandas as pd
    

Load Your Data

First, you need to get your data into Python. Pandas can read many formats. Common ones are CSV and Excel files.

For a deeper dive into data manipulation, see our Master Data Analysis with Pandas Python Guide.


# Load data from a CSV file
df = pd.read_csv('sales_data.csv')
print(df.head())
    

   Order ID  Customer      Product  Quantity  Price
0      1001     Alice      Laptop         1   999.99
1      1002       Bob  Smartphone         2   499.50
2      1003   Charlie      Tablet         1   299.00
3      1004     Diana      Laptop         1   999.99
4      1005        NaN  Smartphone         3   499.50
    

Handle Missing Values

Missing data is a common problem. It appears as NaN in pandas. You must decide how to handle it.

You can remove rows with missing values using dropna(). Or you can fill them with a placeholder or a statistic, such as the mean, using fillna().


# Check for missing values
print(df.isnull().sum())

# Option 1: Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped.shape)

# Option 2: Fill missing customer names with 'Unknown'
df_filled = df.fillna({'Customer': 'Unknown'})
print(df_filled.head())
    

Order ID       0
Customer       1
Product        0
Quantity       0
Price          0
dtype: int64
(4, 5)
   Order ID  Customer      Product  Quantity  Price
0      1001     Alice      Laptop         1   999.99
1      1002       Bob  Smartphone         2   499.50
2      1003   Charlie      Tablet         1   299.00
3      1004     Diana      Laptop         1   999.99
4      1005    Unknown  Smartphone         3   499.50
    
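When the gap is in a numeric column, a statistic usually makes a better fill than a label. Here is a minimal sketch, using a small made-up DataFrame (not the sales data above) with a missing Quantity:

```python
import pandas as pd
import numpy as np

# A small example frame with a missing numeric value
df_example = pd.DataFrame({
    'Order ID': [1001, 1002, 1003],
    'Quantity': [1, np.nan, 3],
})

# Fill the gap with the column median (median of [1, 3] is 2.0)
median_qty = df_example['Quantity'].median()
df_example['Quantity'] = df_example['Quantity'].fillna(median_qty)
print(df_example['Quantity'].tolist())  # [1.0, 2.0, 3.0]
```

The median is often preferred over the mean here because it is not pulled around by outliers.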

Fix Data Types

Columns sometimes have the wrong data type. A price might be read as text. This stops math operations.

Use the astype() method to convert types. Ensure your data is numeric for calculations.


# Check current data types
print(df.dtypes)

# Convert 'Price' to float (if it was a string)
df['Price'] = df['Price'].astype(float)
print(df.dtypes)
    

Order ID      int64
Customer     object
Product      object
Quantity      int64
Price       float64
dtype: object
    
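One caveat: astype(float) raises an error if any value cannot be parsed. When a column mixes numbers with stray text, pd.to_numeric with errors='coerce' is a gentler option; it turns unparseable entries into NaN so you can handle them explicitly. A small sketch with illustrative values:

```python
import pandas as pd

# A price column read in as strings, with one bad entry
prices = pd.Series(['999.99', '499.50', 'N/A'])

# Coerce: unparseable values become NaN instead of raising an error
numeric_prices = pd.to_numeric(prices, errors='coerce')
print(numeric_prices.isna().sum())  # 1 bad value flagged
```

After coercing, you can treat the new NaN values with the missing-data techniques from earlier.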

Remove Duplicates

Duplicate rows can skew your analysis. They give extra weight to repeated information.

The drop_duplicates() method finds and removes them. It keeps the first instance by default.


# Check for duplicate rows
print(df.duplicated().sum())

# Remove duplicate rows
df_unique = df.drop_duplicates()
print(df_unique.shape)
    
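You can also restrict the duplicate check to certain columns with the subset parameter, and keep the last occurrence instead of the first with keep. A small sketch with made-up rows:

```python
import pandas as pd

df_dups = pd.DataFrame({
    'Customer': ['Alice', 'Alice', 'Bob'],
    'Product': ['laptop', 'laptop', 'tablet'],
    'Quantity': [1, 2, 1],
})

# Treat rows as duplicates when Customer and Product match,
# keeping the most recent (last) entry
deduped = df_dups.drop_duplicates(subset=['Customer', 'Product'], keep='last')
print(deduped)
```

This pattern is useful when later rows represent corrections to earlier ones.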

Standardize Text Data

Text data is often inconsistent. Names may have different cases or extra spaces. This causes grouping errors.

Use string methods to standardize. Methods like str.lower() and str.strip() are very helpful.


# Standardize the 'Product' column
df['Product'] = df['Product'].str.lower().str.strip()
print(df['Product'].unique())
    

['laptop' 'smartphone' 'tablet']
    

Handle Outliers

Outliers are extreme values. They can distort averages and models. You need to identify them.

A common method uses the interquartile range (IQR). Values more than 1.5*IQR below the first quartile or above the third quartile are often flagged as outliers.


# Calculate IQR for 'Quantity'
Q1 = df['Quantity'].quantile(0.25)
Q3 = df['Quantity'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_clean = df[(df['Quantity'] >= lower_bound) & (df['Quantity'] <= upper_bound)]
print(df_clean)
    

Combine Data from Multiple Sources

Data often comes from different files. You may need to merge or concatenate them. Pandas excels at this.

For working with Excel files specifically, learn to Integrate Python xlrd with pandas for Data Analysis.


# Example: Concatenate two DataFrames
df_new_data = pd.DataFrame({
    'Order ID': [1006, 1007],
    'Customer': ['Eve', 'Frank'],
    'Product': ['tablet', 'laptop'],
    'Quantity': [2, 1],
    'Price': [299.00, 999.99]
})

df_combined = pd.concat([df, df_new_data], ignore_index=True)
print(df_combined.tail())
    
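Concatenation stacks rows; merging joins columns on a shared key, much like a SQL join. A minimal sketch with two hypothetical tables that share an 'Order ID' column:

```python
import pandas as pd

orders = pd.DataFrame({
    'Order ID': [1001, 1002],
    'Product': ['laptop', 'tablet'],
})
shipping = pd.DataFrame({
    'Order ID': [1001, 1002],
    'Shipped': [True, False],
})

# Inner join on the shared key column
merged = orders.merge(shipping, on='Order ID', how='inner')
print(merged)
```

The how parameter also accepts 'left', 'right', and 'outer' when the key sets do not line up exactly.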

Conclusion

Cleaning data is a non-negotiable step in any analysis. This guide covered key techniques.

You learned to handle missing values, fix types, and remove duplicates. You also saw how to standardize text and manage outliers.

Using pandas makes this process efficient. Clean data leads to reliable insights. Start applying these steps to your projects today.