Last modified: Dec 22, 2025 By Alexander Williams
Clean Data with Python: A Step-by-Step Guide
Data is the fuel for modern analysis. But raw data is often messy. Cleaning it is a crucial first step. This guide shows you how to do it with Python.
We will use the powerful pandas library. It provides tools to handle common data issues. Let's start by importing it.
# Import the pandas library
import pandas as pd
Load Your Data
First, you need to get your data into Python. Pandas can read many formats. Common ones are CSV and Excel files.
For a deeper dive into data manipulation, see our Master Data Analysis with Pandas Python Guide.
# Load data from a CSV file
df = pd.read_csv('sales_data.csv')
print(df.head())
Order ID Customer Product Quantity Price
0 1001 Alice Laptop 1 999.99
1 1002 Bob Smartphone 2 499.50
2 1003 Charlie Tablet 1 299.00
3 1004 Diana Laptop 1 999.99
4 1005 NaN Smartphone 3 499.50
Handle Missing Values
Missing data is a common problem. It appears as NaN in pandas. You must decide how to handle it.
You can remove rows with missing values. Or you can fill them with a statistic. Use dropna() or fillna().
# Check for missing values
print(df.isnull().sum())
# Option 1: Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped.shape)
# Option 2: Fill missing customer names with 'Unknown'
df_filled = df.fillna({'Customer': 'Unknown'})
print(df_filled.head())
Order ID 0
Customer 1
Product 0
Quantity 0
Price 0
dtype: int64
(4, 5)
Order ID Customer Product Quantity Price
0 1001 Alice Laptop 1 999.99
1 1002 Bob Smartphone 2 499.50
2 1003 Charlie Tablet 1 299.00
3 1004 Diana Laptop 1 999.99
4 1005 Unknown Smartphone 3 499.50
Fix Data Types
Columns sometimes have the wrong data type. A price might be read as text. This stops math operations.
Use the astype() method to convert types. Ensure your data is numeric for calculations.
# Check current data types
print(df.dtypes)
# Convert 'Price' to float (if it was a string)
df['Price'] = df['Price'].astype(float)
print(df.dtypes)
Order ID int64
Customer object
Product object
Quantity int64
Price float64
dtype: object
Remove Duplicates
Duplicate rows can skew your analysis. They give extra weight to repeated information.
The drop_duplicates() method finds and removes them. It keeps the first instance by default.
# Check for duplicate rows
print(df.duplicated().sum())
# Remove duplicate rows
df_unique = df.drop_duplicates()
print(df_unique.shape)
Standardize Text Data
Text data is often inconsistent. Names may have different cases or extra spaces. This causes grouping errors.
Use string methods to standardize. Methods like str.lower() and str.strip() are very helpful.
# Standardize the 'Product' column
df['Product'] = df['Product'].str.lower().str.strip()
print(df['Product'].unique())
['laptop' 'smartphone' 'tablet']
Handle Outliers
Outliers are extreme values. They can distort averages and models. You need to identify them.
A common method is using the Interquartile Range (IQR). Values outside 1.5*IQR are often considered outliers.
# Calculate IQR for 'Quantity'
Q1 = df['Quantity'].quantile(0.25)
Q3 = df['Quantity'].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out outliers
df_clean = df[(df['Quantity'] >= lower_bound) & (df['Quantity'] <= upper_bound)]
print(df_clean)
Combine Data from Multiple Sources
Data often comes from different files. You may need to merge or concatenate them. Pandas excels at this.
For working with Excel files specifically, learn to Integrate Python xlrd with pandas for Data Analysis.
# Example: Concatenate two DataFrames
df_new_data = pd.DataFrame({
'Order ID': [1006, 1007],
'Customer': ['Eve', 'Frank'],
'Product': ['tablet', 'laptop'],
'Quantity': [2, 1],
'Price': [299.00, 999.99]
})
df_combined = pd.concat([df, df_new_data], ignore_index=True)
print(df_combined.tail())
Conclusion
Cleaning data is a non-negotiable step in any analysis. This guide covered key techniques.
You learned to handle missing values, fix types, and remove duplicates. You also saw how to standardize text and manage outliers.
Using pandas makes this process efficient. Clean data leads to reliable insights. Start applying these steps to your projects today.