Last modified: Dec 08, 2024 By Alexander Williams
Python Pandas drop_duplicates() Simplified
The drop_duplicates() function in Pandas allows you to remove duplicate rows from a DataFrame, ensuring clean and consistent datasets.
What is drop_duplicates()?
drop_duplicates() removes duplicate rows based on the values in specified columns or the entire DataFrame.
Basic Syntax of drop_duplicates()
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
- subset: Specify columns to check for duplicates. Default is all columns.
- keep: Which occurrence to keep. Options: 'first', 'last', or False (drop all duplicates). Default is 'first'.
- inplace: Whether to modify the DataFrame in place. Default is False.
Removing Duplicate Rows
Here’s a basic example of removing duplicate rows:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)
# Remove duplicate rows
cleaned_df = df.drop_duplicates()
print(cleaned_df)
Name Age
0 Alice 25
1 Bob 30
3 David 35
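Notice that the result keeps the original index labels (0, 1, 3). If a consecutive index is preferred, pandas 1.0 and later also accept an ignore_index parameter, which is not listed in the basic syntax above. A minimal sketch, re-creating the same sample data:

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# ignore_index=True relabels the remaining rows 0, 1, 2
# (assumes pandas 1.0 or newer)
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Calling reset_index(drop=True) on the result is an alternative that should give the same index.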
Using subset to Specify Columns
You can focus on specific columns to identify duplicates:
# Remove duplicates based on 'Name' column
unique_names = df.drop_duplicates(subset=['Name'])
print(unique_names)
Name Age
0 Alice 25
1 Bob 30
3 David 35
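With this sample data the result matches the previous example, because both 'Alice' rows are identical in every column. The effect of subset is easier to see when rows share a name but differ elsewhere. The sketch below uses a hypothetical DataFrame (df2, not part of the original example) and combines subset with keep='last':

```python
import pandas as pd

# Hypothetical data: two 'Alice' rows with different ages
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                    'Age': [25, 30, 26]})

# Checking all columns finds no duplicates, so nothing is removed
print(len(df2.drop_duplicates()))  # 3

# Checking only 'Name' treats both 'Alice' rows as duplicates;
# keep='last' retains the later one (Age 26)
latest = df2.drop_duplicates(subset=['Name'], keep='last')
print(latest)
```

Here the rows with index 1 (Bob) and 2 (the second Alice) remain, in their original order.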
Dropping All Duplicates
To drop all occurrences of duplicate rows, use keep=False:
# Drop all duplicates
no_duplicates = df.drop_duplicates(keep=False)
print(no_duplicates)
Name Age
1 Bob 30
3 David 35
Using inplace=True
Modify the DataFrame without creating a new one:
# Remove duplicates in place
df.drop_duplicates(inplace=True)
print(df)
Name Age
0 Alice 25
1 Bob 30
3 David 35
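One thing to watch for: with inplace=True the method returns None, so assigning its result to a variable does not give you the cleaned DataFrame. A short sketch of the pitfall, re-creating the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# With inplace=True the call returns None and changes df itself
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 3
```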
Practical Applications
Use drop_duplicates() to:
- Prepare data for machine learning models.
- Clean datasets for better visualization.
- Ensure data consistency in databases.
Check out our related guide on identifying duplicates with duplicated().
Key Differences: duplicated() vs drop_duplicates()
duplicated() returns a boolean Series that marks duplicate rows without removing them, while drop_duplicates() returns a DataFrame with those rows removed.
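To make the relationship concrete, the sketch below filters with the boolean mask from duplicated() and compares the result to drop_duplicates(). It re-creates the sample data from earlier; the variable names mask and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# duplicated() flags each row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False]

# Keeping the rows where the mask is False gives the same result
# as drop_duplicates() with the default keep='first'
filtered = df[~mask]
print(filtered.equals(df.drop_duplicates()))  # True
```

The mask form is useful when you want to inspect or count the duplicates before deciding to drop them.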
Conclusion
Understanding how to use drop_duplicates() is crucial for data preprocessing. Removing repeated rows helps keep your datasets clean and ready for analysis.
For advanced data manipulation, check out our article on grouping and aggregating data with groupby().