Last modified: Dec 08, 2024 By Alexander Williams

Python Pandas drop_duplicates() Simplified

The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame, a routine step when cleaning a dataset before analysis.

What is drop_duplicates()?

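drop_duplicates() removes rows whose values repeat an earlier row, comparing either all columns or only the columns you specify. By default it keeps the first occurrence of each duplicate and returns a new DataFrame, leaving the original unchanged.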

Basic Syntax of drop_duplicates()


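DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters:

  • subset: Column label or list of labels to check for duplicates. Default is None, which compares all columns.
  • keep: Which occurrence to keep: 'first' (default), 'last', or False (drop every row that has a duplicate).
  • inplace: Whether to modify the DataFrame in place and return None. Default is False, which returns a new DataFrame.
  • ignore_index: Whether to relabel the result's index as 0, 1, ..., n-1. Default is False, which keeps the original index labels. Available in pandas 1.0 and later.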
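The keep parameter is easiest to understand by comparing which index labels survive. The sketch below uses the same sample data as the examples in this article. The variable names first and last are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# keep='first' (the default) retains the earliest occurrence: row 0 survives, row 2 is dropped
first = df.drop_duplicates(keep='first')

# keep='last' retains the latest occurrence instead: row 2 survives, row 0 is dropped
last = df.drop_duplicates(keep='last')

print(first.index.tolist())  # [0, 1, 3]
print(last.index.tolist())   # [1, 2, 3]
```

Both results contain the same values. Only the surviving index labels and the row order differ.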

Removing Duplicate Rows

Here’s a basic example of removing duplicate rows:


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 35]}

df = pd.DataFrame(data)

# Remove duplicate rows
cleaned_df = df.drop_duplicates()
print(cleaned_df)


    Name  Age
0  Alice   25
1    Bob   30
3  David   35
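Note that the output keeps the original index labels (0, 1, 3), so there is a gap where row 2 was removed. If a continuous index is preferable, one option is to reset it afterwards. The sketch below shows two ways that should give the same result, assuming pandas 1.0 or later for ignore_index.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# Option 1: drop duplicates, then discard the old index labels
renumbered = df.drop_duplicates().reset_index(drop=True)

# Option 2: let drop_duplicates relabel the index itself (pandas 1.0+)
relabelled = df.drop_duplicates(ignore_index=True)

print(renumbered.index.tolist())  # [0, 1, 2]
```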

Using subset to Specify Columns

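Pass subset to compare only certain columns. Rows that match on those columns count as duplicates even if their other columns differ. In this sample the two 'Alice' rows are identical in every column, so the result is the same as before: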


# Remove duplicates based on 'Name' column
unique_names = df.drop_duplicates(subset=['Name'])
print(unique_names)


    Name  Age
0  Alice   25
1    Bob   30
3  David   35
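The effect of subset is clearer when the rows differ outside the chosen columns. The following sketch uses a small hypothetical DataFrame (people, not part of the example above) in which the two 'Alice' rows have different ages.

```python
import pandas as pd

# Hypothetical data: the two 'Alice' rows differ in 'Age'
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                       'Age': [25, 30, 41]})

# Comparing all columns: no two rows are fully identical, so nothing is dropped
all_columns = people.drop_duplicates()

# Comparing only 'Name': the second 'Alice' row (index 2) is dropped
by_name = people.drop_duplicates(subset=['Name'])

print(len(all_columns))        # 3
print(by_name.index.tolist())  # [0, 1]
```

Because keep defaults to 'first', the row with Age 25 is kept and the row with Age 41 is discarded. Passing keep='last' would keep the other one.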

Dropping All Duplicates

To drop every row that has a duplicate, including the first occurrence, use keep=False:


# Drop all duplicates
no_duplicates = df.drop_duplicates(keep=False)
print(no_duplicates)


    Name  Age
1    Bob   30
3  David   35

Using inplace=True

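Pass inplace=True to modify the existing DataFrame instead of returning a new one. The call then returns None, so do not assign its result to a variable: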


# Remove duplicates in place
df.drop_duplicates(inplace=True)
print(df)


    Name  Age
0  Alice   25
1    Bob   30
3  David   35
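A common mistake with inplace=True is assigning the return value back to a variable. The short sketch below illustrates why that goes wrong. The variable name result is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# With inplace=True the method mutates df and returns None
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 3
```

Writing df = df.drop_duplicates(inplace=True) would therefore replace the DataFrame with None. Either use inplace=True without assignment, or assign the result of a call without inplace.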

Practical Applications

Use drop_duplicates() to:

  • Remove repeated samples before training machine learning models.
  • Avoid double-counting records in charts and summary statistics.
  • Deduplicate records before loading them into a database.

Check out our related guide on identifying duplicates with duplicated().

Key Differences: duplicated() vs drop_duplicates()

duplicated() marks duplicates without removing them, while drop_duplicates() eliminates them directly.
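The relationship between the two methods can be sketched as follows, using the same sample data as above. duplicated() returns a boolean Series, and filtering with its negation should reproduce what drop_duplicates() returns with default arguments.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# duplicated() flags each row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False]

# Keeping only the unflagged rows matches drop_duplicates()
filtered = df[~mask]
print(filtered.equals(df.drop_duplicates()))  # True
```

Inspecting the mask first is a reasonable way to review which rows would be removed before actually dropping them.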

Conclusion

drop_duplicates() is a staple of data preprocessing. With subset, keep, and inplace you control which columns define a duplicate, which occurrence survives, and whether the original DataFrame is modified.

For advanced data manipulation, check out our article on grouping and aggregating data with groupby().