Last modified: Dec 08, 2024 By Alexander Williams

Python Pandas duplicated() Simplified

The duplicated() function in Pandas helps you find duplicate rows in your DataFrame or Series. It’s essential for data cleaning and preprocessing.

What is duplicated() in Pandas?

The duplicated() function returns a Boolean Series indicating whether each row or value is a duplicate.

Basic Syntax of duplicated()


DataFrame.duplicated(subset=None, keep='first')

Parameters:

  • subset: Columns to consider for identifying duplicates. Default is all columns.
  • keep: Determines which occurrence is left unmarked. 'first' (default) marks all but the first occurrence as duplicates, 'last' marks all but the last, and False marks every duplicated row as True.
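The effect of the keep parameter is easiest to see side by side. Here is a small sketch using a toy DataFrame (the data is illustrative, not from the sections below):

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# keep='first' (default): the first occurrence is not marked
print(df.duplicated(keep='first').tolist())  # [False, False, True]

# keep='last': the last occurrence is not marked
print(df.duplicated(keep='last').tolist())   # [True, False, False]

# keep=False: every row that has a duplicate is marked
print(df.duplicated(keep=False).tolist())    # [True, False, True]
```

keep=False is handy when you want to inspect all copies of a duplicated row, not just the later ones.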

Identifying Duplicate Rows

Here’s how to identify duplicate rows in a DataFrame:


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 35]}

df = pd.DataFrame(data)

# Identify duplicates
duplicates = df.duplicated()
print(duplicates)


0    False
1    False
2     True
3    False
dtype: bool

Dropping Duplicate Rows

Combine duplicated() with boolean indexing (the ~ negation operator) to remove duplicates:


# Drop duplicate rows
cleaned_df = df[~df.duplicated()]
print(cleaned_df)


    Name  Age
0  Alice   25
1    Bob   30
3  David   35

Using subset to Focus on Specific Columns

You can specify columns to check for duplicates:


# Check duplicates based on 'Name' column
duplicates_by_name = df.duplicated(subset=['Name'])
print(duplicates_by_name)


0    False
1    False
2     True
3    False
dtype: bool
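Combining subset with keep=False lets you pull out every row that shares a Name, which is useful for inspecting duplicates before deciding what to drop. A short sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# Mark *all* rows whose Name repeats, including the first occurrence
mask = df.duplicated(subset=['Name'], keep=False)

# Show every row involved in a Name collision (rows 0 and 2 here)
print(df[mask])
```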

Practical Applications

The duplicated() function is commonly used to:

  • Identify and handle duplicate entries in datasets.
  • Ensure data integrity before analysis.
  • Optimize datasets for processing.
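For a quick data-integrity check, you can count duplicates by summing the Boolean Series (True counts as 1). A minimal sketch with the same data as earlier sections:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# Sum the Boolean Series: True counts as 1, so this is the duplicate count
n_dupes = int(df.duplicated().sum())
print(f"{n_dupes} duplicate row(s) found")  # 1 duplicate row(s) found
```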

Learn more about cleaning data with our article on dropping rows and columns in Pandas.

Comparison with drop_duplicates()

While duplicated() marks duplicates, drop_duplicates() removes them directly. Use duplicated() when you need to analyze duplicates before removing them.
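In fact, filtering with ~df.duplicated() and calling drop_duplicates() produce the same result with default arguments, as this quick check shows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 35]})

# Both keep the first occurrence of each row and drop later copies
via_mask = df[~df.duplicated()]
via_drop = df.drop_duplicates()

print(via_mask.equals(via_drop))  # True
```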

Conclusion

The duplicated() function is a versatile tool for identifying and managing duplicate data in Pandas. It’s an essential step in data cleaning workflows.

For advanced data manipulation, check out our guide on grouping and aggregating data using groupby().