Last modified: Dec 04, 2024 By Alexander Williams

Python Pandas merge(): A Comprehensive Guide

The pandas.merge() function in Python is used to combine two DataFrames based on a common column or index. It provides flexibility and a variety of options to merge datasets in different ways. This guide will explain how to use merge(), with examples and common scenarios.

What is pandas.merge()?

The pandas.merge() function allows you to join two DataFrames based on one or more columns. This operation is essential when working with structured data, where different datasets may contain related information across various columns.

Syntax of pandas.merge()

The basic syntax of the merge() function is as follows:


pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, 
            left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'))

Here are the most important parameters:

  • left: The first DataFrame to merge.
  • right: The second DataFrame to merge.
  • how: Type of join (e.g., 'inner', 'outer', 'left', 'right').
  • on: Column(s) to join on (if not specified, defaults to the intersection of column names).
  • left_on and right_on: Columns to join on from the left and right DataFrames, respectively.
  • left_index and right_index: Use the index (instead of columns) for joining.
  • suffixes: Suffixes to apply to overlapping column names.

Types of Joins in pandas.merge()

The how parameter defines the type of join to perform. The most common types are:

  • inner: Returns only the rows where there is a match in both DataFrames.
  • outer: Returns all rows, with matching rows from both sides. Missing values are filled with NaN.
  • left: Returns all rows from the left DataFrame, with matching rows from the right DataFrame. Non-matching rows will have NaN for missing values.
  • right: Returns all rows from the right DataFrame, with matching rows from the left DataFrame. Non-matching rows will have NaN for missing values.

Example: Inner Join

Let's see an example where we perform an inner join. An inner join returns only the rows that have matching values in both DataFrames.


import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'age': [24, 27, 30]})

# Merge DataFrames using inner join
result = pd.merge(df1, df2, on='id', how='inner')
print(result)

Output:


   id     name  age
0   2      Bob   24
1   3  Charlie   27

In the output, only rows with matching 'id' values from both DataFrames are returned.

Example: Left Join

Next, let's see how a left join works. A left join returns all rows from the left DataFrame, and matching rows from the right DataFrame.


# Merge DataFrames using left join
result = pd.merge(df1, df2, on='id', how='left')
print(result)

Output:


   id     name   age
0   1    Alice   NaN
1   2      Bob  24.0
2   3  Charlie  27.0

Here, the row with 'id' = 1 from the left DataFrame does not have a match in the right DataFrame, so the 'age' value is NaN.

Using Multiple Columns for Merging

You can also merge DataFrames using multiple columns. To do this, pass a list of column names to the on parameter.


# Create DataFrames with multiple columns
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'dept': ['HR', 'IT', 'Finance']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David'], 'salary': [70000, 80000, 90000]})

# Merge DataFrames on both 'id' and 'name'
result = pd.merge(df1, df2, on=['id', 'name'], how='inner')
print(result)

Output:


   id     name     dept  salary
0   2      Bob      IT   70000
1   3  Charlie  Finance   80000

The merge is done on both the 'id' and 'name' columns, so rows must match both columns to be included in the result.

Handling Overlapping Column Names

When merging DataFrames, you might encounter columns with the same name in both DataFrames. To handle this, you can use the suffixes parameter to add suffixes to the overlapping column names.


# Create DataFrames with overlapping column names
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob'], 'salary': [50000, 60000]})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie'], 'salary': [70000, 80000]})

# Merge DataFrames with suffixes for overlapping column names
result = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'), how='inner')
print(result)

Output:


   id name_left  salary_left name_right  salary_right
0   2       Bob        60000        Bob        70000

In the output, the 'salary' column is renamed to 'salary_left' and 'salary_right' to avoid confusion.

Conclusion

The pandas.merge() function is a powerful tool for combining DataFrames in Python. It supports different types of joins, can merge on multiple columns, and allows for handling overlapping column names. Mastering merge() is essential for working with relational data in Python and will help you perform complex data manipulations efficiently.

For more information, you can check out related topics like resetting DataFrame indexes, setting DataFrame indexes, and removing DataFrame rows/columns.