Last modified: Nov 10, 2024 By Alexander Williams
How to Merge Multiple CSV Files in Python - Complete Guide
Merging multiple CSV files is a common task in data processing. Python offers several approaches to combine CSV files efficiently, whether you're working with small or large datasets.
Method 1: Using Pandas DataFrame
The pandas library provides powerful tools for handling CSV files. Here's how to merge multiple CSV files using pandas
:
import pandas as pd
import glob
# Get all CSV files in the directory
csv_files = glob.glob('*.csv')
# Create empty list to store dataframes
dfs = []
# Read each CSV file and append to list
for file in csv_files:
df = pd.read_csv(file)
dfs.append(df)
# Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)
# Save to new CSV file
combined_df.to_csv('combined_file.csv', index=False)
Method 2: Using CSV Module
For simpler merging tasks, the built-in CSV module can be effective. This method is particularly useful when dealing with memory constraints.
import csv
import glob
# Get all CSV files
csv_files = glob.glob('*.csv')
# Read the first file to get headers
with open(csv_files[0], 'r') as f:
headers = csv.reader(f).next()
# Write to output file
with open('combined_file.csv', 'w', newline='') as output:
writer = csv.writer(output)
writer.writerow(headers)
# Process each file
for file in csv_files:
with open(file, 'r') as input:
reader = csv.reader(input)
next(reader) # Skip header
for row in reader:
writer.writerow(row)
Handling Large Files
When dealing with large CSV files, it's important to consider memory usage. Here's an optimized approach using chunks:
import pandas as pd
# Create empty DataFrame for the first file
combined_df = pd.DataFrame()
# Process files in chunks
for file in csv_files:
chunks = pd.read_csv(file, chunksize=10000)
for chunk in chunks:
combined_df = pd.concat([combined_df, chunk], ignore_index=True)
combined_df.to_csv('combined_file.csv', index=False)
Handling Missing Data
When merging files, you might encounter missing data. Here's how to handle such cases:
import pandas as pd
dfs = []
for file in csv_files:
df = pd.read_csv(file)
df = df.fillna('') # Replace NaN with empty string
dfs.append(df)
combined_df = pd.concat(dfs, ignore_index=True)
Filtering and Sorting
After merging, you might want to filter or sort the data:
# Filter rows
filtered_df = combined_df[combined_df['column_name'] > 100]
# Sort by column
sorted_df = combined_df.sort_values('column_name')
# Save final result
sorted_df.to_csv('final_combined.csv', index=False)
Conclusion
Merging CSV files in Python can be accomplished through various methods. Choose the appropriate approach based on your file sizes and specific requirements.
Remember to validate your data after merging and consider using error handling for robust solutions. For complex operations, pandas is typically the recommended choice.