Last modified: Nov 10, 2024 By Alexander Williams

Python CSV Data Validation: Clean and Process Data Efficiently

Data validation and cleaning are crucial steps when working with CSV files. In this guide, we'll explore how to ensure your CSV data is accurate, consistent, and ready for analysis using Python.

1. Basic CSV Data Validation

Before diving into complex validation, let's start with basic checks using Python's built-in csv module. Handling CSV errors properly at this stage helps you avoid common pitfalls later.


import csv

def validate_csv_structure(filename):
    try:
        # newline='' is the recommended open() mode for the csv module
        with open(filename, 'r', newline='') as file:
            reader = csv.reader(file)
            header = next(reader)  # first row defines the expected width
            expected_columns = len(header)
            
            # Data rows start at line 2, after the header
            for row_num, row in enumerate(reader, 2):
                if len(row) != expected_columns:
                    print(f"Error in row {row_num}: expected {expected_columns} columns, got {len(row)}")
                    
    except StopIteration:
        print("Error: file is empty")
    except Exception as e:
        print(f"Error: {str(e)}")

2. Handling Missing Values

Handling missing data in CSV files is essential for data integrity. Here's how to identify and handle missing values using pandas:


import pandas as pd

def clean_missing_values(filename):
    df = pd.read_csv(filename)
    
    # Report the number of missing values in each column
    print("Missing values count:")
    print(df.isnull().sum())
    
    # Fill missing values: the column mean for numeric data, a placeholder for text
    # ('numeric_column' and 'string_column' stand in for your own column names)
    df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())
    df['string_column'] = df['string_column'].fillna('Unknown')
    
    return df
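
If a row is mostly empty, filling it in may not be meaningful. As an alternative, here is a minimal sketch of dropping sparse rows with dropna (the threshold is an arbitrary example value):

def drop_sparse_rows(df, min_non_null=2):
    # Keep only rows that have at least min_non_null non-missing values
    return df.dropna(thresh=min_non_null)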

3. Data Type Validation

Ensuring correct data types is crucial for analysis. Here's how to validate and convert data types:


def validate_data_types(df):
    # Define the expected dtype for each column
    type_dict = {
        'age': 'int64',
        'name': 'object',
        'salary': 'float64'
    }
    
    for column, expected_type in type_dict.items():
        try:
            # astype raises if any value cannot be converted
            # (note: int64 conversion fails if the column still contains NaN)
            df[column] = df[column].astype(expected_type)
        except Exception as e:
            print(f"Error converting {column}: {str(e)}")
            
    return df
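
If you'd rather flag bad values than abort the conversion, pd.to_numeric with errors='coerce' turns unparseable entries into NaN. A minimal sketch, assuming a numeric column such as 'salary' from the example above:

import pandas as pd

def coerce_numeric(df, column):
    # Unparseable entries become NaN instead of raising an exception
    df[column] = pd.to_numeric(df[column], errors='coerce')
    bad = df[column].isnull().sum()
    print(f"{column}: {bad} values are missing or non-numeric")
    return df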

4. Data Range and Pattern Validation

Validate data ranges and patterns to ensure data quality. This is particularly important when dealing with CSV files that contain special characters or unusual encodings.


import re

def validate_patterns(df):
    # Flag rows whose email matches a basic address pattern
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # str(x) guards against missing values, which pandas stores as floats
    df['valid_email'] = df['email'].apply(lambda x: bool(re.match(email_pattern, str(x))))
    
    # Flag ages within a plausible human range
    df['valid_age'] = df['age'].between(0, 120)
    
    return df
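
pandas also offers vectorized string matching, which avoids the per-row lambda. Here's the same email check as a sketch using str.match (na=False marks missing emails as invalid):

def validate_emails_vectorized(df):
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # Apply the regex to the whole column at once
    df['valid_email'] = df['email'].str.match(email_pattern, na=False)
    return df

As for encodings, if a file won't read cleanly, passing an explicit encoding to pd.read_csv often helps. For example, 'utf-8-sig' strips a leading byte-order mark, a common source of garbled headers:

import pandas as pd

df = pd.read_csv('input.csv', encoding='utf-8-sig')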

5. Removing Duplicates

Duplicate entries can skew your analysis. Here's how to identify and remove them:


def remove_duplicates(df):
    # Check for duplicates
    duplicate_count = df.duplicated().sum()
    print(f"Found {duplicate_count} duplicate rows")
    
    # Remove duplicates
    df_clean = df.drop_duplicates()
    
    return df_clean
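
drop_duplicates can also compare only a subset of columns. For example, to treat rows with the same email as duplicates and keep the first occurrence (the column name is illustrative):

# Deduplicate on a single column, keeping the first occurrence
df_clean = df.drop_duplicates(subset=['email'], keep='first')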

6. Complete Example

Here's a complete example combining all validation steps. Remember that the column names inside each helper ('numeric_column', 'age', 'email', and so on) are placeholders; adjust them to match your own data before running the pipeline:


import pandas as pd
import csv
import re

def validate_and_clean_csv(filename):
    # Check the raw file structure before loading (section 1)
    validate_csv_structure(filename)
    
    # Read the CSV and fill missing values
    # (clean_missing_values reads the file itself)
    df = clean_missing_values(filename)
    
    # Validate data types
    df = validate_data_types(df)
    
    # Validate patterns
    df = validate_patterns(df)
    
    # Remove duplicates
    df = remove_duplicates(df)
    
    # Save cleaned data
    df.to_csv('cleaned_data.csv', index=False)
    
    return df

# Usage
df = validate_and_clean_csv('input.csv')
print("Data cleaning completed!")

After cleaning, you might want to convert your cleaned CSV to Excel format for better visualization and analysis.
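
A minimal sketch of that conversion with pandas (to_excel needs the openpyxl package for .xlsx files):

import pandas as pd

df = pd.read_csv('cleaned_data.csv')
df.to_excel('cleaned_data.xlsx', index=False)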

Conclusion

Data validation and cleaning are essential steps in any data processing pipeline. Using these techniques ensures your CSV data is reliable and ready for analysis.

For more complex scenarios, consider using pandas instead of the built-in csv module, as it provides more powerful data manipulation capabilities.