Last modified: Dec 22, 2025, by Alexander Williams

Feature Engineering Techniques in Python

Feature engineering is a key step in machine learning. It transforms raw data into useful features. This process improves model accuracy and performance.

Good features can make simple models perform well. Bad features can make complex models fail. Python offers powerful tools for this task.

This guide covers essential techniques. You will learn how to create better features. This leads to more reliable predictions.

What is Feature Engineering?

Feature engineering is the art of creating new input variables. These variables are derived from raw data. They help machine learning algorithms work better.

Think of it as preparing ingredients for a recipe. The quality of the meal depends on ingredient preparation. Similarly, model quality depends on feature preparation.

It often follows exploratory data analysis (EDA). EDA helps you understand your data first. Then you can engineer features effectively.

Handling Missing Values

Real-world data often has missing values. You must handle them before modeling. Ignoring them can cause errors.

Common strategies include imputation and deletion. Imputation fills missing values with a statistic. Deletion removes rows or columns with missing data.

Pandas provides tools for this. The fillna() method is very useful. You can fill with mean, median, or mode.


import pandas as pd
import numpy as np

# Sample data with missing values
data = {'Age': [25, np.nan, 35, 40, np.nan, 28],
        'Salary': [50000, 60000, np.nan, 80000, 55000, 52000]}
df = pd.DataFrame(data)

# Fill missing Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Salary with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)

    Age   Salary
0  25.0  50000.0
1  31.5  60000.0
2  35.0  59400.0
3  40.0  80000.0
4  31.5  55000.0
5  28.0  52000.0
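
Deletion is just as simple. Here is a minimal sketch with dropna(), rebuilding the DataFrame from the same raw data dictionary:


# Rebuild the raw DataFrame and drop any row with a missing value
df_raw = pd.DataFrame(data)
df_dropped = df_raw.dropna()
print(df_dropped)  # rows 1, 2, and 4 are removed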

Encoding Categorical Variables

Most algorithms need numerical input. You must convert text categories to numbers. This process is called encoding.

Two main methods are Label Encoding and One-Hot Encoding. Label Encoding assigns a unique number to each category, which implies an order. One-Hot Encoding creates a binary column for each category and avoids that assumption.

Scikit-learn offers the LabelEncoder and OneHotEncoder classes. Pandas has the get_dummies() function for one-hot encoding.


from sklearn.preprocessing import LabelEncoder

# Sample categorical data
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']
df_colors = pd.DataFrame({'Color': colors})

# Apply Label Encoding
le = LabelEncoder()
df_colors['Color_Encoded'] = le.fit_transform(df_colors['Color'])

print(df_colors)

   Color  Color_Encoded
0    Red              2
1   Blue              0
2  Green              1
3   Blue              0
4    Red              2
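
Pandas get_dummies() handles one-hot encoding in one call. Here is a minimal sketch on the same column; note that the dummy columns' dtype (bool or integer) depends on your pandas version:


# One-Hot Encoding: one binary column per category
dummies = pd.get_dummies(df_colors['Color'], prefix='Color')
df_onehot = pd.concat([df_colors[['Color']], dummies], axis=1)
print(df_onehot)  # columns: Color, Color_Blue, Color_Green, Color_Red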

Creating Interaction Features

Interaction features combine two or more existing features. They can capture relationships that single features miss.

For example, in real estate, price per square foot is an interaction. It combines price and area. It is often more informative than either alone.

You can create them with simple arithmetic. Multiplication and division are common operations. Basic pandas skills make this easy.


# Sample product data
df_product = pd.DataFrame({
    'Price': [10, 15, 20],
    'Quantity_Sold': [100, 80, 60]
})

# Create an interaction feature: Total Revenue
df_product['Total_Revenue'] = df_product['Price'] * df_product['Quantity_Sold']

print(df_product)

   Price  Quantity_Sold  Total_Revenue
0     10            100           1000
1     15             80           1200
2     20             60           1200
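
Division-based ratios work the same way. This sketch uses a small hypothetical housing dataset to compute the price-per-square-foot feature mentioned above:


# Hypothetical housing data for a ratio feature
df_house = pd.DataFrame({
    'Price': [300000, 450000, 250000],
    'SqFt': [1500, 2000, 1250]
})

# Ratio feature: price per square foot
df_house['Price_Per_SqFt'] = df_house['Price'] / df_house['SqFt']
print(df_house)  # 200.0, 225.0, 200.0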

Binning and Discretization

Binning converts continuous numbers into categories. This can simplify patterns and reduce noise.

Age is often binned into groups like 'Child', 'Adult', 'Senior'. This can be more meaningful for some models.

Use pandas cut() or qcut() for binning. cut() uses bin edges you define. qcut() chooses edges so each bin holds roughly the same number of values. Note that cut() includes the right bin edge by default, so a value of 35 falls in the (18, 35] bin.


# Create age data
ages = [5, 12, 25, 35, 42, 55, 70, 80]
df_age = pd.DataFrame({'Age': ages})

# Bin ages into categories
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df_age['Age_Group'] = pd.cut(df_age['Age'], bins=bins, labels=labels)

print(df_age)

   Age   Age_Group
0    5       Child
1   12       Child
2   25  Young Adult
3   35  Young Adult
4   42        Adult
5   55        Adult
6   70      Senior
7   80      Senior
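
For quantile-based bins, here is a minimal sketch with qcut() on the same ages. With eight values and q=4, each bin holds two ages:


# Quantile-based binning: four bins with roughly equal counts
df_age['Age_Quartile'] = pd.qcut(df_age['Age'], q=4,
                                 labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df_age[['Age', 'Age_Quartile']])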

Scaling and Normalization

Features often have different scales. Salary might be in thousands. Age is usually under 100. This can bias algorithms that rely on distances or gradients, such as k-nearest neighbors.

Scaling adjusts features to a common range. Min-max normalization is one type of scaling. It brings data to a 0-1 range.

Scikit-learn provides MinMaxScaler and StandardScaler. Always fit the scaler on training data only. Then transform both training and test data.


from sklearn.preprocessing import MinMaxScaler

# Sample data with different scales
data = {'Income': [20000, 50000, 80000, 110000],
        'Age': [22, 35, 47, 60]}
df_scale = pd.DataFrame(data)

# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_scale)
df_scaled = pd.DataFrame(scaled_data, columns=['Income_Scaled', 'Age_Scaled'])

print(df_scaled)

   Income_Scaled  Age_Scaled
0       0.000000    0.000000
1       0.333333    0.342105
2       0.666667    0.657895
3       1.000000    1.000000
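
Here is a minimal sketch of that leakage-safe train/test pattern, using StandardScaler and a hypothetical split:


from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split of a single feature
train = pd.DataFrame({'Income': [20000, 50000, 80000]})
test = pd.DataFrame({'Income': [110000]})

# Fit on training data only, then transform both sets
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)  # reuses training mean and std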

Extracting Date Features

Dates are rich with information. You can extract day, month, year, and weekday. This reveals trends like weekend sales spikes.

Pandas makes this easy. The dt accessor has many properties. You can get dayofweek, month, and more.

This technique is vital for time series data. It helps models understand temporal patterns. It's a key part of most pandas data analysis workflows.


# Create a date series
dates = pd.Series(pd.date_range(start='2023-01-01', periods=5, freq='D'))
df_dates = pd.DataFrame({'Original_Date': dates})

# Extract features
df_dates['Year'] = df_dates['Original_Date'].dt.year
df_dates['Month'] = df_dates['Original_Date'].dt.month
df_dates['Day'] = df_dates['Original_Date'].dt.day
df_dates['Weekday'] = df_dates['Original_Date'].dt.day_name()

print(df_dates)

  Original_Date  Year  Month  Day   Weekday
0    2023-01-01  2023      1    1    Sunday
1    2023-01-02  2023      1    2    Monday
2    2023-01-03  2023      1    3   Tuesday
3    2023-01-04  2023      1    4 Wednesday
4    2023-01-05  2023      1    5  Thursday
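
To capture the weekend effect mentioned earlier, a simple binary flag is often enough. dt.dayofweek returns 0 for Monday through 6 for Sunday:


# Flag weekend days (Saturday=5, Sunday=6)
df_dates['Is_Weekend'] = df_dates['Original_Date'].dt.dayofweek >= 5
print(df_dates[['Original_Date', 'Weekday', 'Is_Weekend']])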

Conclusion

Feature engineering is a creative and critical process. It turns raw data into model-ready features. This directly impacts your model's success.

We covered handling missing values and encoding categories. We also looked at creating interactions and binning. Scaling and date extraction round out the toolkit.

Start with simple techniques. Experiment to see what works for your data. Good feature engineering is an iterative process.

Use Python's pandas and scikit-learn libraries. They provide all the tools you need. Remember, better features often beat fancier algorithms.