Machine Learning Basics: Improving Model Performance with Feature Engineering

10 min read

Introduction

Welcome back to our Machine Learning Basics series! In our previous post, we built a simple linear regression model that achieved an R-squared score of only 0.123. While this gave us a good foundation, the model's predictive power was quite limited.

In this tutorial, we'll explore feature engineering - one of the most powerful techniques in machine learning. By creating new features from existing data, we'll dramatically improve our model's performance from an R-squared of 0.123 to 0.862!

What You'll Learn

By the end of this tutorial, you'll understand:

  • What feature engineering is and why it matters

  • How to create new features from domain knowledge

  • The concept of interaction features

  • How feature engineering can dramatically improve model performance

  • The importance of data visualization in feature discovery

What is Feature Engineering?

Feature Engineering is the process of using domain knowledge to create new features (variables) from existing data that make machine learning algorithms work better. It's often considered more of an art than a science, requiring creativity and understanding of the problem domain.

Good features can:

  • Capture important patterns in the data

  • Make relationships more apparent to the model

  • Significantly improve model accuracy

Getting Started

Let's begin by loading our dataset and necessary libraries:

# Install required modules and load the insurance dataset
!pip install pandas seaborn matplotlib numpy
import pandas as pd
import numpy as np
!curl -O https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv
df = pd.read_csv('insurance.csv')
df.head()

We're using the same insurance dataset from the previous tutorial. Let's quickly remind ourselves what it contains.

Output:

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Verifying Data Structure

Before we start engineering features, let's verify our dataset structure:

# Check the dataset structure and data count
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Perfect! We have 1,338 records with no missing values.

Data Quality Check

Let's verify that our dataset only contains adult records, since this is health insurance data:

# Check whether any records are for minors (age < 18)
print(df[df['age'] < 18])

Output:

Empty DataFrame
Columns: [age, sex, bmi, children, smoker, region, charges]
Index: []

Good! All records are for adults (age 18 and above), which makes sense for individual health insurance policies.

Feature Engineering: Creating the Obesity Flag

Now comes the exciting part - creating new features! Our first engineered feature will be an obesity flag based on medical guidelines.

According to the World Health Organization (WHO):

  • Overweight: BMI ≥ 25

  • Obese: BMI ≥ 30

Let's create this feature along with converting our categorical variables to numerical format:

# Convert categorical columns to numeric: 'male'/'yes' to 1, 'female'/'no' to 0
df['sex'] = df['sex'].replace({'male': 1, 'female': 0}).astype('int8')
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 0}).astype('int8')

# Let's add a flag for obesity
# As per WHO [https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight]
# For adults, WHO defines overweight and obesity as follows:
# overweight is a BMI greater than or equal to 25; and
# obesity is a BMI greater than or equal to 30.

# Use np.where to apply the conditional logic:
# Condition: df['bmi'] >= 30
# Value if True: 1
# Value if False: 0

df['obese'] = np.where(df['bmi'] >= 30, 1, 0).astype('int8')

# Print the modified DataFrame to show the result
print("\nDataFrame after converting 'male' to 1 and 'female' to 0:")
print(df)

Output:

DataFrame after converting 'male' to 1 and 'female' to 0:
      age  sex     bmi  children  smoker     region      charges  obese
0      19    0  27.900         0       1  southwest  16884.92400      0
1      18    1  33.770         1       0  southeast   1725.55230      1
2      28    1  33.000         3       0  southeast   4449.46200      1
3      33    1  22.705         0       0  northwest  21984.47061      0
4      32    1  28.880         0       0  northwest   3866.85520      0
...   ...  ...     ...       ...     ...        ...          ...    ...
1333   50    1  30.970         3       0  northwest  10600.54830      1
1334   18    0  31.920         0       0  northeast   2205.98080      1
1335   18    0  36.850         0       0  southeast   1629.83350      1
1336   21    0  25.800         0       0  southwest   2007.94500      0
1337   61    0  29.070         0       1  northwest  29141.36030      0

[1338 rows x 8 columns]

Notice our new obese column! We've now converted the continuous BMI variable into a binary flag. This can sometimes help models capture non-linear relationships more effectively.

Why Create an Obesity Flag?

While we already have BMI as a continuous variable, creating a binary obesity flag can help because:

  • Medical research shows obesity (BMI ≥ 30) is a distinct risk category

  • It captures a threshold effect that might be harder for linear models to detect

  • It's based on domain knowledge from healthcare
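As a side note, the same WHO thresholds can also be turned into a multi-level category instead of a single flag. A minimal sketch using pandas' `pd.cut` on a few hypothetical BMI values (the bin labels are our own naming):

```python
import pandas as pd

# Hypothetical BMI values to illustrate binning at the WHO thresholds
bmi = pd.Series([17.5, 22.0, 27.9, 33.8])

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
category = pd.cut(bmi, bins=[0, 18.5, 25, 30, float('inf')],
                  labels=['underweight', 'normal', 'overweight', 'obese'],
                  right=False)
print(category.tolist())
# ['underweight', 'normal', 'overweight', 'obese']
```

A binary flag keeps the linear model simple; a multi-level category like this would need one-hot encoding before training.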

Visualizing Relationships

Let's explore how age and charges are related with some visualizations. This helps us understand our data and discover potential new features.

# Explore charges vs age data
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

df.plot(kind='scatter', x='age', y='charges', figsize=(10, 5)).set_title("Charges vs Age")

This scatter plot shows how insurance charges vary with age. Notice the distinct clusters - this suggests there might be important categorical factors affecting charges.

Discovering the Smoking Impact

Let's visualize how smoking status affects the relationship between age and charges:

# Explore the impact of age and smoking
g = sns.pairplot(data=df[['age', 'sex', 'bmi', 'children', 'smoker', 'charges']],
                 x_vars=['age'], y_vars=['charges'], aspect=1.5, hue='smoker')
g.fig.set_size_inches(10, 5)
plt.title("Impact of age and smoking on charges")

This visualization is revealing! We can see two distinct clusters:

  • Non-smokers (blue): Lower charges that increase gradually with age

  • Smokers (orange): Significantly higher charges with steeper age-related increases

This suggests that smoking has a major impact on insurance charges, and this impact might vary with age.

Exploring the Obesity Effect

Now let's examine how obesity affects the relationship:

# Explore the impact of age and obesity on charges
g = sns.pairplot(data=df[['age', 'sex', 'bmi', 'obese', 'children', 'smoker', 'charges']],
                 x_vars=['age'], y_vars=['charges'], aspect=1.5, hue='obese')
g.fig.set_size_inches(10, 5)
plt.title("Impact of age and obesity on charges")

Obesity also shows a clear effect on insurance charges, though perhaps not as pronounced as smoking.

Creating an Interaction Feature

Here's where feature engineering gets really powerful. We noticed that both smoking and obesity affect charges. But what about people who are both smokers and obese? This combination might have an amplified effect.

This is called an interaction feature - a new feature created by combining two or more existing features to capture their combined effect.

# Let's create a new feature: the product of the smoker and obese flags
df['smoker_obese'] = df['smoker'] * df['obese']
print("Number of customers who are both obese and smoke: ", df[df.smoker_obese == 1].shape[0])
print("Total number of customers: ", df.shape[0])

Output:

Number of customers who are both obese and smoke:  145
Total number of customers:  1338

About 10% of customers are both smokers and obese. This is a high-risk group that likely has significantly higher insurance charges.

What is an Interaction Feature?

An interaction feature captures the combined effect of two or more features. The mathematical operation here is multiplication:

  • If someone is obese (1) AND a smoker (1): smoker_obese = 1 × 1 = 1

  • If someone is only obese: smoker_obese = 1 × 0 = 0

  • If someone is only a smoker: smoker_obese = 0 × 1 = 0

  • If neither: smoker_obese = 0 × 0 = 0

This allows the model to assign a separate coefficient to this high-risk combination.
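The truth table above can be reproduced on a tiny hypothetical frame covering all four smoker/obese combinations:

```python
import pandas as pd

# Four hypothetical rows, one per smoker/obese combination
toy = pd.DataFrame({'smoker': [1, 1, 0, 0],
                    'obese':  [1, 0, 1, 0]})
toy['smoker_obese'] = toy['smoker'] * toy['obese']
print(toy)
# Only the smoker=1, obese=1 row ends up with smoker_obese=1
```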

Preparing Features for Training

Now let's select our features for model training. Notice we're including our newly engineered features:

# Let's create one DataFrame with the features and one Series with the target
x = df[['age', 'bmi', 'sex', 'children', 'smoker', 'obese', 'smoker_obese']]
y = df['charges']

Our feature set now includes:

  • Original features: age, bmi, sex, children, smoker

  • Engineered features: obese, smoker_obese

Training the Improved Model

Let's train a linear regression model with our enhanced feature set:

# Let's train the model
from sklearn import linear_model

# Create a new Linear Regression model
lr = linear_model.LinearRegression()

# Train the model
lr.fit(x, y)

# Print the coefficients
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=['Coefficient'])
print(coeffs)

Output:

               Coefficient
age             263.807602
bmi              98.637188
sex            -488.091970
children        515.971652
smoker        13431.633343
obese          -805.123043
smoker_obese  19734.622381

Understanding the New Coefficients

Let's interpret what these coefficients tell us:

  • age (263.81): Each additional year adds ~$264 to charges

  • bmi (98.64): Each BMI unit adds ~$99 to charges (note: much lower than before)

  • sex (-488.09): Males have ~$488 lower charges than females (interesting!)

  • children (515.97): Each child adds ~$516 to charges

  • smoker (13,431.63): Smoking adds a whopping ~$13,432 to charges!

  • obese (-805.12): Obesity flag alone actually shows negative effect (because the interaction term captures the real impact)

  • smoker_obese (19,734.62): Being both a smoker AND obese adds an additional ~$19,735!

The smoker_obese coefficient is the highest, confirming our hypothesis that this combination is especially costly.
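For intuition on how these coefficients combine, note that a linear regression prediction is just the intercept plus the dot product of the coefficients with the feature row. A minimal sketch on hypothetical two-feature data (not the insurance set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, just to demonstrate the mechanics
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 4.0]])
y = np.array([5.0, 4.0, 7.0, 9.0])
lr = LinearRegression().fit(X, y)

# A prediction decomposes into intercept + sum(coefficient * feature value)
x_row = np.array([2.0, 3.0])
manual = lr.intercept_ + np.dot(lr.coef_, x_row)
print(np.isclose(manual, lr.predict([x_row])[0]))  # True
```

So in our model, a smoker's prediction shifts by the smoker coefficient, and an obese smoker additionally picks up the smoker_obese coefficient on top.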

Making Predictions

Let's use our improved model to make predictions:

# Let's try to predict
predictions = lr.predict(x)
print(predictions)

scores = pd.DataFrame({'Actual': y, 'Predicted': predictions})
scores.head()

Output:

[16316.56695109  2422.88293888  6016.95162578 ...  2698.805796
  3205.41071655 27511.89173746]

        Actual     Predicted
0  16884.92400  16316.566951
1   1725.55230   2422.882939
2   4449.46200   6016.951626
3  21984.47061   5577.727872
4   3866.85520   5923.004906

Most predictions are now much closer to the actual values than in our first model, though a few rows (row 3, for example) are still well off the mark.

Evaluating the Improved Model

Now for the moment of truth - let's see how much our feature engineering improved the model:

from sklearn import metrics

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, predictions)))
print('Mean Absolute Error:', metrics.mean_absolute_error(y, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y, predictions))

print("Average Cost:", y.mean())
print("R-squared:", metrics.r2_score(y, predictions))

Output:

Root Mean Squared Error: 4490.387801338095
Mean Absolute Error: 2460.035500296957
Mean Squared Error: 20163582.606405977
Average Cost: 13270.422265141257
R-squared: 0.8624047908410836
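As a quick sanity check on how these metrics relate (RMSE is simply the square root of MSE), here is a tiny sketch on hypothetical actual/predicted pairs:

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual vs predicted values; the errors are 10, -10, 30
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

mse = metrics.mean_squared_error(actual, predicted)   # mean of squared errors
rmse = np.sqrt(mse)                                   # back in the units of the target
mae = metrics.mean_absolute_error(actual, predicted)  # mean of absolute errors
print(round(mse, 2), round(rmse, 2), round(mae, 2))
# 366.67 19.15 16.67
```

Notice that RMSE penalizes the 30-unit outlier more heavily than MAE does, which is why the two values differ.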

Performance Comparison

Let's compare our improved model with the original:

Metric       Original Model   Improved Model   Change
RMSE         $11,336          $4,490           ✅ 60% reduction
MAE          $8,982           $2,460           ✅ 73% reduction
R-squared    0.123            0.862            ✅ 601% increase

What This Means

Our improved model explains 86.2% of the variance in insurance charges, compared to just 12.3% before. This is a dramatic improvement!

  • RMSE dropped by 60%: Our predictions are now much more accurate

  • MAE dropped by 73%: The average prediction error is just $2,460 instead of $8,982

  • R-squared increased to 0.862: We now explain 86.2% of the variation in charges

This demonstrates the enormous power of feature engineering!

Key Takeaways

  1. Feature engineering is powerful: Simple feature engineering improved R² from 0.123 to 0.862

  2. Domain knowledge matters: Understanding obesity thresholds helped create meaningful features

  3. Interaction features capture combined effects: The smoker_obese feature was crucial

  4. Visualization guides feature creation: Plotting helped us discover the smoking and obesity patterns

  5. Small datasets benefit greatly from good features: With only 1,338 records, feature engineering was essential

Why Did This Work So Well?

Our feature engineering succeeded because:

  1. Domain-driven: We used medical knowledge (BMI ≥ 30 for obesity) to create meaningful categories

  2. Captured non-linearity: The obesity flag helped the linear model capture threshold effects

  3. Interaction effects: The smoker_obese feature captured the amplified risk of combined factors

  4. Data-driven discovery: Visualization helped us identify which features to engineer

Next Steps

To further improve this model, you could:

  1. Create more interaction features: Try age * smoker, bmi * age, etc.

  2. Polynomial features: Create squared or cubed terms (age², bmi², etc.)

  3. Encode region: We excluded region - adding it might help

  4. Try other algorithms: Random Forest or Gradient Boosting might capture even more patterns

  5. Cross-validation: Use proper train/test splits to validate performance
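As a preview of item 5, here is a minimal sketch of a train/test split plus 5-fold cross-validation on synthetic data whose structure loosely resembles ours (the effect sizes and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: charges driven by age and a smoker flag plus noise
rng = np.random.default_rng(42)
n = 200
age = rng.integers(18, 65, n)
smoker = rng.integers(0, 2, n)
charges = 250 * age + 13000 * smoker + rng.normal(0, 2000, n)
X = np.column_stack([age, smoker])

# Hold out 20% of the rows so the model is scored on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))

# 5-fold cross-validation: five different train/test splits, averaged
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, charges, cv=5).mean())
```

Fitting and scoring on the same rows, as we did in this tutorial, tends to overstate performance; held-out scores are the honest estimate.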

Conclusion

Congratulations! You've seen firsthand how powerful feature engineering can be. By adding just two simple features (obesity flag and smoker-obesity interaction), we improved our model's R² from 0.123 to 0.862 - a massive improvement!

This tutorial demonstrates a key principle in machine learning: Better features often matter more than better algorithms. Before reaching for complex deep learning models, invest time in understanding your data and engineering meaningful features.

Remember the workflow:

  1. Explore your data through visualization

  2. Apply domain knowledge to create meaningful features

  3. Test interaction effects between important variables

  4. Evaluate and iterate on your features

In our next post, we'll explore train-test splits, cross-validation, and how to properly evaluate model performance to avoid overfitting.


What's Next? Stay tuned for our next post where we'll explore proper model validation techniques and introduce regularization!

Have questions about feature engineering? Feel free to reach out or leave a comment below.
