
Machine Learning Basics: Building Your First Simple Linear Regression Model


Introduction

Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: Linear Regression. We'll build a simple linear regression model to predict insurance charges based on various demographic and health factors.

Linear regression is an excellent starting point for anyone learning machine learning because it's intuitive, interpretable, and forms the foundation for many more complex algorithms.

What You'll Learn

By the end of this tutorial, you'll understand:

  • How to prepare data for machine learning

  • The basics of linear regression

  • How to build and train a linear regression model

  • How to evaluate model performance

  • How to interpret model coefficients

The Dataset

We'll be working with a health insurance dataset that contains information about:

  • Age: Age of the individual

  • Sex: Gender (male/female)

  • BMI: Body Mass Index

  • Children: Number of children/dependents

  • Smoker: Whether the person smokes (yes/no)

  • Region: Geographic region

  • Charges: Medical insurance charges (our target variable)

Getting Started

First, let's import the necessary libraries and load our dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics
import numpy as np

# Load the insurance dataset
df = pd.read_csv('insurance.csv')
print(df.head())

Output:

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Let's examine the structure of our data:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

This gives us important information about our dataset:

  • 1,338 entries (rows)

  • 7 columns with no missing values

  • Mix of numerical (age, bmi, charges) and categorical (sex, smoker, region) data

Data Preprocessing

Machine learning algorithms work with numerical data, so we need to convert categorical variables to numerical format. This process is called encoding.

# Convert categorical variables to numerical
df['sex'] = df['sex'].replace({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 0})

# Print the modified DataFrame to show the result
print("\nDataFrame after converting 'male' to 1 and 'female' to 0:")
print(df)

Output:

DataFrame after converting 'male' to 1 and 'female' to 0:
      age  sex     bmi  children  smoker     region      charges
0      19    0  27.900         0       1  southwest  16884.92400
1      18    1  33.770         1       0  southeast   1725.55230
2      28    1  33.000         3       0  southeast   4449.46200
3      33    1  22.705         0       0  northwest  21984.47061
4      32    1  28.880         0       0  northwest   3866.85520
...   ...  ...     ...       ...     ...        ...          ...
1333   50    1  30.970         3       0  northwest  10600.54830
1334   18    0  31.920         0       0  northeast   2205.98080
1335   18    0  36.850         0       0  southeast   1629.83350
1336   21    0  25.800         0       0  southwest   2007.94500
1337   61    0  29.070         0       1  northwest  29141.36030

[1338 rows x 7 columns]

Perfect! Now we can see that:

  • Sex: female = 0, male = 1

  • Smoker: no = 0, yes = 1
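
The remaining categorical column, `region`, has four categories with no natural ordering, so mapping it to 0–3 would impose a false ranking. A common alternative is one-hot encoding with `pd.get_dummies`. Here is a minimal sketch on a toy frame (the column names are assumed to match the dataset):

```python
import pandas as pd

# Toy frame standing in for the insurance data (values are illustrative)
df = pd.DataFrame({
    'age': [19, 18, 28],
    'region': ['southwest', 'southeast', 'southeast'],
})

# One-hot encode region; drop_first avoids a redundant (collinear) column
encoded = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, one category becomes the implicit baseline, which keeps the features linearly independent for regression.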

Exploratory Data Analysis

Before building our model, it's crucial to understand the relationships in our data. Visualization helps us identify patterns and potential issues.

# Create pairplot to visualize relationships
sns.pairplot(df)
plt.show()

# Focus on relationships with our target variable (charges)
sns.pairplot(data=df[['age', 'bmi', 'children', 'smoker', 'sex', 'charges']],
             x_vars=['age', 'smoker', 'bmi', 'sex'],
             y_vars='charges',
             aspect=1)
plt.show()

These visualizations help us understand:

  • Which variables might be good predictors of insurance charges

  • Whether there are any obvious outliers

  • The distribution of our data
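
Beyond pairplots, a correlation matrix gives a quick numeric summary of how strongly each feature tracks `charges`. A sketch on a small illustrative stand-in (the real numbers come from running this on the full dataset):

```python
import pandas as pd

# Illustrative stand-in for the encoded insurance data
df = pd.DataFrame({
    'age':     [19, 18, 28, 33, 32],
    'bmi':     [27.9, 33.77, 33.0, 22.705, 28.88],
    'smoker':  [1, 0, 0, 0, 0],
    'charges': [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

# Pairwise Pearson correlations; the 'charges' column ranks candidate predictors
corr = df.corr(numeric_only=True)
print(corr['charges'].sort_values(ascending=False))
```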

Preparing the Data for Machine Learning

In machine learning, we separate our data into:

  • Features (X): The input variables we use to make predictions

  • Target (y): The variable we want to predict

# Features: the first five columns (age, sex, bmi, children, smoker)
x = df.iloc[:, :5]
# Target: charges (column 6); region (column 5) is left out for now
y = df.iloc[:, 6]

print(x.head())
print(y.head())

Output:

   age  sex     bmi  children  smoker
0   19    0  27.900         0       0
1   18    1  33.770         1       1
2   28    1  33.000         3       1
3   33    1  22.705         0       1
4   32    1  28.880         0       1

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

Building the Linear Regression Model

Now for the exciting part - building our machine learning model!

# Create and train the linear regression model
lr = linear_model.LinearRegression()
lr.fit(x, y)

# Display the coefficients
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=['Coefficient'])
coeffs

Output:

          Coefficient
age        241.263511
sex        660.859891
bmi        326.761491
children   533.168130
smoker     660.859891

Understanding the Coefficients

The coefficients tell us how much each feature influences the insurance charges, holding the other features constant:

  • Age (241.26): For each additional year of age, insurance charges increase by ~$241

  • Sex (660.86): Being male (vs female) increases charges by ~$661

  • BMI (326.76): Each unit increase in BMI adds ~$327 to charges

  • Children (533.17): Each additional child increases charges by ~$533

  • Smoker (660.86): Being a smoker increases charges by ~$661
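
The model also learns an intercept (`lr.intercept_`), and a prediction is just the intercept plus the coefficient-weighted sum of the feature values. A small self-contained sketch on made-up data (not the fitted values above) shows the arithmetic:

```python
import numpy as np
from sklearn import linear_model

# Tiny illustrative data: charges rise with age and smoking status
X = np.array([[19, 0], [30, 0], [45, 1], [60, 1]], dtype=float)  # [age, smoker]
y = np.array([2000.0, 4000.0, 20000.0, 26000.0])

lr = linear_model.LinearRegression()
lr.fit(X, y)

# A prediction is intercept + sum(coefficient * feature value)
x_new = np.array([40.0, 1.0])
manual = lr.intercept_ + np.dot(lr.coef_, x_new)
model_pred = lr.predict(x_new.reshape(1, -1))[0]
print(manual, model_pred)  # the two values agree
```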

Making Predictions

Let's use our trained model to make predictions:

# Make predictions on our training data
predictions = lr.predict(x)
print(predictions)

# Compare actual vs predicted values
scores = pd.DataFrame({'Actual': y, 'Predicted': predictions})
scores.head()

Output:

[ 6240.68269989  9772.39705015 12999.76207347 ...  8923.93452889
  6037.01059213 16756.06111267]

        Actual     Predicted
0  16884.92400   6240.682700
1   1725.55230   9772.397050
2   4449.46200  12999.762073
3  21984.47061   9242.565695
4   3866.85520  11019.054388
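
A quick way to eyeball fit quality is to plot actual against predicted charges; points near the diagonal are well-predicted. A sketch assuming `y` and `predictions` from above (toy stand-in values here):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Toy stand-ins for the actual and predicted charges computed above
y = np.array([16884.92, 1725.55, 4449.46, 21984.47, 3866.86])
predictions = np.array([6240.68, 9772.40, 12999.76, 9242.57, 11019.05])

fig, ax = plt.subplots()
ax.scatter(y, predictions)
lims = [min(y.min(), predictions.min()), max(y.max(), predictions.max())]
ax.plot(lims, lims, linestyle='--')  # perfect-prediction diagonal
ax.set_xlabel('Actual charges')
ax.set_ylabel('Predicted charges')
fig.savefig('actual_vs_predicted.png')
```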

Evaluating Model Performance

It's crucial to evaluate how well our model performs. We'll use several metrics:

# Calculate performance metrics
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, predictions)))
print('Mean Absolute Error:', metrics.mean_absolute_error(y, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y, predictions))

print("Average Cost:", y.mean())
print("R-squared:", metrics.r2_score(y, predictions))

Output:

Root Mean Squared Error: 11336.133773688362
Mean Absolute Error: 8982.350383484953
Mean Squared Error: 128507928.93495792
Average Cost: 13270.422265141257
R-squared: 0.12306876681889345

Understanding the Metrics

  • RMSE (11,336): The typical prediction error is about $11,336 (unlike MAE, this penalizes large misses more heavily)

  • MAE (8,982): The average absolute error is about $8,982

  • R-squared (0.123): Our model explains about 12.3% of the variance in insurance charges
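
R-squared compares the model's squared error to the error of simply predicting the mean of the target; computing it by hand makes the metric less mysterious. A sketch with toy numbers (not the dataset's values):

```python
import numpy as np

# Toy actual and predicted values
y = np.array([3.0, 5.0, 7.0, 9.0])
pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y - pred) ** 2)      # squared error left by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

An R² of 1 means perfect predictions; 0 means no better than predicting the mean; negative values mean worse than the mean.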

What Does This Mean?

An R-squared of 0.123 means our simple model only explains about 12% of the variation in insurance charges. This suggests that:

  1. The model is quite basic - there's room for improvement

  2. Important features might be missing - perhaps we need more variables

  3. The relationship might not be purely linear - we might need more sophisticated models

Key Takeaways

  1. Linear regression is interpretable: We can easily understand how each feature affects the outcome

  2. Data preprocessing is crucial: Converting categorical variables to numerical format is essential

  3. Visualization helps: Exploring data relationships guides model building

  4. Model evaluation is important: Metrics help us understand model performance

  5. Simple models are a good starting point: Even basic models provide valuable insights

Next Steps

To improve this model, you could:

  1. Feature engineering: Create new features or transform existing ones

  2. Include more variables: Add the 'region' variable after proper encoding

  3. Try different algorithms: Random Forest, Support Vector Machines, etc.

  4. Handle outliers: Identify and address unusual data points

  5. Cross-validation: Use better evaluation techniques
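
As a preview of those evaluation improvements: the metrics above were computed on the same data the model was trained on, which tends to flatter the score. Holding out a test set, or cross-validating, gives a more honest estimate. A minimal sketch on synthetic data (swap in the insurance features and target for real use):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for the insurance features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([241.0, 660.0, 326.0, 533.0, 660.0]) + rng.normal(scale=50.0, size=200)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = linear_model.LinearRegression().fit(X_train, y_train)
print('held-out R^2:', model.score(X_test, y_test))

# 5-fold cross-validation: five different train/test splits, five R^2 scores
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5, scoring='r2')
print('CV R^2 scores:', cv_scores)
```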

Conclusion

Congratulations! You've built your first machine learning model using linear regression. While this simple model has limitations (R² of 0.123), it demonstrates the fundamental machine learning workflow:

  1. Data collection and exploration

  2. Data preprocessing

  3. Model training

  4. Prediction and evaluation

This foundation will serve you well as you explore more advanced machine learning techniques. In our next post, we'll explore how to improve this model and introduce more sophisticated algorithms.


What's Next? Stay tuned for our next post where we'll explore multiple linear regression with feature engineering and better evaluation techniques!

Have questions about this tutorial? Feel free to reach out or leave a comment below.