Machine Learning Basics: Building Your First Simple Linear Regression Model
Introduction
Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: Linear Regression. We'll build a simple linear regression model to predict insurance charges based on various demographic and health factors.
Linear regression is an excellent starting point for anyone learning machine learning because it's intuitive, interpretable, and forms the foundation for many more complex algorithms.
What You'll Learn
By the end of this tutorial, you'll understand:
How to prepare data for machine learning
The basics of linear regression
How to build and train a linear regression model
How to evaluate model performance
How to interpret model coefficients
The Dataset
We'll be working with a health insurance dataset that contains information about:
Age: Age of the individual
Sex: Gender (male/female)
BMI: Body Mass Index
Children: Number of children/dependents
Smoker: Whether the person smokes (yes/no)
Region: Geographic region
Charges: Medical insurance charges (our target variable)
Getting Started
First, let's import the necessary libraries and load our dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics
import numpy as np
# Load the insurance dataset
df = pd.read_csv('insurance.csv')
print(df.head())
Output:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
Let's examine the structure of our data:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
This gives us important information about our dataset:
1,338 entries (rows)
7 columns with no missing values
Mix of numerical (age, bmi, charges) and categorical (sex, smoker, region) data
Data Preprocessing
Most machine learning algorithms require numerical input, so we need to convert categorical variables to numbers. This process is called encoding.
# Convert categorical variables to numerical
df['sex'] = df['sex'].replace({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 0})
# Print the modified DataFrame to show the result
print("\nDataFrame after converting 'male' to 1 and 'female' to 0:")
print(df)
Output:
DataFrame after converting 'male' to 1 and 'female' to 0:
age sex bmi children smoker region charges
0 19 0 27.900 0 0 southwest 16884.92400
1 18 1 33.770 1 1 southeast 1725.55230
2 28 1 33.000 3 1 southeast 4449.46200
3 33 1 22.705 0 1 northwest 21984.47061
4 32 1 28.880 0 1 northwest 3866.85520
... ... ... ... ... ... ... ...
1333 50 1 30.970 3 1 northwest 10600.54830
1334 18 0 31.920 0 0 northeast 2205.98080
1335 18 0 36.850 0 0 southeast 1629.83350
1336 21 0 25.800 0 0 southwest 2007.94500
1337 61 0 29.070 0 0 northwest 29141.36030
[1338 rows x 7 columns]
Perfect! Now we can see that the categorical columns are encoded:
Sex: female = 0, male = 1
Smoker: no = 0, yes = 1
Exploratory Data Analysis
Before building our model, it's crucial to understand the relationships in our data. Visualization helps us identify patterns and potential issues.
# Create pairplot to visualize relationships
sns.pairplot(df)
plt.show()

# Focus on relationships with our target variable (charges)
sns.pairplot(data=df[['age', 'bmi', 'children', 'smoker', 'sex', 'charges']],
x_vars=['age', 'smoker', 'bmi', 'sex'],
y_vars='charges',
aspect=1)
plt.show()

These visualizations help us understand:
Which variables might be good predictors of insurance charges
Whether there are any obvious outliers
The distribution of our data
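Alongside the pairplots, a quick numeric complement is to check how strongly each feature correlates with the target. Here is a minimal sketch; the handful of rows below are taken from the dataset preview shown earlier (with sex/smoker already encoded), so it stands in for the full `insurance.csv`:

```python
import pandas as pd

# A few rows from the dataset preview, standing in for the full DataFrame
df = pd.DataFrame({
    'age':     [19, 18, 28, 33, 32],
    'bmi':     [27.9, 33.77, 33.0, 22.705, 28.88],
    'smoker':  [1, 0, 0, 0, 0],
    'charges': [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

# Pearson correlation of each feature with the target column
corr_with_charges = df.corr()['charges'].drop('charges')
print(corr_with_charges.sort_values(ascending=False))
```

On the full dataset you would run `df.corr()` after encoding; features with correlations near zero are unlikely to help a linear model.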
Preparing the Data for Machine Learning
In machine learning, we separate our data into:
Features (X): The input variables we use to make predictions
Target (y): The variable we want to predict
# Select features (first 5 columns excluding region)
x = df.iloc[:, :5] # age, sex, bmi, children, smoker
y = df.iloc[:, 6] # charges
print(x.head())
print(y.head())
Output:
age sex bmi children smoker
0 19 0 27.900 0 0
1 18 1 33.770 1 1
2 28 1 33.000 3 1
3 33 1 22.705 0 1
4 32 1 28.880 0 1
0 16884.92400
1 1725.55230
2 4449.46200
3 21984.47061
4 3866.85520
Name: charges, dtype: float64
Building the Linear Regression Model
Now for the exciting part - building our machine learning model!
# Create and train the linear regression model
lr = linear_model.LinearRegression()
lr.fit(x, y)
# Display the coefficients alongside their feature names
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=['Coefficient'])
print(coeffs)
Output:
Coefficient
age 241.263511
sex 660.859891
bmi 326.761491
children 533.168130
smoker 660.859891
Understanding the Coefficients
The coefficients tell us how much each feature influences the insurance charges:
Age (241.26): For each additional year of age, insurance charges increase by ~$241
Sex (660.86): Being male (vs female) increases charges by ~$661
BMI (326.76): Each unit increase in BMI adds ~$327 to charges
Children (533.17): Each additional child increases charges by ~$533
Smoker (660.86): Being a smoker increases charges by ~$661
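Under the hood, a linear regression prediction is just the intercept plus the weighted sum of the features. As a sanity check, you can reproduce `lr.predict` by hand; this sketch fits a tiny model on two features (values borrowed from the dataset preview) rather than the full insurance data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: two features (age, bmi) and charges
X = np.array([[19, 27.9],
              [18, 33.77],
              [28, 33.0],
              [33, 22.705]])
y = np.array([16884.92, 1725.55, 4449.46, 21984.47])

lr = LinearRegression().fit(X, y)

# Manual prediction: intercept + dot(features, coefficients)
manual = lr.intercept_ + X @ lr.coef_
assert np.allclose(manual, lr.predict(X))
print(manual)
```

This is exactly why the coefficients are interpretable: each one is the dollar amount added to the prediction per unit of its feature, holding the others fixed.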
Making Predictions
Let's use our trained model to make predictions:
# Make predictions on our training data
predictions = lr.predict(x)
print(predictions)
# Compare actual vs predicted values
scores = pd.DataFrame({'Actual': y, 'Predicted': predictions})
print(scores.head())
Output:
[ 6240.68269989 9772.39705015 12999.76207347 ... 8923.93452889
6037.01059213 16756.06111267]
Actual Predicted
0 16884.92400 6240.682700
1 1725.55230 9772.397050
2 4449.46200 12999.762073
3 21984.47061 9242.565695
4 3866.85520 11019.054388
Evaluating Model Performance
It's crucial to evaluate how well our model performs. We'll use several metrics:
# Calculate performance metrics
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, predictions)))
print('Mean Absolute Error:', metrics.mean_absolute_error(y, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print("Average Cost:", y.mean())
print("R-squared:", metrics.r2_score(y, predictions))
Output:
Root Mean Squared Error: 11336.133773688362
Mean Absolute Error: 8982.350383484953
Mean Squared Error: 128507928.93495792
Average Cost: 13270.422265141257
R-squared: 0.12306876681889345
Understanding the Metrics
RMSE (11,336): A typical prediction error is around $11,336, with large errors weighted more heavily
MAE (8,982): On average, predictions are off by about $8,982
R-squared (0.123): Our model explains about 12.3% of the variance in insurance charges
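It can help to see what these metrics actually compute. The sketch below reproduces RMSE, MAE, and R-squared from their definitions; the actual/predicted values are the five rows from the comparison table above, so the numbers here are illustrative, not the full-dataset scores:

```python
import numpy as np
from sklearn import metrics

# Five actual/predicted pairs from the comparison table above
y_true = np.array([16884.92, 1725.55, 4449.46, 21984.47, 3866.86])
y_pred = np.array([6240.68, 9772.40, 12999.76, 9242.57, 11019.05])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
mae = np.mean(np.abs(errors))                   # mean absolute error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # R-squared

# The hand-rolled versions match scikit-learn's implementations
assert np.isclose(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, metrics.r2_score(y_true, y_pred))
```

Note that R-squared compares the model's squared error to that of always predicting the mean; it can even go negative when the model does worse than that baseline.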
What Does This Mean?
An R-squared of 0.123 means our simple model only explains about 12% of the variation in insurance charges. This suggests that:
The model is quite basic - there's room for improvement
Important features might be missing - perhaps we need more variables
The relationship might not be purely linear - we might need more sophisticated models
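One caveat worth flagging: the metrics above were computed on the same data the model was trained on, which tends to be optimistic. A common improvement is to hold out a test set and evaluate only on it. Here is a minimal sketch on synthetic stand-in data (the real workflow would pass the insurance features and charges instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in data: 200 samples, 3 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([240.0, 330.0, 530.0]) + rng.normal(scale=50.0, size=200)

# Train on 80% of the rows, evaluate on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, lr.predict(X_test))
print('Test R-squared:', test_r2)
```

A large gap between training and test scores is a sign of overfitting; for this dataset the two are likely to be close, since the model is so simple.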
Key Takeaways
Linear regression is interpretable: We can easily understand how each feature affects the outcome
Data preprocessing is crucial: Converting categorical variables to numerical format is essential
Visualization helps: Exploring data relationships guides model building
Model evaluation is important: Metrics help us understand model performance
Simple models are a good starting point: Even basic models provide valuable insights
Next Steps
To improve this model, you could:
Feature engineering: Create new features or transform existing ones
Include more variables: Add the 'region' variable after proper encoding
Try different algorithms: Random Forest, Support Vector Machines, etc.
Handle outliers: Identify and address unusual data points
Cross-validation: Use better evaluation techniques
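For the 'region' step above, pandas one-hot encoding is one straightforward option. A minimal sketch, using a few illustrative rows with the dataset's actual region categories:

```python
import pandas as pd

# Illustrative rows with the dataset's 'region' categories
df = pd.DataFrame({
    'age': [19, 18, 28],
    'region': ['southwest', 'southeast', 'northwest'],
})

# One-hot encode region; drop_first avoids a redundant (collinear) column
encoded = pd.get_dummies(df, columns=['region'], drop_first=True)
print(encoded.columns.tolist())
```

The new indicator columns can then be appended to the feature matrix `x` before refitting the model.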
Conclusion
Congratulations! You've built your first machine learning model using linear regression. While this simple model has limitations (R² of 0.123), it demonstrates the fundamental machine learning workflow:
Data collection and exploration
Data preprocessing
Model training
Prediction and evaluation
This foundation will serve you well as you explore more advanced machine learning techniques. In our next post, we'll explore how to improve this model and introduce more sophisticated algorithms.
What's Next? Stay tuned for our next post where we'll explore multiple linear regression with feature engineering and better evaluation techniques!
Have questions about this tutorial? Feel free to reach out or leave a comment below.


