<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[PraHari Tech]]></title><description><![CDATA[PraHari Tech]]></description><link>https://prahari.net</link><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 10:49:43 GMT</lastBuildDate><atom:link href="https://prahari.net/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Giving My Robot Its First Sense: HC-SR04 Ultrasonic on Arduino UNO Q]]></title><description><![CDATA[In the last post, I got the Arduino UNO Q working from the command line — blink LED, SSH, deploy apps.
Now it's time to give the board its first real input: distance sensing with an HC-SR04 ultrasonic]]></description><link>https://prahari.net/giving-my-robot-its-first-sense-hc-sr04-ultrasonic-on-arduino-uno-q</link><guid isPermaLink="true">https://prahari.net/giving-my-robot-its-first-sense-hc-sr04-ultrasonic-on-arduino-uno-q</guid><category><![CDATA[arduino]]></category><category><![CDATA[uno-q]]></category><category><![CDATA[hc-sr04]]></category><category><![CDATA[Ultrasonic]]></category><category><![CDATA[sensors]]></category><category><![CDATA[zephyr]]></category><category><![CDATA[iot]]></category><category><![CDATA[robotics]]></category><category><![CDATA[elder care]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Wed, 01 Apr 2026 07:20:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/642589c67e2a99b3d04a6166/3c5e78fe-1974-4d17-a0c6-90f37f4259f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a href="https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way">last post</a>, I got the Arduino UNO Q working from the command line — blink LED, SSH, deploy apps.</p>
<p>Now it's time to give the board its first real input: distance sensing with an HC-SR04 ultrasonic sensor. This is the sensor that will eventually let my eldercare robot detect walls, furniture, and doorways as it patrols a home.</p>
<p>But first — can <code>pulseIn()</code> even work on Zephyr RTOS?</p>
<h2>Why This Matters</h2>
<p>HomeGuard Parivaar needs to navigate rooms autonomously. That means obstacle detection. The HC-SR04 is the cheapest, most reliable way to measure distance — if it works on the UNO Q's STM32 MCU running Zephyr.</p>
<p>The interesting part isn't the sensor itself (it's well-documented everywhere). It's the <strong>pattern</strong>:</p>
<ul>
<li><p>The MCU reads the sensor in its own tight loop</p>
</li>
<li><p>The Python side on the MPU polls for data via Bridge whenever it needs it</p>
</li>
<li><p>Neither side blocks the other</p>
</li>
</ul>
<p>This decoupled architecture is how all of the robot's sensors will work.</p>
<h2>What We're Building</h2>
<p>By the end of this post, you'll have:</p>
<ul>
<li><p>An HC-SR04 wired to the UNO Q's MCU pins</p>
</li>
<li><p>A sketch that reads distance continuously and exposes it via Bridge</p>
</li>
<li><p>A Python script that polls and prints distance readings</p>
</li>
<li><p>Confidence that timing-sensitive Arduino functions work under Zephyr</p>
</li>
</ul>
<h2>Hardware You'll Need</h2>
<table>
<thead>
<tr>
<th>Component</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td>Arduino UNO Q</td>
<td>CLI setup complete (<a href="https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way">see Part 1</a>)</td>
</tr>
<tr>
<td>HC-SR04 ultrasonic sensor</td>
<td>~$2, widely available</td>
</tr>
<tr>
<td>4x jumper wires (F-M)</td>
<td>VCC, GND, TRIG, ECHO</td>
</tr>
</tbody></table>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-hcsr04-sensor.jpg" alt="" style="display:block;margin:0 auto" />

<p>I used a breadboard to keep the wiring tidy.</p>
<h2>Wiring</h2>
<table>
<thead>
<tr>
<th>HC-SR04 Pin</th>
<th>UNO Q Pin</th>
</tr>
</thead>
<tbody><tr>
<td>VCC</td>
<td>5V</td>
</tr>
<tr>
<td>GND</td>
<td>GND</td>
</tr>
<tr>
<td>TRIG</td>
<td>D2</td>
</tr>
<tr>
<td>ECHO</td>
<td>D3</td>
</tr>
</tbody></table>
<p><strong>No voltage divider needed.</strong> The STM32U585's digital pins are 5V tolerant (except A0/A1 — a detail from Lab 1.1 that saved me a resistor here).</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-hcsr04-wiring.jpg" alt="" style="display:block;margin:0 auto" />

<h2>The App Structure</h2>
<p>Same pattern as the blink app from Part 1 — an <code>app.yaml</code>, a sketch folder, and a Python folder:</p>
<pre><code class="language-plaintext">q-sonar/
├── app.yaml
├── sketch/
│   ├── sketch.ino
│   └── sketch.yaml
└── python/
    ├── main.py
    └── requirements.txt
</code></pre>
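<p>The <code>app.yaml</code> can stay as minimal as the blink app's. A plausible version, reusing the same fields from Part 1 (the name and description here are my placeholders, not from the actual project):</p>

```yaml
name: Q Sonar
description: "HC-SR04 distance readings via Bridge"
version: "1.0.0"
ports: []
bricks: []
```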
<h2>The MCU Sketch</h2>
<p>The MCU does two things:</p>
<ul>
<li><p>Reads the sensor in <code>loop()</code> every 100ms</p>
</li>
<li><p>Exposes the last reading via a Bridge function</p>
</li>
</ul>
<pre><code class="language-cpp">#include "Arduino_RouterBridge.h"

const int TRIG_PIN = 2;
const int ECHO_PIN = 3;

float last_distance_cm = -1.0;

void setup() {
    pinMode(TRIG_PIN, OUTPUT);
    pinMode(ECHO_PIN, INPUT);
    Bridge.begin();
    Bridge.provide("get_distance", get_distance);
}

void loop() {
    last_distance_cm = read_distance();
    delay(100);
}

float read_distance() {
    digitalWrite(TRIG_PIN, LOW);
    delayMicroseconds(2);
    digitalWrite(TRIG_PIN, HIGH);
    delayMicroseconds(10);
    digitalWrite(TRIG_PIN, LOW);

    long duration = pulseIn(ECHO_PIN, HIGH, 30000);  // 30ms timeout
    if (duration == 0) {
        return -1.0;  // No echo — out of range
    }
    return duration * 0.0343 / 2.0;  // cm
}

float get_distance() {
    return last_distance_cm;
}
</code></pre>
<p>A few things to note:</p>
<ul>
<li><p><code>pulseIn()</code> <strong>with a 30ms timeout</strong> — this caps the max range at ~5 meters (plenty for indoor rooms) and prevents the sketch from hanging if nothing echoes back.</p>
</li>
<li><p><code>0.0343 / 2.0</code> — speed of sound in cm/us, divided by 2 because the pulse travels to the object and back.</p>
</li>
<li><p><strong>The</strong> <code>loop()</code> <strong>is NOT empty this time.</strong> Unlike the blink app where the MCU just waited for Bridge calls, here the MCU actively samples the sensor. The Bridge function <code>get_distance()</code> just returns the latest cached reading.</p>
</li>
</ul>
<h2>The Python Side</h2>
<pre><code class="language-python">from arduino.app_utils import *
import time

def loop():
    distance = Bridge.call("get_distance")
    if distance &lt; 0:
        print("No echo — out of range")
    else:
        print(f"Distance: {distance:.1f} cm")
    time.sleep(1)

App.run(user_loop=loop)
</code></pre>
<p>The Python side polls once per second. The MCU samples 10x per second.</p>
<p>This means the Python side always gets a fresh reading without needing to worry about sensor timing.</p>
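<p>Ultrasonic readings are noisy (the jumpy log values below show it), so the host side may want light filtering before acting on a single value. A median-of-N helper is a cheap, common choice; here's a sketch with an injectable read function so it isn't tied to Bridge (the helper name is mine):</p>

```python
import statistics

def median_distance(read_fn, samples=5):
    """Call read_fn() several times, drop -1 (no-echo) readings,
    and return the median, or -1.0 if every sample timed out."""
    readings = [d for d in (read_fn() for _ in range(samples)) if d >= 0]
    return statistics.median(readings) if readings else -1.0

# On the board this would be something like:
#   distance = median_distance(lambda: Bridge.call("get_distance"))
```

<p>A median tolerates occasional spikes better than a mean, which matters when one stray echo off a corner can return a wildly wrong distance.</p>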
<h2>Deploy and Run</h2>
<pre><code class="language-bash"># Copy app to the board
ssh arduino-2gb 'mkdir -p ~/ArduinoApps/q_sonar'
scp -r q-sonar/* arduino-2gb:~/ArduinoApps/q_sonar/

# Start it
ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/q_sonar'
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-app-start.png" alt="App compile, flash, and start" style="display:block;margin:0 auto" />

<p>Check the logs:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/q_sonar'
</code></pre>
<pre><code class="language-plaintext">Distance: 46.3 cm
Distance: 137.2 cm
Distance: 44.6 cm
Distance: 48.0 cm
Distance: 17.1 cm
Distance: 138.9 cm
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-app-logs.png" alt="App logs showing distance readings" style="display:block;margin:0 auto" />

<p>Move your hand in front of the sensor — near readings (~17 cm) and far readings (~137 cm to the wall) both respond correctly.</p>
<h2>What Surprised Me</h2>
<p><strong>1.</strong> <code>pulseIn()</code> <strong>works perfectly on Zephyr.</strong> This was my main concern going in. Timing-sensitive functions can behave unpredictably under an RTOS, but the Arduino Zephyr core handles it cleanly. No jitter, no missed pulses.</p>
<p><strong>2. I wired it wrong first.</strong> TRIG and ECHO were swapped. The sketch deployed fine, but every reading came back as "No echo — out of range."</p>
<p>I spent a few minutes combing through the code for a Zephyr compatibility issue before realizing the real problem was two swapped wires. Check your wiring before debugging your code.</p>
<p><strong>3. No external libraries needed.</strong> HC-SR04 only uses <code>digitalRead</code>, <code>digitalWrite</code>, and <code>pulseIn</code> — all built into the Arduino core for Zephyr. The only dependency is <code>Arduino_RouterBridge</code> for the Bridge pattern.</p>
<p><strong>4. The MCU loop + Bridge pattern is the right architecture.</strong> The MCU samples at its own pace. The Python side reads when it needs to. Neither blocks the other.</p>
<p>This is exactly how the robot will work — the MCU manages real-time sensor reads, the MPU handles decision logic. One pattern, many sensors.</p>
<h2>What's Next</h2>
<p>The sensor works. The Bridge pattern works.</p>
<p>Next up: connecting a USB webcam to the MPU's Linux side — giving the robot eyes to go with its sonar. After that, bridging sensor data and camera feeds together for the robot's perception layer.</p>
<p>Follow along as I build an eldercare robot, one sensor at a time.</p>
<hr />
<p><em>This is part of my journey building</em> <em>HomeGuard Parivaar</em> <em>— an autonomous eldercare robot for Indian families, built with</em> <a href="https://store.arduino.cc/products/uno-q"><em>Arduino UNO Q.</em></a></p>
<p><em>This is a hobby project and I'm learning by building. If you have suggestions, corrections, or criticism — I'd genuinely love to hear it.</em></p>
<p><em>Co-authored with</em> <a href="https://claude.com/product/claude-code"><em>Claude Code</em></a> <em>(Anthropic) — my AI pair-programming partner for this build. Cover image generated with</em> <a href="https://gemini.google.com"><em>Gemini</em></a> <em>(Google).</em></p>
]]></content:encoded></item><item><title><![CDATA[Arduino UNO Q is NOT a Regular Arduino: What I Learned the Hard Way]]></title><description><![CDATA[If you just got an Arduino UNO Q and tried to use it like a classic Arduino, you probably hit a wall. I did. Here's the story of how Serial.println() taught me that the UNO Q is a fundamentally differ]]></description><link>https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way</link><guid isPermaLink="true">https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way</guid><category><![CDATA[arduino]]></category><category><![CDATA[uno-q]]></category><category><![CDATA[qualcomm]]></category><category><![CDATA[cli]]></category><category><![CDATA[embedded linux]]></category><category><![CDATA[iot]]></category><category><![CDATA[robotics]]></category><category><![CDATA[elder care]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Sun, 29 Mar 2026 14:03:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/642589c67e2a99b3d04a6166/c911a00d-4989-4554-b6d9-5471413b8c5b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you just got an Arduino UNO Q and tried to use it like a classic Arduino, you probably hit a wall. I did. Here's the story of how <code>Serial.println()</code> taught me that the UNO Q is a fundamentally different kind of board — and how to actually develop for it from the command line.</p>
<h2>Why This Matters</h2>
<p>I'm building HomeGuard Parivaar — an autonomous home health robot for Indian families managing eldercare from a distance. The Arduino UNO Q is the brain: its dual-processor architecture lets me run ML models on the Linux side while controlling motors and sensors from the Arduino side.</p>
<p>But before I could build anything, I had to understand how this board actually works. The official docs point you toward App Lab (the GUI editor). I wanted to use the CLI — VS Code, Claude Code, terminal workflows. Getting there took some wrong turns.</p>
<h2>What We're Building</h2>
<p>By the end of this post, you'll have:</p>
<ul>
<li><p><code>arduino-cli</code> installed with the UNO Q Zephyr core</p>
</li>
<li><p>SSH access to the board's Linux side</p>
</li>
<li><p>A working blink app deployed via the command line</p>
</li>
<li><p>An understanding of why the UNO Q needs a completely different development model</p>
</li>
</ul>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-arduino-uno-q-boot-sequence.gif" alt="" style="display:block;margin:0 auto" />

<h2>Hardware You'll Need</h2>
<table>
<thead>
<tr>
<th>Component</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td>Arduino UNO Q (2GB or 4GB)</td>
<td>Must complete <a href="https://docs.arduino.cc/tutorials/uno-q/user-manual/#first-use">first-boot setup</a> via App Lab first</td>
</tr>
<tr>
<td>USB-C data cable</td>
<td><strong>Must be a data cable</strong>, not charge-only</td>
</tr>
<tr>
<td>WiFi network</td>
<td>Board connects via WiFi for SSH access</td>
</tr>
</tbody></table>
<p><strong>Before you start:</strong> If you haven't set up your UNO Q yet, follow the <a href="https://docs.arduino.cc/tutorials/uno-q/user-manual/#first-use">First Use guide</a> to set your password, connect to WiFi, and update to the latest firmware. The CLI workflow in this post assumes your board is already initialized and on your network.</p>
<h2>Step 1: Install arduino-cli</h2>
<pre><code class="language-bash">curl -fsSL https://raw.githubusercontent.com/arduino/arduino-cli/master/install.sh | BINDIR=~/bin sh
export PATH="$HOME/bin:$PATH"  # Add to .bashrc for permanence
</code></pre>
<p>Initialize and install the UNO Q core:</p>
<pre><code class="language-bash">arduino-cli config init
arduino-cli core update-index
arduino-cli core install arduino:zephyr
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">arduino-cli board listall | grep "UNO Q"
# Arduino UNO Q    arduino:zephyr:unoq
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-board-listall.png" alt="" style="display:block;margin:0 auto" />

<h2>Step 2: The Classic Approach (and Why It Fails)</h2>
<p>If you're coming from Arduino UNO/Nano/Mega, your instinct is:</p>
<pre><code class="language-cpp">void setup() {
  Serial.begin(115200);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.println("Hello from UNO Q!");
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  Serial.println("LED ON");
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  Serial.println("LED OFF");
  delay(1000);
}
</code></pre>
<p>Compile and upload:</p>
<pre><code class="language-bash">arduino-cli compile --fqbn arduino:zephyr:unoq ./blink-test/
arduino-cli upload -p /dev/ttyACM0 --fqbn arduino:zephyr:unoq ./blink-test/
</code></pre>
<p>It compiles. It uploads. You open the serial monitor... <strong>nothing</strong>. No output. Maybe the LED blinks, maybe it doesn't.</p>
<p>What went wrong?</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-serial-monitor-empty.png" alt="" style="display:block;margin:0 auto" />

<h2>The Dual-Brain Architecture</h2>
<p>The UNO Q isn't a microcontroller with USB. It's <strong>two processors on one board</strong>:</p>
<table>
<thead>
<tr>
<th></th>
<th>MPU (Linux Brain)</th>
<th>MCU (Arduino Brain)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Chip</strong></td>
<td>Qualcomm QRB2210</td>
<td>ST STM32U585</td>
</tr>
<tr>
<td><strong>CPU</strong></td>
<td>4x Cortex-A53 @ 2.0 GHz</td>
<td>Cortex-M33 @ 160 MHz</td>
</tr>
<tr>
<td><strong>OS</strong></td>
<td>Debian Linux</td>
<td>Zephyr RTOS</td>
</tr>
<tr>
<td><strong>RAM</strong></td>
<td>2GB or 4GB</td>
<td>786 KB</td>
</tr>
<tr>
<td><strong>Manages</strong></td>
<td>WiFi, USB, camera, AI/ML, Python</td>
<td>GPIO, sensors, motors, PWM</td>
</tr>
</tbody></table>
<p>They talk to each other via <strong>Arduino Bridge</strong> — an RPC layer. And here's the critical detail:</p>
<p><strong>The USB-C port is managed by the MPU (Linux side), not the MCU.</strong></p>
<p>So when you call <code>Serial.println()</code> on the MCU, it writes to the hardware UART on pins D0/D1 — not to USB. To get output over USB, you need the <code>Monitor</code> object, which routes through the Bridge to the MPU. But the Bridge only works when the MPU is running its orchestration service.</p>
<p>When we called <code>Bridge.begin()</code> without the MPU side running, the sketch just hung. No blink, no serial, nothing.</p>
<h2>Step 3: The Correct Way — App-Based Development</h2>
<p>On the UNO Q, a project is an <strong>App</strong> with two halves:</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-structure.png" alt="" style="display:block;margin:0 auto" />

<p>The MCU sketch <strong>registers functions</strong> that the Python script can call:</p>
<p><strong>sketch/sketch.ino:</strong></p>
<pre><code class="language-cpp">#include "Arduino_RouterBridge.h"

void setup() {
    pinMode(LED_BUILTIN, OUTPUT);
    Bridge.begin();
    Bridge.provide("set_led_state", set_led_state);
}

void loop() {
}

void set_led_state(bool state) {
    digitalWrite(LED_BUILTIN, state ? LOW : HIGH);  // Active-low!
}
</code></pre>
<p>The Python script on the MPU <strong>drives the logic</strong>:</p>
<p><strong>python/main.py:</strong></p>
<pre><code class="language-python">from arduino.app_utils import *
import time

led_state = False

def loop():
    global led_state
    time.sleep(1)
    led_state = not led_state
    Bridge.call("set_led_state", led_state)
    print(f"LED {'ON' if led_state else 'OFF'}")

App.run(user_loop=loop)
</code></pre>
<p><strong>sketch/sketch.yaml:</strong></p>
<pre><code class="language-yaml">profiles:
  default:
    fqbn: arduino:zephyr:unoq
    platforms:
      - platform: arduino:zephyr
    libraries:
      - Arduino_RouterBridge (0.4.0)
      - Arduino_RPClite (0.2.1)
      - MsgPack (0.4.2)
      - DebugLog (0.8.4)
      - ArxContainer (0.7.0)
      - ArxTypeTraits (0.3.2)
default_profile: default
</code></pre>
<p><strong>app.yaml:</strong></p>
<pre><code class="language-yaml">name: LED Blink Test
description: "Simple LED blink via Bridge"
version: "1.0.0"
ports: []
bricks: []
</code></pre>
<h2>Step 4: SSH In and Deploy</h2>
<p>First, find your board's IP (from your router or App Lab). Then set up SSH:</p>
<pre><code class="language-bash">ssh arduino@&lt;YOUR_BOARD_IP&gt;
# Enter the password you set during first-boot setup
</code></pre>
<p>I added an SSH key and config alias so I can just do:</p>
<pre><code class="language-bash">ssh arduino-2gb
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-ssh-login.png" alt="" style="display:block;margin:0 auto" />

<p>Deploy the app:</p>
<pre><code class="language-bash"># From your host machine
ssh arduino-2gb 'mkdir -p ~/ArduinoApps/q_blink'
scp -r q-blink/* arduino-2gb:~/ArduinoApps/q_blink/
</code></pre>
<p>Start it:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/q_blink'
</code></pre>
<p>The first run downloads libraries, compiles the sketch <strong>on the board itself</strong> (yes, the 4-core Cortex-A53 compiles your Arduino sketch), flashes the MCU via SWD, and starts the Python container. After about 30 seconds:</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-start.png" alt="" style="display:block;margin:0 auto" />

<p>Check the logs:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/q_blink'
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-logs.png" alt="" style="display:block;margin:0 auto" />

<p>The LED blinks. The logs flow. It works.</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-led-blinking.gif" alt="" style="display:block;margin:0 auto" />

<h2>What Surprised Me</h2>
<p><strong>1.</strong> <code>Serial.println()</code> <strong>doesn't go to USB.</strong> On classic Arduino, Serial = USB. On UNO Q, Serial = hardware UART pins D0/D1. This tripped me up for an hour.</p>
<p><strong>2. The MCU sketch's</strong> <code>loop()</code> <strong>can be empty.</strong> The Python side drives the timing. The MCU just registers callbacks and waits. This is a paradigm shift — the MCU is a <em>service provider</em>, not the main loop.</p>
<p><strong>3. Compilation happens on-board.</strong> Your host machine doesn't need the Zephyr toolchain for deployment. The board's Linux side has <code>arduino-cli</code> and compiles locally.</p>
<p><strong>4. Python runs containerized.</strong> Docker compose manages the Python environment on the board. <code>requirements.txt</code> dependencies are auto-installed.</p>
<p><strong>5. The RGB LEDs are active-low.</strong> <code>digitalWrite(LED_BUILTIN, LOW)</code> turns the LED <em>on</em>. Classic Arduino gotcha, amplified by the UNO Q's unfamiliarity.</p>
<p><strong>6. Storage is tight.</strong> The 2GB variant has ~3GB free on a 9.8GB root partition. ML models and multiple apps will eat into this quickly.</p>
<h2>CLI Cheat Sheet</h2>
<pre><code class="language-bash"># Deploy
scp -r myapp/* arduino-2gb:~/ArduinoApps/myapp/

# Start / stop
ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/myapp'
ssh arduino-2gb 'arduino-app-cli app stop ~/ArduinoApps/myapp'

# View Python print() output
ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/myapp'

# View MCU Serial.println() output
ssh arduino-2gb 'arduino-app-cli monitor ~/ArduinoApps/myapp'

# Check what's running
ssh arduino-2gb 'arduino-app-cli app list'
</code></pre>
<h2>What's Next</h2>
<p>Now that the dev environment is working, I'm moving on to connecting sensors — starting with the HC-SR04 ultrasonic sensor for obstacle detection. The MCU will read the sensor, and the Python side will use the data for navigation decisions.</p>
<p>This is the foundation for HomeGuard Parivaar's autonomous patrol capability. Follow along as I build an eldercare robot, one sensor at a time.</p>
<hr />
<p><em>This is part of my journey building HomeGuard Parivaar — an eldercare robot for Indian families.</em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Basics: Improving Model Performance with Feature Engineering]]></title><description><![CDATA[Introduction
Welcome back to our Machine Learning Basics series! In our previous post, we built a simple linear regression model that achieved an R-squared score of only 0.123. While this gave us a good foundation, the model's predictive power was qu...]]></description><link>https://prahari.net/machine-learning-basics-improving-model-performance-with-feature-engineering</link><guid isPermaLink="true">https://prahari.net/machine-learning-basics-improving-model-performance-with-feature-engineering</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[feature engineering]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Thu, 02 Oct 2025 17:43:47 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Welcome back to our Machine Learning Basics series! In our <a target="_blank" href="https://blog.prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model">previous post</a>, we built a simple linear regression model that achieved an R-squared score of only 0.123. While this gave us a good foundation, the model's predictive power was quite limited.</p>
<p>In this tutorial, we'll explore <strong>feature engineering</strong> - one of the most powerful techniques in machine learning. By creating new features from existing data, we'll dramatically improve our model's performance from an R-squared of 0.123 to 0.862!</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this tutorial, you'll understand:</p>
<ul>
<li><p>What feature engineering is and why it matters</p>
</li>
<li><p>How to create new features from domain knowledge</p>
</li>
<li><p>The concept of interaction features</p>
</li>
<li><p>How feature engineering can dramatically improve model performance</p>
</li>
<li><p>The importance of data visualization in feature discovery</p>
</li>
</ul>
<h2 id="heading-what-is-feature-engineering">What is Feature Engineering?</h2>
<p><strong>Feature Engineering</strong> is the process of using domain knowledge to create new features (variables) from existing data that make machine learning algorithms work better. It's often considered more of an art than a science, requiring creativity and understanding of the problem domain.</p>
<p>Good features can:</p>
<ul>
<li><p>Capture important patterns in the data</p>
</li>
<li><p>Make relationships more apparent to the model</p>
</li>
<li><p>Significantly improve model accuracy</p>
</li>
</ul>
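<p>As a tiny preview of the interaction-feature idea (covered properly later in this post), here's a toy example with made-up rows, not the insurance data: multiplying two flags produces a feature that lets a linear model treat "obese smoker" differently from either flag alone.</p>

```python
import pandas as pd

toy = pd.DataFrame({
    "bmi":    [27.9, 33.8, 31.0],
    "smoker": [1,    0,    1],
})
toy["obese"] = (toy["bmi"] >= 30).astype("int8")     # WHO obesity threshold
toy["obese_smoker"] = toy["obese"] * toy["smoker"]   # interaction feature
```

<p>Only the third row gets a 1 in <code>obese_smoker</code>, because it's the only row where both conditions hold.</p>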
<h2 id="heading-getting-started">Getting Started</h2>
<p>Let's begin by loading our dataset and necessary libraries:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Install required modules and load the insurance dataset</span>
!pip install pandas seaborn matplotlib numpy
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
!curl -O https://raw.githubusercontent.com/stedy/Machine-Learning-<span class="hljs-keyword">with</span>-R-datasets/refs/heads/master/insurance.csv
df = pd.read_csv(<span class="hljs-string">'insurance.csv'</span>)
df.head()
</code></pre>
<p>We're using the same insurance dataset from the previous tutorial. Let's quickly remind ourselves what it contains.</p>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">   age     sex     bmi  children smoker     region      charges
<span class="hljs-number">0</span>   <span class="hljs-number">19</span>  female  <span class="hljs-number">27.900</span>         <span class="hljs-number">0</span>    yes  southwest  <span class="hljs-number">16884.92400</span>
<span class="hljs-number">1</span>   <span class="hljs-number">18</span>    male  <span class="hljs-number">33.770</span>         <span class="hljs-number">1</span>     no  southeast   <span class="hljs-number">1725.55230</span>
<span class="hljs-number">2</span>   <span class="hljs-number">28</span>    male  <span class="hljs-number">33.000</span>         <span class="hljs-number">3</span>     no  southeast   <span class="hljs-number">4449.46200</span>
<span class="hljs-number">3</span>   <span class="hljs-number">33</span>    male  <span class="hljs-number">22.705</span>         <span class="hljs-number">0</span>     no  northwest  <span class="hljs-number">21984.47061</span>
<span class="hljs-number">4</span>   <span class="hljs-number">32</span>    male  <span class="hljs-number">28.880</span>         <span class="hljs-number">0</span>     no  northwest   <span class="hljs-number">3866.85520</span>
</code></pre>
<h2 id="heading-verifying-data-structure">Verifying Data Structure</h2>
<p>Before we start engineering features, let's verify our dataset structure:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Check the dataset structure and data count</span>
df.info()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">&lt;<span class="hljs-class"><span class="hljs-keyword">class</span> '<span class="hljs-title">pandas</span>.<span class="hljs-title">core</span>.<span class="hljs-title">frame</span>.<span class="hljs-title">DataFrame</span>'&gt;
<span class="hljs-title">RangeIndex</span>:</span> <span class="hljs-number">1338</span> entries, <span class="hljs-number">0</span> to <span class="hljs-number">1337</span>
Data columns (total <span class="hljs-number">7</span> columns):
 <span class="hljs-comment">#   Column    Non-Null Count  Dtype</span>
---  ------    --------------  -----
 <span class="hljs-number">0</span>   age       <span class="hljs-number">1338</span> non-null   int64
 <span class="hljs-number">1</span>   sex       <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">2</span>   bmi       <span class="hljs-number">1338</span> non-null   float64
 <span class="hljs-number">3</span>   children  <span class="hljs-number">1338</span> non-null   int64
 <span class="hljs-number">4</span>   smoker    <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">5</span>   region    <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">6</span>   charges   <span class="hljs-number">1338</span> non-null   float64
dtypes: float64(<span class="hljs-number">2</span>), int64(<span class="hljs-number">2</span>), object(<span class="hljs-number">3</span>)
memory usage: <span class="hljs-number">73.3</span>+ KB
</code></pre>
<p>Perfect! We have 1,338 records with no missing values.</p>
<h2 id="heading-data-quality-check">Data Quality Check</h2>
<p>Let's verify that our dataset only contains adult records, since this is health insurance data:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Check if we have any children records</span>
print(df[df[<span class="hljs-string">'age'</span>] &lt; <span class="hljs-number">18</span>])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Empty DataFrame
Columns: [age, sex, bmi, children, smoker, region, charges, obese]
Index: []
</code></pre>
<p>Good! All records are for adults (age 18 and above), which makes sense for individual health insurance policies.</p>
<h2 id="heading-feature-engineering-creating-the-obesity-flag">Feature Engineering: Creating the Obesity Flag</h2>
<p>Now comes the exciting part - creating new features! Our first engineered feature will be an <strong>obesity flag</strong> based on medical guidelines.</p>
<p>According to the World Health Organization (WHO):</p>
<ul>
<li><p><strong>Overweight</strong>: BMI ≥ 25</p>
</li>
<li><p><strong>Obese</strong>: BMI ≥ 30</p>
</li>
</ul>
<p>Let's create this feature along with converting our categorical variables to numerical format:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert 'male' to 1 and 'female' to 0 using the .replace() method</span>
df[<span class="hljs-string">'sex'</span>] = df[<span class="hljs-string">'sex'</span>].replace({<span class="hljs-string">'male'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'female'</span>: <span class="hljs-number">0</span>}).astype(<span class="hljs-string">'int8'</span>)
df[<span class="hljs-string">'smoker'</span>] = df[<span class="hljs-string">'smoker'</span>].replace({<span class="hljs-string">'yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'no'</span>: <span class="hljs-number">0</span>}).astype(<span class="hljs-string">'int8'</span>)

<span class="hljs-comment"># Lets add a flag for obesity</span>
<span class="hljs-comment"># As per WHO [https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight]</span>
<span class="hljs-comment"># For adults, WHO defines overweight and obesity as follows:</span>
<span class="hljs-comment"># overweight is a BMI greater than or equal to 25; and</span>
<span class="hljs-comment"># obesity is a BMI greater than or equal to 30.</span>

<span class="hljs-comment"># Use np.where to apply the conditional logic:</span>
<span class="hljs-comment"># Condition: df['bmi'] &gt;= 30</span>
<span class="hljs-comment"># Value if True: 1</span>
<span class="hljs-comment"># Value if False: 0</span>

df[<span class="hljs-string">'obese'</span>] = np.where(df[<span class="hljs-string">'bmi'</span>] &gt;= <span class="hljs-number">30</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>).astype(<span class="hljs-string">'int8'</span>)

<span class="hljs-comment"># Print the modified DataFrame to show the result</span>
print(<span class="hljs-string">"\nDataFrame after converting 'male' to 1 and 'female' to 0:"</span>)
print(df)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">DataFrame after converting <span class="hljs-string">'male'</span> to <span class="hljs-number">1</span> <span class="hljs-keyword">and</span> <span class="hljs-string">'female'</span> to <span class="hljs-number">0</span>:
      age  sex     bmi  children  smoker     region      charges  obese
<span class="hljs-number">0</span>      <span class="hljs-number">19</span>    <span class="hljs-number">0</span>  <span class="hljs-number">27.900</span>         <span class="hljs-number">0</span>       <span class="hljs-number">1</span>  southwest  <span class="hljs-number">16884.92400</span>      <span class="hljs-number">0</span>
<span class="hljs-number">1</span>      <span class="hljs-number">18</span>    <span class="hljs-number">1</span>  <span class="hljs-number">33.770</span>         <span class="hljs-number">1</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">1725.55230</span>      <span class="hljs-number">1</span>
<span class="hljs-number">2</span>      <span class="hljs-number">28</span>    <span class="hljs-number">1</span>  <span class="hljs-number">33.000</span>         <span class="hljs-number">3</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">4449.46200</span>      <span class="hljs-number">1</span>
<span class="hljs-number">3</span>      <span class="hljs-number">33</span>    <span class="hljs-number">1</span>  <span class="hljs-number">22.705</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northwest  <span class="hljs-number">21984.47061</span>      <span class="hljs-number">0</span>
<span class="hljs-number">4</span>      <span class="hljs-number">32</span>    <span class="hljs-number">1</span>  <span class="hljs-number">28.880</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northwest   <span class="hljs-number">3866.85520</span>      <span class="hljs-number">0</span>
<span class="hljs-meta">... </span>  ...  ...     ...       ...     ...        ...          ...    ...
<span class="hljs-number">1333</span>   <span class="hljs-number">50</span>    <span class="hljs-number">1</span>  <span class="hljs-number">30.970</span>         <span class="hljs-number">3</span>       <span class="hljs-number">0</span>  northwest  <span class="hljs-number">10600.54830</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1334</span>   <span class="hljs-number">18</span>    <span class="hljs-number">0</span>  <span class="hljs-number">31.920</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northeast   <span class="hljs-number">2205.98080</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1335</span>   <span class="hljs-number">18</span>    <span class="hljs-number">0</span>  <span class="hljs-number">36.850</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">1629.83350</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1336</span>   <span class="hljs-number">21</span>    <span class="hljs-number">0</span>  <span class="hljs-number">25.800</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  southwest   <span class="hljs-number">2007.94500</span>      <span class="hljs-number">0</span>
<span class="hljs-number">1337</span>   <span class="hljs-number">61</span>    <span class="hljs-number">0</span>  <span class="hljs-number">29.070</span>         <span class="hljs-number">0</span>       <span class="hljs-number">1</span>  northwest  <span class="hljs-number">29141.36030</span>      <span class="hljs-number">0</span>

[<span class="hljs-number">1338</span> rows x <span class="hljs-number">8</span> columns]
</code></pre>
<p>Notice our new <strong>obese</strong> column! We've now converted the continuous BMI variable into a binary flag. This can sometimes help models capture non-linear relationships more effectively.</p>
<h3 id="heading-why-create-an-obesity-flag">Why Create an Obesity Flag?</h3>
<p>While we already have BMI as a continuous variable, creating a binary obesity flag can help because:</p>
<ul>
<li><p>Medical research shows obesity (BMI ≥ 30) is a distinct risk category</p>
</li>
<li><p>It captures a threshold effect that might be harder for linear models to detect</p>
</li>
<li><p>It's based on domain knowledge from healthcare</p>
</li>
</ul>
<h2 id="heading-visualizing-relationships">Visualizing Relationships</h2>
<p>Let's explore how age and charges are related with some visualizations. This helps us understand our data and discover potential new features.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore charges vs age data</span>
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> matplotlib <span class="hljs-keyword">import</span> pyplot <span class="hljs-keyword">as</span> plt
%matplotlib inline

df.plot(kind=<span class="hljs-string">'scatter'</span>, x=<span class="hljs-string">'age'</span>, y=<span class="hljs-string">'charges'</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)).set_title(<span class="hljs-string">"Charges vs Age"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426888898/71faca69-5b72-48d3-9713-428c372ca63b.png" alt class="image--center mx-auto" /></p>
<p>This scatter plot shows how insurance charges vary with age. Notice the distinct clusters - this suggests there might be important categorical factors affecting charges.</p>
<h2 id="heading-discovering-the-smoking-impact">Discovering the Smoking Impact</h2>
<p>Let's visualize how smoking status affects the relationship between age and charges:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore the impact of age and smoking</span>
g = sns.pairplot(data = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'charges'</span>]],
                 x_vars=[<span class="hljs-string">'age'</span>], y_vars=[<span class="hljs-string">'charges'</span>], aspect=<span class="hljs-number">1.5</span>, hue=<span class="hljs-string">'smoker'</span>)
g.fig.set_size_inches(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)
plt.title(<span class="hljs-string">"Impact of age and smoking on charges"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426918871/fa8d6c4d-a1ec-4505-82df-99ccf46f2bf8.png" alt class="image--center mx-auto" /></p>
<p>This visualization is revealing! We can see two distinct clusters:</p>
<ul>
<li><p><strong>Non-smokers (blue)</strong>: Lower charges that increase gradually with age</p>
</li>
<li><p><strong>Smokers (orange)</strong>: Significantly higher charges with steeper age-related increases</p>
</li>
</ul>
<p>This suggests that smoking has a major impact on insurance charges, and this impact might vary with age.</p>
<h2 id="heading-exploring-the-obesity-effect">Exploring the Obesity Effect</h2>
<p>Now let's examine how obesity affects the relationship:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore the impact of age and obesity on charges</span>
g = sns.pairplot(data = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'bmi'</span>,<span class="hljs-string">'obese'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'charges'</span>]],
                 x_vars=[<span class="hljs-string">'age'</span>], y_vars=[<span class="hljs-string">'charges'</span>], aspect=<span class="hljs-number">1.5</span>, hue=<span class="hljs-string">'obese'</span>)
g.fig.set_size_inches(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)
plt.title(<span class="hljs-string">"Impact of age and obesity on charges"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426945800/4364a7b8-5fb5-4568-854d-2560867f270e.png" alt class="image--center mx-auto" /></p>
<p>Obesity also shows a clear effect on insurance charges, though perhaps not as pronounced as smoking.</p>
<h2 id="heading-creating-an-interaction-feature">Creating an Interaction Feature</h2>
<p>Here's where feature engineering gets really powerful. We noticed that both smoking and obesity affect charges. But what about people who are <strong>both</strong> smokers and obese? This combination might have an amplified effect.</p>
<p>This is called an <strong>interaction feature</strong> - a new feature created by combining two or more existing features to capture their combined effect.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's create a new feature which represents the product of the smoker and obesity features</span>
df[<span class="hljs-string">'smoker_obese'</span>] = df[<span class="hljs-string">'smoker'</span>] * df[<span class="hljs-string">'obese'</span>]
print(<span class="hljs-string">"Number of customers who are both obese and smoke: "</span>, df[df.smoker_obese == <span class="hljs-number">1</span>].shape[<span class="hljs-number">0</span>])
print(<span class="hljs-string">"Total number of customers: "</span>, df.shape[<span class="hljs-number">0</span>])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Number of customers who are both obese <span class="hljs-keyword">and</span> smoke:  <span class="hljs-number">145</span>
Total number of customers:  <span class="hljs-number">1338</span>
</code></pre>
<p>About 10% of customers are both smokers and obese. This is a high-risk group that likely has significantly higher insurance charges.</p>
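<p>The 10% figure comes straight from the two counts above:</p>

```python
both = 145    # customers who are both smokers and obese
total = 1338  # all customers
share = both / total
print(f"{share:.1%}")  # 10.8%
```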
<h3 id="heading-what-is-an-interaction-feature">What is an Interaction Feature?</h3>
<p>An <strong>interaction feature</strong> captures the combined effect of two or more features. The mathematical operation here is multiplication:</p>
<ul>
<li><p>If someone is obese (1) AND a smoker (1): <code>smoker_obese = 1 × 1 = 1</code></p>
</li>
<li><p>If someone is only obese: <code>smoker_obese = 1 × 0 = 0</code></p>
</li>
<li><p>If someone is only a smoker: <code>smoker_obese = 0 × 1 = 0</code></p>
</li>
<li><p>If neither: <code>smoker_obese = 0 × 0 = 0</code></p>
</li>
</ul>
<p>This allows the model to assign a separate coefficient to this high-risk combination.</p>
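<p>The truth table above is just an elementwise product. A minimal pure-Python sketch:</p>

```python
smoker = [1, 1, 0, 0]
obese  = [1, 0, 1, 0]

# The interaction is 1 only when both flags are 1
smoker_obese = [s * o for s, o in zip(smoker, obese)]
print(smoker_obese)  # [1, 0, 0, 0]
```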
<h2 id="heading-preparing-features-for-training">Preparing Features for Training</h2>
<p>Now let's select our features for model training. Notice we're including our newly engineered features:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's create one DataFrame with the features and one with the target</span>
x = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'obese'</span>, <span class="hljs-string">'smoker_obese'</span>]]
y = df[<span class="hljs-string">'charges'</span>]
</code></pre>
<p>Our feature set now includes:</p>
<ul>
<li><p><strong>Original features</strong>: age, bmi, sex, children, smoker</p>
</li>
<li><p><strong>Engineered features</strong>: obese, smoker_obese</p>
</li>
</ul>
<h2 id="heading-training-the-improved-model">Training the Improved Model</h2>
<p>Let's train a linear regression model with our enhanced feature set:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's train the model</span>
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> linear_model

<span class="hljs-comment"># Create a new Linear Regression model</span>
lr = linear_model.LinearRegression()

<span class="hljs-comment"># Train the model</span>
lr.fit(x, y)

<span class="hljs-comment"># Print the coefficients</span>
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=[<span class="hljs-string">'Coefficient'</span>])
print(coeffs)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">               Coefficient
age             <span class="hljs-number">263.807602</span>
bmi              <span class="hljs-number">98.637188</span>
sex            <span class="hljs-number">-488.091970</span>
children        <span class="hljs-number">515.971652</span>
smoker        <span class="hljs-number">13431.633343</span>
obese          <span class="hljs-number">-805.123043</span>
smoker_obese  <span class="hljs-number">19734.622381</span>
</code></pre>
<h3 id="heading-understanding-the-new-coefficients">Understanding the New Coefficients</h3>
<p>Let's interpret what these coefficients tell us:</p>
<ul>
<li><p><strong>age (263.81)</strong>: Each additional year adds ~$264 to charges</p>
</li>
<li><p><strong>bmi (98.64)</strong>: Each BMI unit adds ~$99 to charges (note: much lower than before)</p>
</li>
<li><p><strong>sex (-488.09)</strong>: Males have ~$488 lower charges than females (interesting!)</p>
</li>
<li><p><strong>children (515.97)</strong>: Each child adds ~$516 to charges</p>
</li>
<li><p><strong>smoker (13,431.63)</strong>: Smoking adds a whopping ~$13,432 to charges!</p>
</li>
<li><p><strong>obese (-805.12)</strong>: The obesity flag alone shows a negative effect (because the interaction term captures the real impact)</p>
</li>
<li><p><strong>smoker_obese (19,734.62)</strong>: Being both a smoker AND obese adds an additional ~$19,735!</p>
</li>
</ul>
<p>The <strong>smoker_obese</strong> coefficient is the highest, confirming our hypothesis that this combination is especially costly.</p>
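<p>Using the rounded coefficients from the table above, the model's implied smoking premium depends on obesity status. A small sketch (these are hand-copied, rounded values, not a fresh model run):</p>

```python
smoker_coef = 13431.63
interaction_coef = 19734.62

# Implied extra charge for smoking, holding everything else fixed:
premium_non_obese = smoker_coef                 # obese = 0, interaction term drops out
premium_obese = smoker_coef + interaction_coef  # obese = 1, so the interaction kicks in
print(round(premium_non_obese, 2))  # 13431.63
print(round(premium_obese, 2))      # 33166.25
```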
<h2 id="heading-making-predictions">Making Predictions</h2>
<p>Let's use our improved model to make predictions:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's try to predict</span>
predictions = lr.predict(x)
print(predictions)

scores = pd.DataFrame({<span class="hljs-string">'Actual'</span>: y, <span class="hljs-string">'Predicted'</span>: predictions})
scores.head()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">[<span class="hljs-number">16316.56695109</span>  <span class="hljs-number">2422.88293888</span>  <span class="hljs-number">6016.95162578</span> ...  <span class="hljs-number">2698.805796</span>
  <span class="hljs-number">3205.41071655</span> <span class="hljs-number">27511.89173746</span>]

        Actual     Predicted
<span class="hljs-number">0</span>  <span class="hljs-number">16884.92400</span>  <span class="hljs-number">16316.566951</span>
<span class="hljs-number">1</span>   <span class="hljs-number">1725.55230</span>   <span class="hljs-number">2422.882939</span>
<span class="hljs-number">2</span>   <span class="hljs-number">4449.46200</span>   <span class="hljs-number">6016.951626</span>
<span class="hljs-number">3</span>  <span class="hljs-number">21984.47061</span>   <span class="hljs-number">5577.727872</span>
<span class="hljs-number">4</span>   <span class="hljs-number">3866.85520</span>   <span class="hljs-number">5923.004906</span>
</code></pre>
<p>Notice how much closer the predictions are to the actual values compared to our first model!</p>
<h2 id="heading-evaluating-the-improved-model">Evaluating the Improved Model</h2>
<p>Now for the moment of truth - let's see how much our feature engineering improved the model:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> metrics

print(<span class="hljs-string">'Root Mean Squared Error:'</span>, np.sqrt(metrics.mean_squared_error(y, predictions)))
print(<span class="hljs-string">'Mean Absolute Error:'</span>, metrics.mean_absolute_error(y, predictions))
print(<span class="hljs-string">'Mean Squared Error:'</span>, metrics.mean_squared_error(y, predictions))

print(<span class="hljs-string">"Average Cost:"</span>, y.mean())
print(<span class="hljs-string">"R-squared:"</span>, metrics.r2_score(y, predictions))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Root Mean Squared Error: <span class="hljs-number">4490.387801338095</span>
Mean Absolute Error: <span class="hljs-number">2460.035500296957</span>
Mean Squared Error: <span class="hljs-number">20163582.606405977</span>
Average Cost: <span class="hljs-number">13270.422265141257</span>
R-squared: <span class="hljs-number">0.8624047908410836</span>
</code></pre>
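<p>As a quick sanity check, RMSE is just the square root of MSE, so the two reported values should agree:</p>

```python
import math

mse = 20163582.606405977
rmse = math.sqrt(mse)
print(rmse)  # ≈ 4490.39
```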
<h3 id="heading-performance-comparison">Performance Comparison</h3>
<p>Let's compare our improved model with the original:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Metric</th><th>Original Model</th><th>Improved Model</th><th>Change</th></tr>
</thead>
<tbody>
<tr>
<td><strong>RMSE</strong></td><td>$11,336</td><td>$4,490</td><td>✅ 60% reduction</td></tr>
<tr>
<td><strong>MAE</strong></td><td>$8,982</td><td>$2,460</td><td>✅ 73% reduction</td></tr>
<tr>
<td><strong>R-squared</strong></td><td>0.123</td><td>0.862</td><td>✅ 601% increase</td></tr>
</tbody>
</table>
</div><h3 id="heading-what-this-means">What This Means</h3>
<p>Our improved model explains <strong>86.2%</strong> of the variance in insurance charges, compared to just <strong>12.3%</strong> before. This is a dramatic improvement!</p>
<ul>
<li><p><strong>RMSE dropped by 60%</strong>: Our predictions are now much more accurate</p>
</li>
<li><p><strong>MAE dropped by 73%</strong>: The average prediction error is just $2,460 instead of $8,982</p>
</li>
<li><p><strong>R-squared increased to 0.862</strong>: We now explain 86.2% of the variation in charges</p>
</li>
</ul>
<p>This demonstrates the enormous power of feature engineering!</p>
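<p>For intuition, R² compares the model's squared error against that of always predicting the mean. A toy sketch with made-up numbers (not the insurance data):</p>

```python
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 2))  # 0.98
```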
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ol>
<li><p><strong>Feature engineering is powerful</strong>: Simple feature engineering improved R² from 0.123 to 0.862</p>
</li>
<li><p><strong>Domain knowledge matters</strong>: Understanding obesity thresholds helped create meaningful features</p>
</li>
<li><p><strong>Interaction features capture combined effects</strong>: The <code>smoker_obese</code> feature was crucial</p>
</li>
<li><p><strong>Visualization guides feature creation</strong>: Plotting helped us discover the smoking and obesity patterns</p>
</li>
<li><p><strong>Small datasets benefit greatly from good features</strong>: With only 1,338 records, feature engineering was essential</p>
</li>
</ol>
<h2 id="heading-why-did-this-work-so-well">Why Did This Work So Well?</h2>
<p>Our feature engineering succeeded because:</p>
<ol>
<li><p><strong>Domain-driven</strong>: We used medical knowledge (BMI ≥ 30 for obesity) to create meaningful categories</p>
</li>
<li><p><strong>Captured non-linearity</strong>: The obesity flag helped the linear model capture threshold effects</p>
</li>
<li><p><strong>Interaction effects</strong>: The <code>smoker_obese</code> feature captured the amplified risk of combined factors</p>
</li>
<li><p><strong>Data-driven discovery</strong>: Visualization helped us identify which features to engineer</p>
</li>
</ol>
<h2 id="heading-next-steps">Next Steps</h2>
<p>To further improve this model, you could:</p>
<ol>
<li><p><strong>Create more interaction features</strong>: Try <code>age * smoker</code>, <code>bmi * age</code>, etc.</p>
</li>
<li><p><strong>Polynomial features</strong>: Create squared or cubed terms (age², bmi², etc.)</p>
</li>
<li><p><strong>Encode region</strong>: We excluded region - adding it might help</p>
</li>
<li><p><strong>Try other algorithms</strong>: Random Forest or Gradient Boosting might capture even more patterns</p>
</li>
<li><p><strong>Cross-validation</strong>: Use proper train/test splits to validate performance</p>
</li>
</ol>
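<p>As a sketch of the first two ideas, new columns are just arithmetic on existing ones. Shown here on plain lists with made-up values; with pandas you would apply the same operations to DataFrame columns:</p>

```python
ages = [19, 45, 61]
smoker = [1, 0, 1]

age_squared = [a ** 2 for a in ages]                # polynomial term
age_smoker = [a * s for a, s in zip(ages, smoker)]  # interaction term
print(age_squared)  # [361, 2025, 3721]
print(age_smoker)   # [19, 0, 61]
```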
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've seen firsthand how powerful feature engineering can be. By adding just two simple features (obesity flag and smoker-obesity interaction), we improved our model's R² from 0.123 to 0.862 - a massive improvement!</p>
<p>This tutorial demonstrates a key principle in machine learning: <strong>Better features often matter more than better algorithms</strong>. Before reaching for complex deep learning models, invest time in understanding your data and engineering meaningful features.</p>
<p>Remember the workflow:</p>
<ol>
<li><p><strong>Explore your data</strong> through visualization</p>
</li>
<li><p><strong>Apply domain knowledge</strong> to create meaningful features</p>
</li>
<li><p><strong>Test interaction effects</strong> between important variables</p>
</li>
<li><p><strong>Evaluate and iterate</strong> on your features</p>
</li>
</ol>
<p>In our next post, we'll explore train-test splits, cross-validation, and how to properly evaluate model performance to avoid overfitting.</p>
<hr />
<p><strong>What's Next?</strong> Stay tuned for our next post where we'll explore proper model validation techniques and introduce regularization!</p>
<p><em>Have questions about feature engineering? Feel free to reach out or leave a comment below.</em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Basics: Building Your First Simple Linear Regression Model]]></title><description><![CDATA[Introduction
Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: Linear Regression. We'll build a simple linear regression model to predict ins...]]></description><link>https://prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model</link><guid isPermaLink="true">https://prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Sun, 21 Sep 2025 14:30:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: <strong>Linear Regression</strong>. We'll build a simple linear regression model to predict insurance charges based on various demographic and health factors.</p>
<p>Linear regression is an excellent starting point for anyone learning machine learning because it's intuitive, interpretable, and forms the foundation for many more complex algorithms.</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this tutorial, you'll understand:</p>
<ul>
<li><p>How to prepare data for machine learning</p>
</li>
<li><p>The basics of linear regression</p>
</li>
<li><p>How to build and train a linear regression model</p>
</li>
<li><p>How to evaluate model performance</p>
</li>
<li><p>How to interpret model coefficients</p>
</li>
</ul>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>We'll be working with a <a target="_blank" href="https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv">health insurance dataset</a> that contains information about:</p>
<ul>
<li><p><strong>Age</strong>: Age of the individual</p>
</li>
<li><p><strong>Sex</strong>: Gender (male/female)</p>
</li>
<li><p><strong>BMI</strong>: Body Mass Index</p>
</li>
<li><p><strong>Children</strong>: Number of children/dependents</p>
</li>
<li><p><strong>Smoker</strong>: Whether the person smokes (yes/no)</p>
</li>
<li><p><strong>Region</strong>: Geographic region</p>
</li>
<li><p><strong>Charges</strong>: Medical insurance charges (our target variable)</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>First, let's import the necessary libraries and load our dataset:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> linear_model, metrics
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Load the insurance dataset</span>
df = pd.read_csv(<span class="hljs-string">'insurance.csv'</span>)
print(df.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
</code></pre>
<p>Let's examine the structure of our data:</p>
<pre><code class="lang-python">df.info()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
</code></pre>
<p>This gives us important information about our dataset:</p>
<ul>
<li><p><strong>1,338 entries</strong> (rows)</p>
</li>
<li><p><strong>7 columns</strong> with no missing values</p>
</li>
<li><p>Mix of numerical (age, bmi, charges) and categorical (sex, smoker, region) data</p>
</li>
</ul>
<h2 id="heading-data-preprocessing">Data Preprocessing</h2>
<p>Machine learning algorithms work with numerical data, so we need to convert categorical variables to numerical format. This process is called <strong>encoding</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert categorical variables to numerical</span>
df[<span class="hljs-string">'sex'</span>] = df[<span class="hljs-string">'sex'</span>].replace({<span class="hljs-string">'male'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'female'</span>: <span class="hljs-number">0</span>})
df[<span class="hljs-string">'smoker'</span>] = df[<span class="hljs-string">'smoker'</span>].replace({<span class="hljs-string">'yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'no'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># Print the modified DataFrame to show the result</span>
print(<span class="hljs-string">"\nDataFrame after converting 'male' to 1 and 'female' to 0:"</span>)
print(df)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">DataFrame after converting 'male' to 1 and 'female' to 0:
      age  sex     bmi  children  smoker     region      charges
0      19    0  27.900         0       0  southwest  16884.92400
1      18    1  33.770         1       1  southeast   1725.55230
2      28    1  33.000         3       1  southeast   4449.46200
3      33    1  22.705         0       1  northwest  21984.47061
4      32    1  28.880         0       1  northwest   3866.85520
...   ...  ...     ...       ...     ...        ...          ...
1333   50    1  30.970         3       1  northwest  10600.54830
1334   18    0  31.920         0       0  northeast   2205.98080
1335   18    0  36.850         0       0  southeast   1629.83350
1336   21    0  25.800         0       0  southwest   2007.94500
1337   61    0  29.070         0       0  northwest  29141.36030

[1338 rows x 7 columns]
</code></pre>
<p>Perfect! Now we can see that:</p>
<ul>
<li><p><strong>Sex</strong>: <code>female</code> = 0, <code>male</code> = 1</p>
</li>
<li><p><strong>Smoker</strong>: <code>no</code> = 0, <code>yes</code> = 1</p>
</li>
</ul>
<h2 id="heading-exploratory-data-analysis">Exploratory Data Analysis</h2>
<p>Before building our model, it's crucial to understand the relationships in our data. Visualization helps us identify patterns and potential issues.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create pairplot to visualize relationships</span>
sns.pairplot(df)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758473420169/5500cbf7-b5ee-4911-b331-02129be80ceb.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python"><span class="hljs-comment"># Focus on relationships with our target variable (charges)</span>
sns.pairplot(data=df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'charges'</span>]],
             x_vars=[<span class="hljs-string">'age'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'sex'</span>],
             y_vars=<span class="hljs-string">'charges'</span>,
             aspect=<span class="hljs-number">1</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758473505252/d8f4e902-7222-48d1-9e20-2f58f13f2c28.png" alt class="image--center mx-auto" /></p>
<p>These visualizations help us understand:</p>
<ul>
<li><p>Which variables might be good predictors of insurance charges</p>
</li>
<li><p>Whether there are any obvious outliers</p>
</li>
<li><p>The distribution of our data</p>
</li>
</ul>
<h2 id="heading-preparing-the-data-for-machine-learning">Preparing the Data for Machine Learning</h2>
<p>In machine learning, we separate our data into:</p>
<ul>
<li><p><strong>Features (X)</strong>: The input variables we use to make predictions</p>
</li>
<li><p><strong>Target (y)</strong>: The variable we want to predict</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Select features (first 5 columns excluding region)</span>
x = df.iloc[:, :<span class="hljs-number">5</span>]  <span class="hljs-comment"># age, sex, bmi, children, smoker</span>
y = df.iloc[:, <span class="hljs-number">6</span>]   <span class="hljs-comment"># charges</span>

print(x.head())
print(y.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">   age  sex     bmi  children  smoker
0   19    0  27.900         0       0
1   18    1  33.770         1       1
2   28    1  33.000         3       1
3   33    1  22.705         0       1
4   32    1  28.880         0       1

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64
</code></pre>
<h2 id="heading-building-the-linear-regression-model">Building the Linear Regression Model</h2>
<p>Now for the exciting part - building our machine learning model!</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create and train the linear regression model</span>
lr = linear_model.LinearRegression()
lr.fit(x, y)

<span class="hljs-comment"># Display the coefficients</span>
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=[<span class="hljs-string">'Coefficient'</span>])
coeffs
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">          Coefficient
age        241.263511
sex        660.859891
bmi        326.761491
children   533.168130
smoker     660.859891
</code></pre>
<h3 id="heading-understanding-the-coefficients">Understanding the Coefficients</h3>
<p>The coefficients tell us how much each feature influences the insurance charges, holding the other features constant:</p>
<ul>
<li><p><strong>Age (241.26)</strong>: For each additional year of age, insurance charges increase by ~$241</p>
</li>
<li><p><strong>Sex (660.86)</strong>: Being male (vs female) increases charges by ~$661</p>
</li>
<li><p><strong>BMI (326.76)</strong>: Each unit increase in BMI adds ~$327 to charges</p>
</li>
<li><p><strong>Children (533.17)</strong>: Each additional child increases charges by ~$533</p>
</li>
<li><p><strong>Smoker (660.86)</strong>: Being a smoker increases charges by ~$661</p>
</li>
</ul>
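<p>To make the coefficient story concrete: a linear model's prediction is simply the intercept (<code>lr.intercept_</code>) plus each coefficient multiplied by its feature value. A minimal sketch, using synthetic stand-in data since all it needs is a fitted model, confirms that <code>lr.predict</code> is exactly that weighted sum:</p>
<pre><code class="lang-python">import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the five insurance features
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 5))          # age, sex, bmi, children, smoker
y = rng.uniform(1000, 20000, size=10)

lr = LinearRegression().fit(X, y)

# Prediction = intercept + sum(coefficient * feature value)
manual = lr.intercept_ + X @ lr.coef_
print(np.allclose(manual, lr.predict(X)))    # True
</code></pre>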
<h2 id="heading-making-predictions">Making Predictions</h2>
<p>Let's use our trained model to make predictions:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Make predictions on our training data</span>
predictions = lr.predict(x)
print(predictions)

<span class="hljs-comment"># Compare actual vs predicted values</span>
scores = pd.DataFrame({<span class="hljs-string">'Actual'</span>: y, <span class="hljs-string">'Predicted'</span>: predictions})
scores.head()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">[ 6240.68269989  9772.39705015 12999.76207347 ...  8923.93452889
  6037.01059213 16756.06111267]

        Actual     Predicted
0  16884.92400   6240.682700
1   1725.55230   9772.397050
2   4449.46200  12999.762073
3  21984.47061   9242.565695
4   3866.85520  11019.054388
</code></pre>
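<p>One quick way to make the actual-vs-predicted comparison more readable is to add an explicit error column. A small sketch using just the first three rows printed above (the values are copied from that output):</p>
<pre><code class="lang-python">import pandas as pd

# First three actual/predicted pairs from the output above
scores = pd.DataFrame({
    "Actual":    [16884.924, 1725.552, 4449.462],
    "Predicted": [6240.683, 9772.397, 12999.762],
})

# An explicit error column makes the worst predictions easy to spot
scores["Error"] = scores["Actual"] - scores["Predicted"]
print(scores["Error"].abs().idxmax())   # row 0 is the largest miss here
</code></pre>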
<h2 id="heading-evaluating-model-performance">Evaluating Model Performance</h2>
<p>It's crucial to evaluate how well our model performs. We'll use several metrics:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate performance metrics</span>
print(<span class="hljs-string">'Root Mean Squared Error:'</span>, np.sqrt(metrics.mean_squared_error(y, predictions)))
print(<span class="hljs-string">'Mean Absolute Error:'</span>, metrics.mean_absolute_error(y, predictions))
print(<span class="hljs-string">'Mean Squared Error:'</span>, metrics.mean_squared_error(y, predictions))

print(<span class="hljs-string">"Average Cost:"</span>, y.mean())
print(<span class="hljs-string">"R-squared:"</span>, metrics.r2_score(y, predictions))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Root Mean Squared Error: 11336.133773688362
Mean Absolute Error: 8982.350383484953
Mean Squared Error: 128507928.93495792
Average Cost: 13270.422265141257
R-squared: 0.12306876681889345
</code></pre>
<h3 id="heading-understanding-the-metrics">Understanding the Metrics</h3>
<ul>
<li><p><strong>RMSE (11,336)</strong>: The typical prediction error is about $11,336, with large misses penalized more heavily than small ones</p>
</li>
<li><p><strong>MAE (8,982)</strong>: The average absolute error is about $8,982</p>
</li>
<li><p><strong>R-squared (0.123)</strong>: Our model explains about 12.3% of the variance in insurance charges</p>
</li>
</ul>
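<p>These metrics are easy to compute directly from their definitions, which is a useful sanity check on what sklearn is doing under the hood. A plain-NumPy sketch:</p>
<pre><code class="lang-python">import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": mae, "r2": r2}

# Always predicting the mean of y gives R^2 = 0; a perfect fit gives R^2 = 1
m = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(m["r2"])   # 0.0
</code></pre>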
<h3 id="heading-what-does-this-mean">What Does This Mean?</h3>
<p>An R-squared of 0.123 means our simple model only explains about 12% of the variation in insurance charges. This suggests that:</p>
<ol>
<li><p><strong>The model is quite basic</strong> - there's room for improvement</p>
</li>
<li><p><strong>Important features might be missing</strong> - perhaps we need more variables</p>
</li>
<li><p><strong>The relationship might not be purely linear</strong> - we might need more sophisticated models</p>
</li>
</ol>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ol>
<li><p><strong>Linear regression is interpretable</strong>: We can easily understand how each feature affects the outcome</p>
</li>
<li><p><strong>Data preprocessing is crucial</strong>: Converting categorical variables to numerical format is essential</p>
</li>
<li><p><strong>Visualization helps</strong>: Exploring data relationships guides model building</p>
</li>
<li><p><strong>Model evaluation is important</strong>: Metrics help us understand model performance</p>
</li>
<li><p><strong>Simple models are a good starting point</strong>: Even basic models provide valuable insights</p>
</li>
</ol>
<h2 id="heading-next-steps">Next Steps</h2>
<p>To improve this model, you could:</p>
<ol>
<li><p><strong>Feature engineering</strong>: Create new features or transform existing ones</p>
</li>
<li><p><strong>Include more variables</strong>: Add the 'region' variable after proper encoding</p>
</li>
<li><p><strong>Try different algorithms</strong>: Random Forest, Support Vector Machines, etc.</p>
</li>
<li><p><strong>Handle outliers</strong>: Identify and address unusual data points</p>
</li>
<li><p><strong>Cross-validation</strong>: Use better evaluation techniques</p>
</li>
</ol>
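<p>Two of those steps, encoding <code>region</code> and cross-validation, fit in a few lines. Here is a sketch on a synthetic stand-in DataFrame (the column names mirror the dataset, but the generated values and coefficients are made up for illustration):</p>
<pre><code class="lang-python">import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurance DataFrame
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.uniform(18, 40, n),
    "smoker": rng.integers(0, 2, n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * df["smoker"] + rng.normal(0, 2000, n))

# Step 2: one-hot encode 'region' so it can enter a linear model
X = pd.get_dummies(df.drop(columns="charges"), columns=["region"], drop_first=True)
y = df["charges"]

# Step 5: 5-fold cross-validated R^2 instead of scoring on the training data
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Mean CV R^2:", round(scores.mean(), 3))
</code></pre>
<p>Cross-validated scores are the honest version of the R² we computed earlier: scoring a model on the same rows it was trained on tends to flatter it.</p>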
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've built your first machine learning model using linear regression. While this simple model has limitations (R² of 0.123), it demonstrates the fundamental machine learning workflow:</p>
<ol>
<li><p><strong>Data collection and exploration</strong></p>
</li>
<li><p><strong>Data preprocessing</strong></p>
</li>
<li><p><strong>Model training</strong></p>
</li>
<li><p><strong>Prediction and evaluation</strong></p>
</li>
</ol>
<p>This foundation will serve you well as you explore more advanced machine learning techniques. In our next post, we'll explore how to improve this model and introduce more sophisticated algorithms.</p>
<hr />
<p><strong>What's Next?</strong> Stay tuned for our next post where we'll explore multiple linear regression with feature engineering and better evaluation techniques!</p>
<p><em>Have questions about this tutorial? Feel free to reach out or leave a comment below.</em></p>
]]></content:encoded></item></channel></rss>