<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[PraHari Tech]]></title><description><![CDATA[PraHari Tech]]></description><link>https://prahari.net</link><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 10:49:43 GMT</lastBuildDate><atom:link href="https://prahari.net/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Giving My Robot Its First Sense: HC-SR04 Ultrasonic on Arduino UNO Q]]></title><description><![CDATA[In the last post, I got the Arduino UNO Q working from the command line — blink LED, SSH, deploy apps.
Now it's time to give the board its first real input: distance sensing with an HC-SR04 ultrasonic]]></description><link>https://prahari.net/giving-my-robot-its-first-sense-hc-sr04-ultrasonic-on-arduino-uno-q</link><guid isPermaLink="true">https://prahari.net/giving-my-robot-its-first-sense-hc-sr04-ultrasonic-on-arduino-uno-q</guid><category><![CDATA[arduino]]></category><category><![CDATA[uno-q]]></category><category><![CDATA[hc-sr04]]></category><category><![CDATA[Ultrasonic]]></category><category><![CDATA[sensors]]></category><category><![CDATA[zephyr]]></category><category><![CDATA[iot]]></category><category><![CDATA[robotics]]></category><category><![CDATA[elder care]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Wed, 01 Apr 2026 07:20:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/642589c67e2a99b3d04a6166/3c5e78fe-1974-4d17-a0c6-90f37f4259f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a href="https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way">last post</a>, I got the Arduino UNO Q working from the command line — blink LED, SSH, deploy apps.</p>
<p>Now it's time to give the board its first real input: distance sensing with an HC-SR04 ultrasonic sensor. This is the sensor that will eventually let my eldercare robot detect walls, furniture, and doorways as it patrols a home.</p>
<p>But first — can <code>pulseIn()</code> even work on Zephyr RTOS?</p>
<h2>Why This Matters</h2>
<p>HomeGuard Parivaar needs to navigate rooms autonomously. That means obstacle detection. The HC-SR04 is the cheapest, most reliable way to measure distance — if it works on the UNO Q's STM32 MCU running Zephyr.</p>
<p>The interesting part isn't the sensor itself (it's well-documented everywhere). It's the <strong>pattern</strong>:</p>
<ul>
<li><p>The MCU reads the sensor in its own tight loop</p>
</li>
<li><p>The Python side on the MPU polls for data via Bridge whenever it needs it</p>
</li>
<li><p>Neither side blocks the other</p>
</li>
</ul>
<p>This decoupled architecture is how all of the robot's sensors will work.</p>
<h2>What We're Building</h2>
<p>By the end of this post, you'll have:</p>
<ul>
<li><p>An HC-SR04 wired to the UNO Q's MCU pins</p>
</li>
<li><p>A sketch that reads distance continuously and exposes it via Bridge</p>
</li>
<li><p>A Python script that polls and prints distance readings</p>
</li>
<li><p>Confidence that timing-sensitive Arduino functions work under Zephyr</p>
</li>
</ul>
<h2>Hardware You'll Need</h2>
<table>
<thead>
<tr>
<th>Component</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td>Arduino UNO Q</td>
<td>CLI setup complete (<a href="https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way">see Part 1</a>)</td>
</tr>
<tr>
<td>HC-SR04 ultrasonic sensor</td>
<td>~$2, widely available</td>
</tr>
<tr>
<td>4x jumper wires (F-M)</td>
<td>VCC, GND, TRIG, ECHO</td>
</tr>
</tbody></table>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-hcsr04-sensor.jpg" alt="" style="display:block;margin:0 auto" />

<p>I used a breadboard to keep the wiring tidy.</p>
<h2>Wiring</h2>
<table>
<thead>
<tr>
<th>HC-SR04 Pin</th>
<th>UNO Q Pin</th>
</tr>
</thead>
<tbody><tr>
<td>VCC</td>
<td>5V</td>
</tr>
<tr>
<td>GND</td>
<td>GND</td>
</tr>
<tr>
<td>TRIG</td>
<td>D2</td>
</tr>
<tr>
<td>ECHO</td>
<td>D3</td>
</tr>
</tbody></table>
<p><strong>No voltage divider needed.</strong> The STM32U585's digital pins are 5V tolerant (except A0/A1 — a detail from Lab 1.1 that saved me a resistor here).</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-hcsr04-wiring.jpg" alt="" style="display:block;margin:0 auto" />

<h2>The App Structure</h2>
<p>Same pattern as the blink app from Part 1 — an <code>app.yaml</code>, a sketch folder, and a Python folder:</p>
<pre><code class="language-plaintext">q-sonar/
├── app.yaml
├── sketch/
│   ├── sketch.ino
│   └── sketch.yaml
└── python/
    ├── main.py
    └── requirements.txt
</code></pre>
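<p>The <code>app.yaml</code> can stay as minimal as the blink app's. A plausible version, reusing the same fields from Part 1 (the name and description here are my placeholders, not from the actual project):</p>

```yaml
name: Q Sonar
description: "HC-SR04 distance readings via Bridge"
version: "1.0.0"
ports: []
bricks: []
```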
<h2>The MCU Sketch</h2>
<p>The MCU does two things:</p>
<ul>
<li><p>Reads the sensor in <code>loop()</code> every 100ms</p>
</li>
<li><p>Exposes the last reading via a Bridge function</p>
</li>
</ul>
<pre><code class="language-cpp">#include "Arduino_RouterBridge.h"

const int TRIG_PIN = 2;
const int ECHO_PIN = 3;

float last_distance_cm = -1.0;

void setup() {
    pinMode(TRIG_PIN, OUTPUT);
    pinMode(ECHO_PIN, INPUT);
    Bridge.begin();
    Bridge.provide("get_distance", get_distance);
}

void loop() {
    last_distance_cm = read_distance();
    delay(100);
}

float read_distance() {
    digitalWrite(TRIG_PIN, LOW);
    delayMicroseconds(2);
    digitalWrite(TRIG_PIN, HIGH);
    delayMicroseconds(10);
    digitalWrite(TRIG_PIN, LOW);

    long duration = pulseIn(ECHO_PIN, HIGH, 30000);  // 30ms timeout
    if (duration == 0) {
        return -1.0;  // No echo — out of range
    }
    return duration * 0.0343 / 2.0;  // cm
}

float get_distance() {
    return last_distance_cm;
}
</code></pre>
<p>A few things to note:</p>
<ul>
<li><p><code>pulseIn()</code> <strong>with a 30ms timeout</strong> — this caps the max range at ~5 meters (plenty for indoor rooms) and prevents the sketch from hanging if nothing echoes back.</p>
</li>
<li><p><code>0.0343 / 2.0</code> — speed of sound in cm/us, divided by 2 because the pulse travels to the object and back.</p>
</li>
<li><p><strong>The</strong> <code>loop()</code> <strong>is NOT empty this time.</strong> Unlike the blink app where the MCU just waited for Bridge calls, here the MCU actively samples the sensor. The Bridge function <code>get_distance()</code> just returns the latest cached reading.</p>
</li>
</ul>
<h2>The Python Side</h2>
<pre><code class="language-python">from arduino.app_utils import *
import time

def loop():
    distance = Bridge.call("get_distance")
    if distance &lt; 0:
        print("No echo — out of range")
    else:
        print(f"Distance: {distance:.1f} cm")
    time.sleep(1)

App.run(user_loop=loop)
</code></pre>
<p>The Python side polls once per second. The MCU samples 10x per second.</p>
<p>This means the Python side always gets a fresh reading without needing to worry about sensor timing.</p>
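<p>Ultrasonic readings are noisy (the jumpy log values below show it), so the host side may want light filtering before acting on a single value. A median-of-N helper is a cheap, common choice; here's a sketch with an injectable read function so it isn't tied to Bridge (the helper name is mine):</p>

```python
import statistics

def median_distance(read_fn, samples=5):
    """Call read_fn() several times, drop -1 (no-echo) readings,
    and return the median, or -1.0 if every sample timed out."""
    readings = [d for d in (read_fn() for _ in range(samples)) if d >= 0]
    return statistics.median(readings) if readings else -1.0

# On the board this would be something like:
#   distance = median_distance(lambda: Bridge.call("get_distance"))
```

<p>A median tolerates occasional spikes better than a mean, which matters when one stray echo off a corner can return a wildly wrong distance.</p>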
<h2>Deploy and Run</h2>
<pre><code class="language-bash"># Copy app to the board
ssh arduino-2gb 'mkdir -p ~/ArduinoApps/q_sonar'
scp -r q-sonar/* arduino-2gb:~/ArduinoApps/q_sonar/

# Start it
ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/q_sonar'
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-app-start.png" alt="App compile, flash, and start" style="display:block;margin:0 auto" />

<p>Check the logs:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/q_sonar'
</code></pre>
<pre><code class="language-plaintext">Distance: 46.3 cm
Distance: 137.2 cm
Distance: 44.6 cm
Distance: 48.0 cm
Distance: 17.1 cm
Distance: 138.9 cm
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/002/lab-1.3-app-logs.png" alt="App logs showing distance readings" style="display:block;margin:0 auto" />

<p>Move your hand in front of the sensor — near readings (~17 cm) and far readings (~137 cm to the wall) both respond correctly.</p>
<h2>What Surprised Me</h2>
<p><strong>1.</strong> <code>pulseIn()</code> <strong>works perfectly on Zephyr.</strong> This was my main concern going in. Timing-sensitive functions can behave unpredictably under an RTOS, but the Arduino Zephyr core handles it cleanly. No jitter, no missed pulses.</p>
<p><strong>2. I wired it wrong first.</strong> TRIG and ECHO were swapped. The sketch deployed fine, but every reading came back as "No echo — out of range."</p>
<p>I spent a few minutes combing through the code for a Zephyr compatibility issue before realizing the real problem was two swapped wires. Check your wiring before debugging your code.</p>
<p><strong>3. No external libraries needed.</strong> HC-SR04 only uses <code>digitalRead</code>, <code>digitalWrite</code>, and <code>pulseIn</code> — all built into the Arduino core for Zephyr. The only dependency is <code>Arduino_RouterBridge</code> for the Bridge pattern.</p>
<p><strong>4. The MCU loop + Bridge pattern is the right architecture.</strong> The MCU samples at its own pace. The Python side reads when it needs to. Neither blocks the other.</p>
<p>This is exactly how the robot will work — the MCU manages real-time sensor reads, the MPU handles decision logic. One pattern, many sensors.</p>
<h2>What's Next</h2>
<p>The sensor works. The Bridge pattern works.</p>
<p>Next up: connecting a USB webcam to the MPU's Linux side — giving the robot eyes to go with its sonar. After that, bridging sensor data and camera feeds together for the robot's perception layer.</p>
<p>Follow along as I build an eldercare robot, one sensor at a time.</p>
<hr />
<p><em>This is part of my journey building</em> <em>HomeGuard Parivaar</em> <em>— an autonomous eldercare robot for Indian families, built with</em> <a href="https://store.arduino.cc/products/uno-q"><em>Arduino UNO Q.</em></a></p>
<p><em>This is a hobby project and I'm learning by building. If you have suggestions, corrections, or criticism — I'd genuinely love to hear it.</em></p>
<p><em>Co-authored with</em> <a href="https://claude.com/product/claude-code"><em>Claude Code</em></a> <em>(Anthropic) — my AI pair-programming partner for this build. Cover image generated with</em> <a href="https://gemini.google.com"><em>Gemini</em></a> <em>(Google).</em></p>
]]></content:encoded></item><item><title><![CDATA[Arduino UNO Q is NOT a Regular Arduino: What I Learned the Hard Way]]></title><description><![CDATA[If you just got an Arduino UNO Q and tried to use it like a classic Arduino, you probably hit a wall. I did. Here's the story of how Serial.println() taught me that the UNO Q is a fundamentally differ]]></description><link>https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way</link><guid isPermaLink="true">https://prahari.net/arduino-uno-q-is-not-a-regular-arduino-what-i-learned-the-hard-way</guid><category><![CDATA[arduino]]></category><category><![CDATA[uno-q]]></category><category><![CDATA[qualcomm]]></category><category><![CDATA[cli]]></category><category><![CDATA[embedded linux]]></category><category><![CDATA[iot]]></category><category><![CDATA[robotics]]></category><category><![CDATA[elder care]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Sun, 29 Mar 2026 14:03:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/642589c67e2a99b3d04a6166/c911a00d-4989-4554-b6d9-5471413b8c5b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you just got an Arduino UNO Q and tried to use it like a classic Arduino, you probably hit a wall. I did. Here's the story of how <code>Serial.println()</code> taught me that the UNO Q is a fundamentally different kind of board — and how to actually develop for it from the command line.</p>
<h2>Why This Matters</h2>
<p>I'm building HomeGuard Parivaar — an autonomous home health robot for Indian families managing eldercare from a distance. The Arduino UNO Q is the brain: its dual-processor architecture lets me run ML models on the Linux side while controlling motors and sensors from the Arduino side.</p>
<p>But before I could build anything, I had to understand how this board actually works. The official docs point you toward App Lab (the GUI editor). I wanted to use the CLI — VS Code, Claude Code, terminal workflows. Getting there took some wrong turns.</p>
<h2>What We're Building</h2>
<p>By the end of this post, you'll have:</p>
<ul>
<li><p><code>arduino-cli</code> installed with the UNO Q Zephyr core</p>
</li>
<li><p>SSH access to the board's Linux side</p>
</li>
<li><p>A working blink app deployed via the command line</p>
</li>
<li><p>An understanding of why the UNO Q needs a completely different development model</p>
</li>
</ul>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-arduino-uno-q-boot-sequence.gif" alt="" style="display:block;margin:0 auto" />

<h2>Hardware You'll Need</h2>
<table>
<thead>
<tr>
<th>Component</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td>Arduino UNO Q (2GB or 4GB)</td>
<td>Must complete <a href="https://docs.arduino.cc/tutorials/uno-q/user-manual/#first-use">first-boot setup</a> via App Lab first</td>
</tr>
<tr>
<td>USB-C data cable</td>
<td><strong>Must be a data cable</strong>, not charge-only</td>
</tr>
<tr>
<td>WiFi network</td>
<td>Board connects via WiFi for SSH access</td>
</tr>
</tbody></table>
<p><strong>Before you start:</strong> If you haven't set up your UNO Q yet, follow the <a href="https://docs.arduino.cc/tutorials/uno-q/user-manual/#first-use">First Use guide</a> to set your password, connect to WiFi, and update to the latest firmware. The CLI workflow in this post assumes your board is already initialized and on your network.</p>
<h2>Step 1: Install arduino-cli</h2>
<pre><code class="language-bash">curl -fsSL https://raw.githubusercontent.com/arduino/arduino-cli/master/install.sh | BINDIR=~/bin sh
export PATH="$HOME/bin:$PATH"  # Add to .bashrc for permanence
</code></pre>
<p>Initialize and install the UNO Q core:</p>
<pre><code class="language-bash">arduino-cli config init
arduino-cli core update-index
arduino-cli core install arduino:zephyr
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">arduino-cli board listall | grep "UNO Q"
# Arduino UNO Q    arduino:zephyr:unoq
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-board-listall.png" alt="" style="display:block;margin:0 auto" />

<h2>Step 2: The Classic Approach (and Why It Fails)</h2>
<p>If you're coming from Arduino UNO/Nano/Mega, your instinct is:</p>
<pre><code class="language-cpp">void setup() {
  Serial.begin(115200);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.println("Hello from UNO Q!");
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  Serial.println("LED ON");
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  Serial.println("LED OFF");
  delay(1000);
}
</code></pre>
<p>Compile and upload:</p>
<pre><code class="language-bash">arduino-cli compile --fqbn arduino:zephyr:unoq ./blink-test/
arduino-cli upload -p /dev/ttyACM0 --fqbn arduino:zephyr:unoq ./blink-test/
</code></pre>
<p>It compiles. It uploads. You open the serial monitor... <strong>nothing</strong>. No output. Maybe the LED blinks, maybe it doesn't.</p>
<p>What went wrong?</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-serial-monitor-empty.png" alt="" style="display:block;margin:0 auto" />

<h2>The Dual-Brain Architecture</h2>
<p>The UNO Q isn't a microcontroller with USB. It's <strong>two processors on one board</strong>:</p>
<table>
<thead>
<tr>
<th></th>
<th>MPU (Linux Brain)</th>
<th>MCU (Arduino Brain)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Chip</strong></td>
<td>Qualcomm QRB2210</td>
<td>ST STM32U585</td>
</tr>
<tr>
<td><strong>CPU</strong></td>
<td>4x Cortex-A53 @ 2.0 GHz</td>
<td>Cortex-M33 @ 160 MHz</td>
</tr>
<tr>
<td><strong>OS</strong></td>
<td>Debian Linux</td>
<td>Zephyr RTOS</td>
</tr>
<tr>
<td><strong>RAM</strong></td>
<td>2GB or 4GB</td>
<td>786 KB</td>
</tr>
<tr>
<td><strong>Manages</strong></td>
<td>WiFi, USB, camera, AI/ML, Python</td>
<td>GPIO, sensors, motors, PWM</td>
</tr>
</tbody></table>
<p>They talk to each other via <strong>Arduino Bridge</strong> — an RPC layer. And here's the critical detail:</p>
<p><strong>The USB-C port is managed by the MPU (Linux side), not the MCU.</strong></p>
<p>So when you call <code>Serial.println()</code> on the MCU, it writes to the hardware UART on pins D0/D1 — not to USB. To get output over USB, you need the <code>Monitor</code> object, which routes through the Bridge to the MPU. But the Bridge only works when the MPU is running its orchestration service.</p>
<p>When we called <code>Bridge.begin()</code> without the MPU side running, the sketch just hung. No blink, no serial, nothing.</p>
<h2>Step 3: The Correct Way — App-Based Development</h2>
<p>On the UNO Q, a project is an <strong>App</strong> with two halves:</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-structure.png" alt="" style="display:block;margin:0 auto" />

<p>The MCU sketch <strong>registers functions</strong> that the Python script can call:</p>
<p><strong>sketch/sketch.ino:</strong></p>
<pre><code class="language-cpp">#include "Arduino_RouterBridge.h"

void setup() {
    pinMode(LED_BUILTIN, OUTPUT);
    Bridge.begin();
    Bridge.provide("set_led_state", set_led_state);
}

void loop() {
}

void set_led_state(bool state) {
    digitalWrite(LED_BUILTIN, state ? LOW : HIGH);  // Active-low!
}
</code></pre>
<p>The Python script on the MPU <strong>drives the logic</strong>:</p>
<p><strong>python/main.py:</strong></p>
<pre><code class="language-python">from arduino.app_utils import *
import time

led_state = False

def loop():
    global led_state
    time.sleep(1)
    led_state = not led_state
    Bridge.call("set_led_state", led_state)
    print(f"LED {'ON' if led_state else 'OFF'}")

App.run(user_loop=loop)
</code></pre>
<p><strong>sketch/sketch.yaml:</strong></p>
<pre><code class="language-yaml">profiles:
  default:
    fqbn: arduino:zephyr:unoq
    platforms:
      - platform: arduino:zephyr
    libraries:
      - Arduino_RouterBridge (0.4.0)
      - Arduino_RPClite (0.2.1)
      - MsgPack (0.4.2)
      - DebugLog (0.8.4)
      - ArxContainer (0.7.0)
      - ArxTypeTraits (0.3.2)
default_profile: default
</code></pre>
<p><strong>app.yaml:</strong></p>
<pre><code class="language-yaml">name: LED Blink Test
description: "Simple LED blink via Bridge"
version: "1.0.0"
ports: []
bricks: []
</code></pre>
<h2>Step 4: SSH In and Deploy</h2>
<p>First, find your board's IP (from your router or App Lab). Then set up SSH:</p>
<pre><code class="language-bash">ssh arduino@&lt;YOUR_BOARD_IP&gt;
# Enter the password you set during first-boot setup
</code></pre>
<p>I added an SSH key and config alias so I can just do:</p>
<pre><code class="language-bash">ssh arduino-2gb
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-ssh-login.png" alt="" style="display:block;margin:0 auto" />

<p>Deploy the app:</p>
<pre><code class="language-bash"># From your host machine
ssh arduino-2gb 'mkdir -p ~/ArduinoApps/q_blink'
scp -r q-blink/* arduino-2gb:~/ArduinoApps/q_blink/
</code></pre>
<p>Start it:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/q_blink'
</code></pre>
<p>The first run downloads libraries, compiles the sketch <strong>on the board itself</strong> (yes, the 4-core Cortex-A53 compiles your Arduino sketch), flashes the MCU via SWD, and starts the Python container. After about 30 seconds:</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-start.png" alt="" style="display:block;margin:0 auto" />

<p>Check the logs:</p>
<pre><code class="language-bash">ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/q_blink'
</code></pre>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-app-logs.png" alt="" style="display:block;margin:0 auto" />

<p>The LED blinks. The logs flow. It works.</p>
<img src="https://hashnode-media.s3.amazonaws.com/homeguard/blog/media/001/001-led-blinking.gif" alt="" style="display:block;margin:0 auto" />

<h2>What Surprised Me</h2>
<p><strong>1.</strong> <code>Serial.println()</code> <strong>doesn't go to USB.</strong> On classic Arduino, Serial = USB. On UNO Q, Serial = hardware UART pins D0/D1. This tripped me up for an hour.</p>
<p><strong>2. The MCU sketch's</strong> <code>loop()</code> <strong>can be empty.</strong> The Python side drives the timing. The MCU just registers callbacks and waits. This is a paradigm shift — the MCU is a <em>service provider</em>, not the main loop.</p>
<p><strong>3. Compilation happens on-board.</strong> Your host machine doesn't need the Zephyr toolchain for deployment. The board's Linux side has <code>arduino-cli</code> and compiles locally.</p>
<p><strong>4. Python runs containerized.</strong> Docker compose manages the Python environment on the board. <code>requirements.txt</code> dependencies are auto-installed.</p>
<p><strong>5. The RGB LEDs are active-low.</strong> <code>digitalWrite(LED_BUILTIN, LOW)</code> turns the LED <em>on</em>. Classic Arduino gotcha, amplified by the UNO Q's unfamiliarity.</p>
<p><strong>6. Storage is tight.</strong> The 2GB variant has ~3GB free on a 9.8GB root partition. ML models and multiple apps will eat into this quickly.</p>
<h2>CLI Cheat Sheet</h2>
<pre><code class="language-bash"># Deploy
scp -r myapp/* arduino-2gb:~/ArduinoApps/myapp/

# Start / stop
ssh arduino-2gb 'arduino-app-cli app start ~/ArduinoApps/myapp'
ssh arduino-2gb 'arduino-app-cli app stop ~/ArduinoApps/myapp'

# View Python print() output
ssh arduino-2gb 'arduino-app-cli app logs ~/ArduinoApps/myapp'

# View MCU Serial.println() output
ssh arduino-2gb 'arduino-app-cli monitor ~/ArduinoApps/myapp'

# Check what's running
ssh arduino-2gb 'arduino-app-cli app list'
</code></pre>
<h2>What's Next</h2>
<p>Now that the dev environment is working, I'm moving on to connecting sensors — starting with the HC-SR04 ultrasonic sensor for obstacle detection. The MCU will read the sensor, and the Python side will use the data for navigation decisions.</p>
<p>This is the foundation for HomeGuard Parivaar's autonomous patrol capability. Follow along as I build an eldercare robot, one sensor at a time.</p>
<hr />
<p><em>This is part of my journey building HomeGuard Parivaar — an eldercare robot for Indian families.</em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Basics: Improving Model Performance with Feature Engineering]]></title><description><![CDATA[Introduction
Welcome back to our Machine Learning Basics series! In our previous post, we built a simple linear regression model that achieved an R-squared score of only 0.123. While this gave us a good foundation, the model's predictive power was qu...]]></description><link>https://prahari.net/machine-learning-basics-improving-model-performance-with-feature-engineering</link><guid isPermaLink="true">https://prahari.net/machine-learning-basics-improving-model-performance-with-feature-engineering</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[feature engineering]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Thu, 02 Oct 2025 17:43:47 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Welcome back to our Machine Learning Basics series! In our <a target="_blank" href="https://blog.prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model">previous post</a>, we built a simple linear regression model that achieved an R-squared score of only 0.123. While this gave us a good foundation, the model's predictive power was quite limited.</p>
<p>In this tutorial, we'll explore <strong>feature engineering</strong> - one of the most powerful techniques in machine learning. By creating new features from existing data, we'll dramatically improve our model's performance from an R-squared of 0.123 to 0.862!</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this tutorial, you'll understand:</p>
<ul>
<li><p>What feature engineering is and why it matters</p>
</li>
<li><p>How to create new features from domain knowledge</p>
</li>
<li><p>The concept of interaction features</p>
</li>
<li><p>How feature engineering can dramatically improve model performance</p>
</li>
<li><p>The importance of data visualization in feature discovery</p>
</li>
</ul>
<h2 id="heading-what-is-feature-engineering">What is Feature Engineering?</h2>
<p><strong>Feature Engineering</strong> is the process of using domain knowledge to create new features (variables) from existing data that make machine learning algorithms work better. It's often considered more of an art than a science, requiring creativity and understanding of the problem domain.</p>
<p>Good features can:</p>
<ul>
<li><p>Capture important patterns in the data</p>
</li>
<li><p>Make relationships more apparent to the model</p>
</li>
<li><p>Significantly improve model accuracy</p>
</li>
</ul>
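<p>As a tiny preview of the interaction-feature idea (covered properly later in this post), here's a toy example with made-up rows, not the insurance data: multiplying two flags produces a feature that lets a linear model treat "obese smoker" differently from either flag alone.</p>

```python
import pandas as pd

toy = pd.DataFrame({
    "bmi":    [27.9, 33.8, 31.0],
    "smoker": [1,    0,    1],
})
toy["obese"] = (toy["bmi"] >= 30).astype("int8")     # WHO obesity threshold
toy["obese_smoker"] = toy["obese"] * toy["smoker"]   # interaction feature
```

<p>Only the third row gets a 1 in <code>obese_smoker</code>, because it's the only row where both conditions hold.</p>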
<h2 id="heading-getting-started">Getting Started</h2>
<p>Let's begin by loading our dataset and necessary libraries:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Install required modules and load the insurance dataset</span>
!pip install pandas seaborn matplotlib numpy
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
!curl -O https://raw.githubusercontent.com/stedy/Machine-Learning-<span class="hljs-keyword">with</span>-R-datasets/refs/heads/master/insurance.csv
df = pd.read_csv(<span class="hljs-string">'insurance.csv'</span>)
df.head()
</code></pre>
<p>We're using the same insurance dataset from the previous tutorial. Let's quickly remind ourselves what it contains.</p>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">   age     sex     bmi  children smoker     region      charges
<span class="hljs-number">0</span>   <span class="hljs-number">19</span>  female  <span class="hljs-number">27.900</span>         <span class="hljs-number">0</span>    yes  southwest  <span class="hljs-number">16884.92400</span>
<span class="hljs-number">1</span>   <span class="hljs-number">18</span>    male  <span class="hljs-number">33.770</span>         <span class="hljs-number">1</span>     no  southeast   <span class="hljs-number">1725.55230</span>
<span class="hljs-number">2</span>   <span class="hljs-number">28</span>    male  <span class="hljs-number">33.000</span>         <span class="hljs-number">3</span>     no  southeast   <span class="hljs-number">4449.46200</span>
<span class="hljs-number">3</span>   <span class="hljs-number">33</span>    male  <span class="hljs-number">22.705</span>         <span class="hljs-number">0</span>     no  northwest  <span class="hljs-number">21984.47061</span>
<span class="hljs-number">4</span>   <span class="hljs-number">32</span>    male  <span class="hljs-number">28.880</span>         <span class="hljs-number">0</span>     no  northwest   <span class="hljs-number">3866.85520</span>
</code></pre>
<h2 id="heading-verifying-data-structure">Verifying Data Structure</h2>
<p>Before we start engineering features, let's verify our dataset structure:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Check the dataset structure and data count</span>
df.info()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">&lt;<span class="hljs-class"><span class="hljs-keyword">class</span> '<span class="hljs-title">pandas</span>.<span class="hljs-title">core</span>.<span class="hljs-title">frame</span>.<span class="hljs-title">DataFrame</span>'&gt;
<span class="hljs-title">RangeIndex</span>:</span> <span class="hljs-number">1338</span> entries, <span class="hljs-number">0</span> to <span class="hljs-number">1337</span>
Data columns (total <span class="hljs-number">7</span> columns):
 <span class="hljs-comment">#   Column    Non-Null Count  Dtype</span>
---  ------    --------------  -----
 <span class="hljs-number">0</span>   age       <span class="hljs-number">1338</span> non-null   int64
 <span class="hljs-number">1</span>   sex       <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">2</span>   bmi       <span class="hljs-number">1338</span> non-null   float64
 <span class="hljs-number">3</span>   children  <span class="hljs-number">1338</span> non-null   int64
 <span class="hljs-number">4</span>   smoker    <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">5</span>   region    <span class="hljs-number">1338</span> non-null   object
 <span class="hljs-number">6</span>   charges   <span class="hljs-number">1338</span> non-null   float64
dtypes: float64(<span class="hljs-number">2</span>), int64(<span class="hljs-number">2</span>), object(<span class="hljs-number">3</span>)
memory usage: <span class="hljs-number">73.3</span>+ KB
</code></pre>
<p>Perfect! We have 1,338 records with no missing values.</p>
<h2 id="heading-data-quality-check">Data Quality Check</h2>
<p>Let's verify that our dataset only contains adult records, since this is health insurance data:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Check if we have any children records</span>
print(df[df[<span class="hljs-string">'age'</span>] &lt; <span class="hljs-number">18</span>])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Empty DataFrame
Columns: [age, sex, bmi, children, smoker, region, charges, obese]
Index: []
</code></pre>
<p>Good! All records are for adults (age 18 and above), which makes sense for individual health insurance policies.</p>
<h2 id="heading-feature-engineering-creating-the-obesity-flag">Feature Engineering: Creating the Obesity Flag</h2>
<p>Now comes the exciting part - creating new features! Our first engineered feature will be an <strong>obesity flag</strong> based on medical guidelines.</p>
<p>According to the World Health Organization (WHO):</p>
<ul>
<li><p><strong>Overweight</strong>: BMI ≥ 25</p>
</li>
<li><p><strong>Obese</strong>: BMI ≥ 30</p>
</li>
</ul>
<p>Let's create this feature along with converting our categorical variables to numerical format:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert 'male' to 1 and 'female' to 0 using the .replace() method</span>
df[<span class="hljs-string">'sex'</span>] = df[<span class="hljs-string">'sex'</span>].replace({<span class="hljs-string">'male'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'female'</span>: <span class="hljs-number">0</span>}).astype(<span class="hljs-string">'int8'</span>)
df[<span class="hljs-string">'smoker'</span>] = df[<span class="hljs-string">'smoker'</span>].replace({<span class="hljs-string">'yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'no'</span>: <span class="hljs-number">0</span>}).astype(<span class="hljs-string">'int8'</span>)

<span class="hljs-comment"># Lets add a flag for obesity</span>
<span class="hljs-comment"># As per WHO [https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight]</span>
<span class="hljs-comment"># For adults, WHO defines overweight and obesity as follows:</span>
<span class="hljs-comment"># overweight is a BMI greater than or equal to 25; and</span>
<span class="hljs-comment"># obesity is a BMI greater than or equal to 30.</span>

<span class="hljs-comment"># Use np.where to apply the conditional logic:</span>
<span class="hljs-comment"># Condition: df['bmi'] &gt;= 30</span>
<span class="hljs-comment"># Value if True: 1</span>
<span class="hljs-comment"># Value if False: 0</span>

df[<span class="hljs-string">'obese'</span>] = np.where(df[<span class="hljs-string">'bmi'</span>] &gt;= <span class="hljs-number">30</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>).astype(<span class="hljs-string">'int8'</span>)

<span class="hljs-comment"># Print the modified DataFrame to show the result</span>
print(<span class="hljs-string">"\nDataFrame after converting 'male' to 1 and 'female' to 0:"</span>)
print(df)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">DataFrame after converting <span class="hljs-string">'male'</span> to <span class="hljs-number">1</span> <span class="hljs-keyword">and</span> <span class="hljs-string">'female'</span> to <span class="hljs-number">0</span>:
      age  sex     bmi  children  smoker     region      charges  obese
<span class="hljs-number">0</span>      <span class="hljs-number">19</span>    <span class="hljs-number">0</span>  <span class="hljs-number">27.900</span>         <span class="hljs-number">0</span>       <span class="hljs-number">1</span>  southwest  <span class="hljs-number">16884.92400</span>      <span class="hljs-number">0</span>
<span class="hljs-number">1</span>      <span class="hljs-number">18</span>    <span class="hljs-number">1</span>  <span class="hljs-number">33.770</span>         <span class="hljs-number">1</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">1725.55230</span>      <span class="hljs-number">1</span>
<span class="hljs-number">2</span>      <span class="hljs-number">28</span>    <span class="hljs-number">1</span>  <span class="hljs-number">33.000</span>         <span class="hljs-number">3</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">4449.46200</span>      <span class="hljs-number">1</span>
<span class="hljs-number">3</span>      <span class="hljs-number">33</span>    <span class="hljs-number">1</span>  <span class="hljs-number">22.705</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northwest  <span class="hljs-number">21984.47061</span>      <span class="hljs-number">0</span>
<span class="hljs-number">4</span>      <span class="hljs-number">32</span>    <span class="hljs-number">1</span>  <span class="hljs-number">28.880</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northwest   <span class="hljs-number">3866.85520</span>      <span class="hljs-number">0</span>
<span class="hljs-meta">... </span>  ...  ...     ...       ...     ...        ...          ...    ...
<span class="hljs-number">1333</span>   <span class="hljs-number">50</span>    <span class="hljs-number">1</span>  <span class="hljs-number">30.970</span>         <span class="hljs-number">3</span>       <span class="hljs-number">0</span>  northwest  <span class="hljs-number">10600.54830</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1334</span>   <span class="hljs-number">18</span>    <span class="hljs-number">0</span>  <span class="hljs-number">31.920</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  northeast   <span class="hljs-number">2205.98080</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1335</span>   <span class="hljs-number">18</span>    <span class="hljs-number">0</span>  <span class="hljs-number">36.850</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  southeast   <span class="hljs-number">1629.83350</span>      <span class="hljs-number">1</span>
<span class="hljs-number">1336</span>   <span class="hljs-number">21</span>    <span class="hljs-number">0</span>  <span class="hljs-number">25.800</span>         <span class="hljs-number">0</span>       <span class="hljs-number">0</span>  southwest   <span class="hljs-number">2007.94500</span>      <span class="hljs-number">0</span>
<span class="hljs-number">1337</span>   <span class="hljs-number">61</span>    <span class="hljs-number">0</span>  <span class="hljs-number">29.070</span>         <span class="hljs-number">0</span>       <span class="hljs-number">1</span>  northwest  <span class="hljs-number">29141.36030</span>      <span class="hljs-number">0</span>

[<span class="hljs-number">1338</span> rows x <span class="hljs-number">8</span> columns]
</code></pre>
<p>Notice our new <strong>obese</strong> column! We've now converted the continuous BMI variable into a binary flag. This can sometimes help models capture non-linear relationships more effectively.</p>
<h3 id="heading-why-create-an-obesity-flag">Why Create an Obesity Flag?</h3>
<p>While we already have BMI as a continuous variable, creating a binary obesity flag can help because:</p>
<ul>
<li><p>Medical research shows obesity (BMI ≥ 30) is a distinct risk category</p>
</li>
<li><p>It captures a threshold effect that might be harder for linear models to detect</p>
</li>
<li><p>It's based on domain knowledge from healthcare</p>
</li>
</ul>
<h2 id="heading-visualizing-relationships">Visualizing Relationships</h2>
<p>Let's explore how age and charges are related with some visualizations. This helps us understand our data and discover potential new features.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore charges vs age data</span>
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> matplotlib <span class="hljs-keyword">import</span> pyplot <span class="hljs-keyword">as</span> plt
%matplotlib inline

df.plot(kind=<span class="hljs-string">'scatter'</span>, x=<span class="hljs-string">'age'</span>, y=<span class="hljs-string">'charges'</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)).set_title(<span class="hljs-string">"Charges vs Age"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426888898/71faca69-5b72-48d3-9713-428c372ca63b.png" alt class="image--center mx-auto" /></p>
<p>This scatter plot shows how insurance charges vary with age. Notice the distinct clusters - this suggests there might be important categorical factors affecting charges.</p>
<h2 id="heading-discovering-the-smoking-impact">Discovering the Smoking Impact</h2>
<p>Let's visualize how smoking status affects the relationship between age and charges:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore the impact of age and smoking</span>
g = sns.pairplot(data = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'charges'</span>]],
                 x_vars=[<span class="hljs-string">'age'</span>], y_vars=[<span class="hljs-string">'charges'</span>], aspect=<span class="hljs-number">1.5</span>, hue=<span class="hljs-string">'smoker'</span>)
g.fig.set_size_inches(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)
plt.title(<span class="hljs-string">"Impact of age and smoking on charges"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426918871/fa8d6c4d-a1ec-4505-82df-99ccf46f2bf8.png" alt class="image--center mx-auto" /></p>
<p>This visualization is revealing! We can see two distinct clusters:</p>
<ul>
<li><p><strong>Non-smokers (blue)</strong>: Lower charges that increase gradually with age</p>
</li>
<li><p><strong>Smokers (orange)</strong>: Significantly higher charges with steeper age-related increases</p>
</li>
</ul>
<p>This suggests that smoking has a major impact on insurance charges, and this impact might vary with age.</p>
<h2 id="heading-exploring-the-obesity-effect">Exploring the Obesity Effect</h2>
<p>Now let's examine how obesity affects the relationship:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explore the impact of age and obesity on charges</span>
g = sns.pairplot(data = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'bmi'</span>,<span class="hljs-string">'obese'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'charges'</span>]],
                 x_vars=[<span class="hljs-string">'age'</span>], y_vars=[<span class="hljs-string">'charges'</span>], aspect=<span class="hljs-number">1.5</span>, hue=<span class="hljs-string">'obese'</span>)
g.fig.set_size_inches(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>)
plt.title(<span class="hljs-string">"Impact of age and obesity on charges"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759426945800/4364a7b8-5fb5-4568-854d-2560867f270e.png" alt class="image--center mx-auto" /></p>
<p>Obesity also shows a clear effect on insurance charges, though perhaps not as pronounced as smoking.</p>
<h2 id="heading-creating-an-interaction-feature">Creating an Interaction Feature</h2>
<p>Here's where feature engineering gets really powerful. We noticed that both smoking and obesity affect charges. But what about people who are <strong>both</strong> smokers and obese? This combination might have an amplified effect.</p>
<p>This is called an <strong>interaction feature</strong> - a new feature created by combining two or more existing features to capture their combined effect.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's create a new feature which represents the product of the smoker and obesity features</span>
df[<span class="hljs-string">'smoker_obese'</span>] = df[<span class="hljs-string">'smoker'</span>] * df[<span class="hljs-string">'obese'</span>]
print(<span class="hljs-string">"Number of customers who are both obese and smoke: "</span>, df[df.smoker_obese == <span class="hljs-number">1</span>].shape[<span class="hljs-number">0</span>])
print(<span class="hljs-string">"Total number of customers: "</span>, df.shape[<span class="hljs-number">0</span>])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Number of customers who are both obese <span class="hljs-keyword">and</span> smoke:  <span class="hljs-number">145</span>
Total number of customers:  <span class="hljs-number">1338</span>
</code></pre>
<p>About 10% of customers are both smokers and obese. This is a high-risk group that likely has significantly higher insurance charges.</p>
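<p>The 10% figure comes straight from the two counts above:</p>

```python
both = 145    # customers who are both smokers and obese
total = 1338  # all customers
share = both / total
print(f"{share:.1%}")  # 10.8%
```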
<h3 id="heading-what-is-an-interaction-feature">What is an Interaction Feature?</h3>
<p>An <strong>interaction feature</strong> captures the combined effect of two or more features. The mathematical operation here is multiplication:</p>
<ul>
<li><p>If someone is obese (1) AND a smoker (1): <code>smoker_obese = 1 × 1 = 1</code></p>
</li>
<li><p>If someone is only obese: <code>smoker_obese = 1 × 0 = 0</code></p>
</li>
<li><p>If someone is only a smoker: <code>smoker_obese = 0 × 1 = 0</code></p>
</li>
<li><p>If neither: <code>smoker_obese = 0 × 0 = 0</code></p>
</li>
</ul>
<p>This allows the model to assign a separate coefficient to this high-risk combination.</p>
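<p>The truth table above is just an elementwise product. A minimal pure-Python sketch:</p>

```python
smoker = [1, 1, 0, 0]
obese  = [1, 0, 1, 0]

# The interaction is 1 only when both flags are 1
smoker_obese = [s * o for s, o in zip(smoker, obese)]
print(smoker_obese)  # [1, 0, 0, 0]
```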
<h2 id="heading-preparing-features-for-training">Preparing Features for Training</h2>
<p>Now let's select our features for model training. Notice we're including our newly engineered features:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's create one DataFrame with the features and one with the target</span>
x = df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'obese'</span>, <span class="hljs-string">'smoker_obese'</span>]]
y = df[<span class="hljs-string">'charges'</span>]
</code></pre>
<p>Our feature set now includes:</p>
<ul>
<li><p><strong>Original features</strong>: age, bmi, sex, children, smoker</p>
</li>
<li><p><strong>Engineered features</strong>: obese, smoker_obese</p>
</li>
</ul>
<h2 id="heading-training-the-improved-model">Training the Improved Model</h2>
<p>Let's train a linear regression model with our enhanced feature set:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's train the model</span>
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> linear_model

<span class="hljs-comment"># Create a new Linear Regression model</span>
lr = linear_model.LinearRegression()

<span class="hljs-comment"># Train the model</span>
lr.fit(x, y)

<span class="hljs-comment"># Print the coefficients</span>
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=[<span class="hljs-string">'Coefficient'</span>])
print(coeffs)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">               Coefficient
age             <span class="hljs-number">263.807602</span>
bmi              <span class="hljs-number">98.637188</span>
sex            <span class="hljs-number">-488.091970</span>
children        <span class="hljs-number">515.971652</span>
smoker        <span class="hljs-number">13431.633343</span>
obese          <span class="hljs-number">-805.123043</span>
smoker_obese  <span class="hljs-number">19734.622381</span>
</code></pre>
<h3 id="heading-understanding-the-new-coefficients">Understanding the New Coefficients</h3>
<p>Let's interpret what these coefficients tell us:</p>
<ul>
<li><p><strong>age (263.81)</strong>: Each additional year adds ~$264 to charges</p>
</li>
<li><p><strong>bmi (98.64)</strong>: Each BMI unit adds ~$99 to charges (note: much lower than before)</p>
</li>
<li><p><strong>sex (-488.09)</strong>: Males have ~$488 lower charges than females (interesting!)</p>
</li>
<li><p><strong>children (515.97)</strong>: Each child adds ~$516 to charges</p>
</li>
<li><p><strong>smoker (13,431.63)</strong>: Smoking adds a whopping ~$13,432 to charges!</p>
</li>
<li><p><strong>obese (-805.12)</strong>: The obesity flag alone shows a negative effect (because the interaction term captures the real impact)</p>
</li>
<li><p><strong>smoker_obese (19,734.62)</strong>: Being both a smoker AND obese adds an additional ~$19,735!</p>
</li>
</ul>
<p>The <strong>smoker_obese</strong> coefficient is the highest, confirming our hypothesis that this combination is especially costly.</p>
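<p>Using the rounded coefficients from the table above, the model's implied smoking premium depends on obesity status. A small sketch (these are hand-copied, rounded values, not a fresh model run):</p>

```python
smoker_coef = 13431.63
interaction_coef = 19734.62

# Implied extra charge for smoking, holding everything else fixed:
premium_non_obese = smoker_coef                 # obese = 0, interaction term drops out
premium_obese = smoker_coef + interaction_coef  # obese = 1, so the interaction kicks in
print(round(premium_non_obese, 2))  # 13431.63
print(round(premium_obese, 2))      # 33166.25
```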
<h2 id="heading-making-predictions">Making Predictions</h2>
<p>Let's use our improved model to make predictions:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Let's try to predict</span>
predictions = lr.predict(x)
print(predictions)

scores = pd.DataFrame({<span class="hljs-string">'Actual'</span>: y, <span class="hljs-string">'Predicted'</span>: predictions})
scores.head()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">[<span class="hljs-number">16316.56695109</span>  <span class="hljs-number">2422.88293888</span>  <span class="hljs-number">6016.95162578</span> ...  <span class="hljs-number">2698.805796</span>
  <span class="hljs-number">3205.41071655</span> <span class="hljs-number">27511.89173746</span>]

        Actual     Predicted
<span class="hljs-number">0</span>  <span class="hljs-number">16884.92400</span>  <span class="hljs-number">16316.566951</span>
<span class="hljs-number">1</span>   <span class="hljs-number">1725.55230</span>   <span class="hljs-number">2422.882939</span>
<span class="hljs-number">2</span>   <span class="hljs-number">4449.46200</span>   <span class="hljs-number">6016.951626</span>
<span class="hljs-number">3</span>  <span class="hljs-number">21984.47061</span>   <span class="hljs-number">5577.727872</span>
<span class="hljs-number">4</span>   <span class="hljs-number">3866.85520</span>   <span class="hljs-number">5923.004906</span>
</code></pre>
<p>Notice how much closer the predictions are to the actual values compared to our first model!</p>
<h2 id="heading-evaluating-the-improved-model">Evaluating the Improved Model</h2>
<p>Now for the moment of truth - let's see how much our feature engineering improved the model:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> metrics

print(<span class="hljs-string">'Root Mean Squared Error:'</span>, np.sqrt(metrics.mean_squared_error(y, predictions)))
print(<span class="hljs-string">'Mean Absolute Error:'</span>, metrics.mean_absolute_error(y, predictions))
print(<span class="hljs-string">'Mean Squared Error:'</span>, metrics.mean_squared_error(y, predictions))

print(<span class="hljs-string">"Average Cost:"</span>, y.mean())
print(<span class="hljs-string">"R-squared:"</span>, metrics.r2_score(y, predictions))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">Root Mean Squared Error: <span class="hljs-number">4490.387801338095</span>
Mean Absolute Error: <span class="hljs-number">2460.035500296957</span>
Mean Squared Error: <span class="hljs-number">20163582.606405977</span>
Average Cost: <span class="hljs-number">13270.422265141257</span>
R-squared: <span class="hljs-number">0.8624047908410836</span>
</code></pre>
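<p>As a quick sanity check, RMSE is just the square root of MSE, so the two reported values should agree:</p>

```python
import math

mse = 20163582.606405977
rmse = math.sqrt(mse)
print(rmse)  # ≈ 4490.39
```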
<h3 id="heading-performance-comparison">Performance Comparison</h3>
<p>Let's compare our improved model with the original:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Metric</th><th>Original Model</th><th>Improved Model</th><th>Change</th></tr>
</thead>
<tbody>
<tr>
<td><strong>RMSE</strong></td><td>$11,336</td><td>$4,490</td><td>✅ 60% reduction</td></tr>
<tr>
<td><strong>MAE</strong></td><td>$8,982</td><td>$2,460</td><td>✅ 73% reduction</td></tr>
<tr>
<td><strong>R-squared</strong></td><td>0.123</td><td>0.862</td><td>✅ 601% increase</td></tr>
</tbody>
</table>
</div><h3 id="heading-what-this-means">What This Means</h3>
<p>Our improved model explains <strong>86.2%</strong> of the variance in insurance charges, compared to just <strong>12.3%</strong> before. This is a dramatic improvement!</p>
<ul>
<li><p><strong>RMSE dropped by 60%</strong>: Our predictions are now much more accurate</p>
</li>
<li><p><strong>MAE dropped by 73%</strong>: The average prediction error is just $2,460 instead of $8,982</p>
</li>
<li><p><strong>R-squared increased to 0.862</strong>: We now explain 86.2% of the variation in charges</p>
</li>
</ul>
<p>This demonstrates the enormous power of feature engineering!</p>
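<p>For intuition, R² compares the model's squared error against that of always predicting the mean. A toy sketch with made-up numbers (not the insurance data):</p>

```python
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 2))  # 0.98
```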
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ol>
<li><p><strong>Feature engineering is powerful</strong>: Simple feature engineering improved R² from 0.123 to 0.862</p>
</li>
<li><p><strong>Domain knowledge matters</strong>: Understanding obesity thresholds helped create meaningful features</p>
</li>
<li><p><strong>Interaction features capture combined effects</strong>: The <code>smoker_obese</code> feature was crucial</p>
</li>
<li><p><strong>Visualization guides feature creation</strong>: Plotting helped us discover the smoking and obesity patterns</p>
</li>
<li><p><strong>Small datasets benefit greatly from good features</strong>: With only 1,338 records, feature engineering was essential</p>
</li>
</ol>
<h2 id="heading-why-did-this-work-so-well">Why Did This Work So Well?</h2>
<p>Our feature engineering succeeded because:</p>
<ol>
<li><p><strong>Domain-driven</strong>: We used medical knowledge (BMI ≥ 30 for obesity) to create meaningful categories</p>
</li>
<li><p><strong>Captured non-linearity</strong>: The obesity flag helped the linear model capture threshold effects</p>
</li>
<li><p><strong>Interaction effects</strong>: The <code>smoker_obese</code> feature captured the amplified risk of combined factors</p>
</li>
<li><p><strong>Data-driven discovery</strong>: Visualization helped us identify which features to engineer</p>
</li>
</ol>
<h2 id="heading-next-steps">Next Steps</h2>
<p>To further improve this model, you could:</p>
<ol>
<li><p><strong>Create more interaction features</strong>: Try <code>age * smoker</code>, <code>bmi * age</code>, etc.</p>
</li>
<li><p><strong>Polynomial features</strong>: Create squared or cubed terms (age², bmi², etc.)</p>
</li>
<li><p><strong>Encode region</strong>: We excluded region - adding it might help</p>
</li>
<li><p><strong>Try other algorithms</strong>: Random Forest or Gradient Boosting might capture even more patterns</p>
</li>
<li><p><strong>Cross-validation</strong>: Use proper train/test splits to validate performance</p>
</li>
</ol>
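<p>As a sketch of the first two ideas, new columns are just arithmetic on existing ones. Shown here on plain lists with made-up values; with pandas you would apply the same operations to DataFrame columns:</p>

```python
ages = [19, 45, 61]
smoker = [1, 0, 1]

age_squared = [a ** 2 for a in ages]                # polynomial term
age_smoker = [a * s for a, s in zip(ages, smoker)]  # interaction term
print(age_squared)  # [361, 2025, 3721]
print(age_smoker)   # [19, 0, 61]
```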
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've seen firsthand how powerful feature engineering can be. By adding just two simple features (obesity flag and smoker-obesity interaction), we improved our model's R² from 0.123 to 0.862 - a massive improvement!</p>
<p>This tutorial demonstrates a key principle in machine learning: <strong>Better features often matter more than better algorithms</strong>. Before reaching for complex deep learning models, invest time in understanding your data and engineering meaningful features.</p>
<p>Remember the workflow:</p>
<ol>
<li><p><strong>Explore your data</strong> through visualization</p>
</li>
<li><p><strong>Apply domain knowledge</strong> to create meaningful features</p>
</li>
<li><p><strong>Test interaction effects</strong> between important variables</p>
</li>
<li><p><strong>Evaluate and iterate</strong> on your features</p>
</li>
</ol>
<p>In our next post, we'll explore train-test splits, cross-validation, and how to properly evaluate model performance to avoid overfitting.</p>
<hr />
<p><strong>What's Next?</strong> Stay tuned for our next post where we'll explore proper model validation techniques and introduce regularization!</p>
<p><em>Have questions about feature engineering? Feel free to reach out or leave a comment below.</em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Basics: Building Your First Simple Linear Regression Model]]></title><description><![CDATA[Introduction
Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: Linear Regression. We'll build a simple linear regression model to predict ins...]]></description><link>https://prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model</link><guid isPermaLink="true">https://prahari.net/machine-learning-basics-building-your-first-simple-linear-regression-model</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ashish Disawal]]></dc:creator><pubDate>Sun, 21 Sep 2025 14:30:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Welcome to the first post in our Machine Learning Basics series! In this tutorial, we'll dive into one of the most fundamental algorithms in machine learning: <strong>Linear Regression</strong>. We'll build a simple linear regression model to predict insurance charges based on various demographic and health factors.</p>
<p>Linear regression is an excellent starting point for anyone learning machine learning because it's intuitive, interpretable, and forms the foundation for many more complex algorithms.</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this tutorial, you'll understand:</p>
<ul>
<li><p>How to prepare data for machine learning</p>
</li>
<li><p>The basics of linear regression</p>
</li>
<li><p>How to build and train a linear regression model</p>
</li>
<li><p>How to evaluate model performance</p>
</li>
<li><p>How to interpret model coefficients</p>
</li>
</ul>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>We'll be working with a <a target="_blank" href="https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv">health insurance dataset</a> that contains information about:</p>
<ul>
<li><p><strong>Age</strong>: Age of the individual</p>
</li>
<li><p><strong>Sex</strong>: Gender (male/female)</p>
</li>
<li><p><strong>BMI</strong>: Body Mass Index</p>
</li>
<li><p><strong>Children</strong>: Number of children/dependents</p>
</li>
<li><p><strong>Smoker</strong>: Whether the person smokes (yes/no)</p>
</li>
<li><p><strong>Region</strong>: Geographic region</p>
</li>
<li><p><strong>Charges</strong>: Medical insurance charges (our target variable)</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>First, let's import the necessary libraries and load our dataset:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> linear_model, metrics
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Load the insurance dataset</span>
df = pd.read_csv(<span class="hljs-string">'insurance.csv'</span>)
print(df.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
</code></pre>
<p>Let's examine the structure of our data:</p>
<pre><code class="lang-python">df.info()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
</code></pre>
<p>This gives us important information about our dataset:</p>
<ul>
<li><p><strong>1,338 entries</strong> (rows)</p>
</li>
<li><p><strong>7 columns</strong> with no missing values</p>
</li>
<li><p>Mix of numerical (age, bmi, charges) and categorical (sex, smoker, region) data</p>
</li>
</ul>
<h2 id="heading-data-preprocessing">Data Preprocessing</h2>
<p>Machine learning algorithms work with numerical data, so we need to convert categorical variables to numerical format. This process is called <strong>encoding</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert categorical variables to numerical</span>
df[<span class="hljs-string">'sex'</span>] = df[<span class="hljs-string">'sex'</span>].replace({<span class="hljs-string">'male'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'female'</span>: <span class="hljs-number">0</span>})
df[<span class="hljs-string">'smoker'</span>] = df[<span class="hljs-string">'smoker'</span>].replace({<span class="hljs-string">'yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'no'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># Print the modified DataFrame to show the result</span>
print(<span class="hljs-string">"\nDataFrame after converting 'male' to 1 and 'female' to 0:"</span>)
print(df)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">DataFrame after converting 'male' to 1 and 'female' to 0:
      age  sex     bmi  children  smoker     region      charges
0      19    0  27.900         0       0  southwest  16884.92400
1      18    1  33.770         1       1  southeast   1725.55230
2      28    1  33.000         3       1  southeast   4449.46200
3      33    1  22.705         0       1  northwest  21984.47061
4      32    1  28.880         0       1  northwest   3866.85520
...   ...  ...     ...       ...     ...        ...          ...
1333   50    1  30.970         3       1  northwest  10600.54830
1334   18    0  31.920         0       0  northeast   2205.98080
1335   18    0  36.850         0       0  southeast   1629.83350
1336   21    0  25.800         0       0  southwest   2007.94500
1337   61    0  29.070         0       0  northwest  29141.36030

[1338 rows x 7 columns]
</code></pre>
<p>Perfect! Now we can see that:</p>
<ul>
<li><p><strong>Sex</strong>: <code>female</code> = 0, <code>male</code> = 1</p>
</li>
<li><p><strong>Smoker</strong>: <code>no</code> = 0, <code>yes</code> = 1</p>
</li>
</ul>
<h2 id="heading-exploratory-data-analysis">Exploratory Data Analysis</h2>
<p>Before building our model, it's crucial to understand the relationships in our data. Visualization helps us identify patterns and potential issues.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create pairplot to visualize relationships</span>
sns.pairplot(df)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758473420169/5500cbf7-b5ee-4911-b331-02129be80ceb.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python"><span class="hljs-comment"># Focus on relationships with our target variable (charges)</span>
sns.pairplot(data=df[[<span class="hljs-string">'age'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'children'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'charges'</span>]],
             x_vars=[<span class="hljs-string">'age'</span>, <span class="hljs-string">'smoker'</span>, <span class="hljs-string">'bmi'</span>, <span class="hljs-string">'sex'</span>],
             y_vars=<span class="hljs-string">'charges'</span>,
             aspect=<span class="hljs-number">1</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758473505252/d8f4e902-7222-48d1-9e20-2f58f13f2c28.png" alt class="image--center mx-auto" /></p>
<p>These visualizations help us understand:</p>
<ul>
<li><p>Which variables might be good predictors of insurance charges</p>
</li>
<li><p>Whether there are any obvious outliers</p>
</li>
<li><p>The distribution of our data</p>
</li>
</ul>
<h2 id="heading-preparing-the-data-for-machine-learning">Preparing the Data for Machine Learning</h2>
<p>In machine learning, we separate our data into:</p>
<ul>
<li><p><strong>Features (X)</strong>: The input variables we use to make predictions</p>
</li>
<li><p><strong>Target (y)</strong>: The variable we want to predict</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Select features (first 5 columns excluding region)</span>
x = df.iloc[:, :<span class="hljs-number">5</span>]  <span class="hljs-comment"># age, sex, bmi, children, smoker</span>
y = df.iloc[:, <span class="hljs-number">6</span>]   <span class="hljs-comment"># charges</span>

print(x.head())
print(y.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">   age  sex     bmi  children  smoker
0   19    0  27.900         0       0
1   18    1  33.770         1       1
2   28    1  33.000         3       1
3   33    1  22.705         0       1
4   32    1  28.880         0       1

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64
</code></pre>
<h2 id="heading-building-the-linear-regression-model">Building the Linear Regression Model</h2>
<p>Now for the exciting part - building our machine learning model!</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create and train the linear regression model</span>
lr = linear_model.LinearRegression()
lr.fit(x, y)

<span class="hljs-comment"># Display the coefficients</span>
coeffs = pd.DataFrame(lr.coef_, x.columns, columns=[<span class="hljs-string">'Coefficient'</span>])
coeffs
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">          Coefficient
age        241.263511
sex        660.859891
bmi        326.761491
children   533.168130
smoker     660.859891
</code></pre>
<h3 id="heading-understanding-the-coefficients">Understanding the Coefficients</h3>
<p>The coefficients tell us how much each feature influences the insurance charges, holding the other features constant:</p>
<ul>
<li><p><strong>Age (241.26)</strong>: For each additional year of age, insurance charges increase by ~$241</p>
</li>
<li><p><strong>Sex (660.86)</strong>: Being male (vs female) increases charges by ~$661</p>
</li>
<li><p><strong>BMI (326.76)</strong>: Each unit increase in BMI adds ~$327 to charges</p>
</li>
<li><p><strong>Children (533.17)</strong>: Each additional child increases charges by ~$533</p>
</li>
<li><p><strong>Smoker (660.86)</strong>: Being a smoker increases charges by ~$661</p>
</li>
</ul>
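<p>To make the coefficient story concrete: a linear model's prediction is simply the intercept (<code>lr.intercept_</code>) plus each coefficient multiplied by its feature value. A minimal sketch, using synthetic stand-in data since all it needs is a fitted model, confirms that <code>lr.predict</code> is exactly that weighted sum:</p>
<pre><code class="lang-python">import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the five insurance features
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 5))          # age, sex, bmi, children, smoker
y = rng.uniform(1000, 20000, size=10)

lr = LinearRegression().fit(X, y)

# Prediction = intercept + sum(coefficient * feature value)
manual = lr.intercept_ + X @ lr.coef_
print(np.allclose(manual, lr.predict(X)))    # True
</code></pre>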
<h2 id="heading-making-predictions">Making Predictions</h2>
<p>Let's use our trained model to make predictions:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Make predictions on our training data</span>
predictions = lr.predict(x)
print(predictions)

<span class="hljs-comment"># Compare actual vs predicted values</span>
scores = pd.DataFrame({<span class="hljs-string">'Actual'</span>: y, <span class="hljs-string">'Predicted'</span>: predictions})
scores.head()
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">[ 6240.68269989  9772.39705015 12999.76207347 ...  8923.93452889
  6037.01059213 16756.06111267]

        Actual     Predicted
0  16884.92400   6240.682700
1   1725.55230   9772.397050
2   4449.46200  12999.762073
3  21984.47061   9242.565695
4   3866.85520  11019.054388
</code></pre>
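<p>One quick way to make the actual-vs-predicted comparison more readable is to add an explicit error column. A small sketch using just the first three rows printed above (the values are copied from that output):</p>
<pre><code class="lang-python">import pandas as pd

# First three actual/predicted pairs from the output above
scores = pd.DataFrame({
    "Actual":    [16884.924, 1725.552, 4449.462],
    "Predicted": [6240.683, 9772.397, 12999.762],
})

# An explicit error column makes the worst predictions easy to spot
scores["Error"] = scores["Actual"] - scores["Predicted"]
print(scores["Error"].abs().idxmax())   # row 0 is the largest miss here
</code></pre>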
<h2 id="heading-evaluating-model-performance">Evaluating Model Performance</h2>
<p>It's crucial to evaluate how well our model performs. We'll use several metrics:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate performance metrics</span>
print(<span class="hljs-string">'Root Mean Squared Error:'</span>, np.sqrt(metrics.mean_squared_error(y, predictions)))
print(<span class="hljs-string">'Mean Absolute Error:'</span>, metrics.mean_absolute_error(y, predictions))
print(<span class="hljs-string">'Mean Squared Error:'</span>, metrics.mean_squared_error(y, predictions))

print(<span class="hljs-string">"Average Cost:"</span>, y.mean())
print(<span class="hljs-string">"R-squared:"</span>, metrics.r2_score(y, predictions))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Root Mean Squared Error: 11336.133773688362
Mean Absolute Error: 8982.350383484953
Mean Squared Error: 128507928.93495792
Average Cost: 13270.422265141257
R-squared: 0.12306876681889345
</code></pre>
<h3 id="heading-understanding-the-metrics">Understanding the Metrics</h3>
<ul>
<li><p><strong>RMSE (11,336)</strong>: The typical prediction error is about $11,336, with large misses penalized more heavily than small ones</p>
</li>
<li><p><strong>MAE (8,982)</strong>: The average absolute error is about $8,982</p>
</li>
<li><p><strong>R-squared (0.123)</strong>: Our model explains about 12.3% of the variance in insurance charges</p>
</li>
</ul>
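<p>These metrics are easy to compute directly from their definitions, which is a useful sanity check on what sklearn is doing under the hood. A plain-NumPy sketch:</p>
<pre><code class="lang-python">import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": mae, "r2": r2}

# Always predicting the mean of y gives R^2 = 0; a perfect fit gives R^2 = 1
m = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(m["r2"])   # 0.0
</code></pre>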
<h3 id="heading-what-does-this-mean">What Does This Mean?</h3>
<p>An R-squared of 0.123 means our simple model only explains about 12% of the variation in insurance charges. This suggests that:</p>
<ol>
<li><p><strong>The model is quite basic</strong> - there's room for improvement</p>
</li>
<li><p><strong>Important features might be missing</strong> - perhaps we need more variables</p>
</li>
<li><p><strong>The relationship might not be purely linear</strong> - we might need more sophisticated models</p>
</li>
</ol>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ol>
<li><p><strong>Linear regression is interpretable</strong>: We can easily understand how each feature affects the outcome</p>
</li>
<li><p><strong>Data preprocessing is crucial</strong>: Converting categorical variables to numerical format is essential</p>
</li>
<li><p><strong>Visualization helps</strong>: Exploring data relationships guides model building</p>
</li>
<li><p><strong>Model evaluation is important</strong>: Metrics help us understand model performance</p>
</li>
<li><p><strong>Simple models are a good starting point</strong>: Even basic models provide valuable insights</p>
</li>
</ol>
<h2 id="heading-next-steps">Next Steps</h2>
<p>To improve this model, you could:</p>
<ol>
<li><p><strong>Feature engineering</strong>: Create new features or transform existing ones</p>
</li>
<li><p><strong>Include more variables</strong>: Add the 'region' variable after proper encoding</p>
</li>
<li><p><strong>Try different algorithms</strong>: Random Forest, Support Vector Machines, etc.</p>
</li>
<li><p><strong>Handle outliers</strong>: Identify and address unusual data points</p>
</li>
<li><p><strong>Cross-validation</strong>: Use better evaluation techniques</p>
</li>
</ol>
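<p>Two of those steps, encoding <code>region</code> and cross-validation, fit in a few lines. Here is a sketch on a synthetic stand-in DataFrame (the column names mirror the dataset, but the generated values and coefficients are made up for illustration):</p>
<pre><code class="lang-python">import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurance DataFrame
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.uniform(18, 40, n),
    "smoker": rng.integers(0, 2, n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * df["smoker"] + rng.normal(0, 2000, n))

# Step 2: one-hot encode 'region' so it can enter a linear model
X = pd.get_dummies(df.drop(columns="charges"), columns=["region"], drop_first=True)
y = df["charges"]

# Step 5: 5-fold cross-validated R^2 instead of scoring on the training data
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Mean CV R^2:", round(scores.mean(), 3))
</code></pre>
<p>Cross-validated scores are the honest version of the R² we computed earlier: scoring a model on the same rows it was trained on tends to flatter it.</p>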
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've built your first machine learning model using linear regression. While this simple model has limitations (R² of 0.123), it demonstrates the fundamental machine learning workflow:</p>
<ol>
<li><p><strong>Data collection and exploration</strong></p>
</li>
<li><p><strong>Data preprocessing</strong></p>
</li>
<li><p><strong>Model training</strong></p>
</li>
<li><p><strong>Prediction and evaluation</strong></p>
</li>
</ol>
<p>This foundation will serve you well as you explore more advanced machine learning techniques. In our next post, we'll explore how to improve this model and introduce more sophisticated algorithms.</p>
<hr />
<p><strong>What's Next?</strong> Stay tuned for our next post where we'll explore multiple linear regression with feature engineering and better evaluation techniques!</p>
<p><em>Have questions about this tutorial? Feel free to reach out or leave a comment below.</em></p>
]]></content:encoded></item></channel></rss>