EXAMPLE - Linear Regression, Done Properly
Note: This is an example article demonstrating how to write good technical articles, as described in the guide available here. This article is meant to be opened on the side while reading that guide.
1. Why I Am Writing This
I wrote this article because linear regression is the first model most of us reach for, and the first one most of us quietly misuse.
I have seen it treated as a checkbox baseline, a justification tool, and sometimes as a substitute for thinking. Coefficients get trusted too quickly, R² gets mistaken for validation, and a single library call gets confused with understanding. When these models fail in real projects, it is almost never because linear regression is weak. It is because the assumptions were never made explicit.
Inside Divami, linear regression shows up everywhere:
- as a sanity check baseline before a more complex model
- as a quick estimate for forecasting and capacity planning
- as an interpretable model when stakeholders ask “why”
- as a component inside larger systems (feature calibration, residual modeling, trend extraction)
I wrote this article because I want to make linear regression reusable without mythology. I will define what it guarantees, what it assumes, how to execute it by hand, how it fails, how to diagnose it, and when not to use it.
2. What This Article Covers (and What It Intentionally Does Not)
I want to be clear about what this article covers and what it does not.
This article is not about “machine learning in general.” It is not about neural networks. It is not about building production ML systems. It is a bounded explanation of linear regression as a technique.
What I Am Explicitly Not Doing Here
I will not cover:
- generalized linear models (logistic regression, Poisson regression)
- regularization variants (Ridge/Lasso/Elastic Net), except as alternatives in Section 7
- matrix calculus or full derivations beyond what is necessary
- statistical inference details (p-values, confidence intervals), except when relevant to misuse
What You Need to Already Be Comfortable With
I expect you to be able to:
- read basic Python and NumPy-style array operations
- reason about functions, errors, and gradients at a high level
- follow algebra involving sums and squares
- interpret simple plots (scatter plot, residual plot)
If you cannot follow a squared error objective and a “best fit line” explanation, stop. Learn that first.
How I Think About Linear Regression Before Touching Any Math
I think of linear regression as a transformation from observed data to a linear predictor.
Inputs:
- feature matrix $X$ with shape (n_samples, n_features)
- target vector $y$ with shape (n_samples,)
Outputs:
- coefficients $w$ with shape (n_features,)
- intercept $b$ (optional depending on formulation)
- predictions $\hat{y} = Xw + b$
Guarantee (under stated assumptions):
- the chosen parameters minimize the sum of squared residuals on the given dataset
Mermaid boundary view:
flowchart LR
A["Input: Features X<br/>n x d"] --> B["Linear Model<br/>ŷ = Xw + b"]
C["Input: Targets y<br/>n"] --> D["Objective<br/>min Σ(ŷ - y)^2"]
B --> D
D --> E["Output: Coefficients w,<br/>Intercept b"]
E --> F["Predictions ŷ"]
3. The Language I Will Use (So We Don’t Get Lost Later)
I use the following terms consistently:
- Feature ($x$): an input variable used to predict the target.
- Target ($y$): the value we want to predict.
- Prediction ($\hat{y}$): the model output for a given input.
- Residual ($e$): $e = y - \hat{y}$.
- Loss / Objective: the scalar we minimize; here, sum of squared residuals.
- Least Squares: the method of choosing parameters to minimize squared residuals.
Conventions in this article:
- Individual observations are lowercase scalars ($x_i, y_i$), vectors are plain lowercase ($w$, $y$), matrices are uppercase ($X$).
- I assume an intercept term exists unless explicitly stated otherwise.
- “Assumption” means a condition required for interpretation and stability, not for computing the line.
4. What Linear Regression Actually Guarantees
What I rely on here is the idea that linear regression is not “finding the true relationship.” It is “finding the best linear approximation under squared error.”
Core Concept
I choose parameters $(w, b)$ that minimize:
$$ \sum_i \left( (w \cdot x_i + b) - y_i \right)^2 $$
This is least squares.
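In matrix form, with a design matrix $X$ that includes the intercept column, this objective is minimized by any $\beta$ satisfying the normal equations:
$$ X^\top X \, \beta = X^\top y \quad\Rightarrow\quad \beta = (X^\top X)^{-1} X^\top y \ \text{ when } X^\top X \text{ is invertible.} $$
This is the form the code in Section 5 computes directly.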
Non-Negotiable Invariants
- Objective is squared error. Changing the objective changes the solution.
- Model is linear in parameters. The input can be transformed (polynomials, splines), but the model remains linear in $w$.
- The chosen solution minimizes training squared error. This is a guarantee about optimization on the provided dataset, not about generalization.
- Residuals define the fit. If residuals are structured, the model is missing signal.
Demonstration on Paper
Smallest non-trivial case: one feature, two points.
Points: ($x_1, y_1$) = (0, 1), ($x_2, y_2$) = (2, 5).
A line $y = wx + b$ passing through both points has:
- from $x=0$: $b = 1$
- from $x=2$: $5 = 2w + 1 \Rightarrow w = 2$
So $ \hat{y} = 2x + 1$.
No matrices. No libraries. This is what “fit” means.
Pause-and-Verify Checkpoint
Before computing $w$, I predict its sign:
- if higher $x$ corresponds to higher $y$, slope must be positive.
If my computed slope contradicts this, something is wrong in my setup.
Counter-Example That Looks Valid But Is Wrong
If I force $b = 0$ (no intercept) on the same points:
- I get a line that cannot represent the $x=0, y=1$ point
- the fit will distort the slope to compensate
- coefficients become “wrong” not because the math failed, but because I violated the boundary (missing intercept)
This is a common real-world misuse: people drop the intercept without understanding the consequence.
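A minimal sketch of that distortion, assuming NumPy is available (np.linalg.lstsq is used here only to avoid repeating the hand algebra):
import numpy as np
x = np.array([0.0, 2.0])
y = np.array([1.0, 5.0])
# With intercept: design matrix [1, x]
X_full = np.c_[np.ones_like(x), x]
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print("with intercept -> b, w =", beta_full)   # [1.0, 2.0], the exact fit
# Without intercept: design matrix [x] only, forcing the line through the origin
X_no_b = x.reshape(-1, 1)
beta_no_b, *_ = np.linalg.lstsq(X_no_b, y, rcond=None)
print("no intercept   -> w =", beta_no_b)      # slope distorted to 2.5 to compensate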
5. Making It Concrete Without Hiding Behind Libraries
I want to show that the invariants survive implementation, not just theory.
Minimal Reproducible Notebook Logic
Python sketch (runnable as-is with NumPy):
import numpy as np
# Toy data
X = np.array([[0.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 5.0, 7.0, 9.0])
# Add intercept column
X_design = np.c_[np.ones(len(X)), X] # [1, x]
# Closed-form least squares: beta = (X^T X)^{-1} (X^T y)
beta = np.linalg.inv(X_design.T @ X_design) @ (X_design.T @ y)
b, w = beta[0], beta[1]
y_hat = X_design @ beta
residuals = y - y_hat
print("b:", b, "w:", w)
print("SSE:", np.sum(residuals**2))
Mapping Back to Invariants
- I am explicitly minimizing squared error by using the least squares closed form.
- The model is linear in parameters ($b, w$).
- Residuals define fit quality, not coefficient aesthetics.
Signals and Diagnostics
Expected signals on clean linear data:
- SSE should be near zero if points lie on a line.
- Residuals should have no pattern when plotted against $x$.
Failure signals:
- residuals grow with $x$ (heteroscedasticity)
- residuals curve (nonlinearity)
- coefficients change drastically with small data perturbations (instability)
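A minimal residual-plot sketch, assuming matplotlib is available and reusing X and residuals from the snippet above; the only goal is to eyeball structure:
import matplotlib.pyplot as plt
# Residuals vs. the feature: a flat, patternless band around zero is the healthy signal
plt.scatter(X[:, 0], residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual (y - y_hat)")
plt.show()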
6. How This Usually Breaks in Real Projects
What usually goes wrong when I apply this is that linear regression fails silently. That is the real danger.
Failure Mode 1: Outliers Dominate
Symptom:
- one extreme point rotates the line
Diagnostic:
- residual plot shows one point with massive error
- coefficients shift drastically after removing that point
Fix options:
- robust regression (Huber, RANSAC)
- winsorization or outlier handling (only if justified)
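As one possible shape of the robust-regression fix, assuming scikit-learn is available (HuberRegressor shown; RANSACRegressor follows the same fit/predict pattern):
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=50)
y[-1] += 40.0                                   # one extreme point at the far right
ols = LinearRegression().fit(x, y)
robust = HuberRegressor().fit(x, y)
print("OLS slope:   ", ols.coef_[0])            # pulled toward the outlier
print("Huber slope: ", robust.coef_[0])         # stays close to the true 2.0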
Failure Mode 2: Multicollinearity
Symptom:
- coefficients have unstable magnitudes or signs
- model predictions remain similar, but interpretation becomes nonsense
Diagnostic:
- high correlation between features
- variance inflation factors (VIF) explode (see the sketch after the fix options below)
- coefficients swing wildly across folds
Fix options:
- drop redundant features
- use Ridge regression if prediction is the goal
- avoid coefficient interpretation under collinearity
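A minimal VIF sketch using only NumPy (VIF for a feature is $1/(1 - R^2)$ from regressing that feature on the others); the near-duplicate feature below is an assumed toy setup:
import numpy as np
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly a copy of x1
# VIF for x1: regress x1 on x2 (with intercept), then 1 / (1 - R^2)
A = np.c_[np.ones(n), x2]
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
resid = x1 - A @ coef
r2 = 1.0 - resid.var() / x1.var()
print("VIF(x1):", 1.0 / (1.0 - r2))            # explodes for near-duplicate features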
Failure Mode 3: Data Leakage
Symptom:
- extremely high R² on validation that disappears in production
Diagnostic:
- the model sees future information embedded in features
- time split performance collapses compared to random split
Fix:
- enforce time-aware validation
- audit features for leakage pathways
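One way to enforce time-aware validation, assuming scikit-learn is available; the point is simply that every validation fold sits strictly after its training fold:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
t = np.arange(12).reshape(-1, 1)                # 12 ordered observations, e.g. months
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(t):
    # Training indices always precede validation indices: no future leaks backwards
    print("train:", train_idx, "validate:", val_idx)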
Debugging Narrative: What Usually Happens
What I usually expect the first time is simple: I call linear regression, get coefficients, and move on.
What actually happens is more subtle. The coefficients look reasonable. The R² looks high. And then, weeks later, production drift shows up and the residuals start carrying structure. The mistake is assuming that a clean fit means a valid model. Every time this has bitten me, the fix was the same: restate the assumptions, plot the residuals, and check whether the invariants still held.
7. Variations, Alternatives, and Why I Still Reach for This First
I want to highlight some variations and alternatives to linear regression.
Variations
- Polynomial regression: still linear in parameters, but expands features ($x, x^2, x^3, \dots$). Helps with curvature, increases overfitting risk.
- Weighted least squares: changes objective to weight some errors more than others. Useful when variance changes across range.
- Regularized regression (Ridge/Lasso): changes objective by adding penalty terms. Improves stability, changes coefficient interpretation.
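A sketch of that stability tradeoff under collinearity, assuming scikit-learn and an arbitrary alpha of 1.0:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)      # near-duplicate feature
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)  # split between x1 and x2 is driven by noise, not signal
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)    # shrinks and spreads the shared signal more evenly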
Alternatives
- Decision trees: handle nonlinearities, less interpretable, can be unstable.
- k-NN regression: local averaging, sensitive to scaling, poor extrapolation.
- Gradient boosting: strong performance, low transparency, more operational complexity.
Tradeoffs
- Linear regression is cheap, interpretable, and fast.
- It is fragile under assumption violations and misleading when interpreted casually.
- If interpretability is the goal, assumptions must be defended.
- If prediction is the goal, regularization often dominates.
When Not to Use Linear Regression
- relationship is strongly nonlinear and cannot be made linear via feature engineering
- error distribution is heavy-tailed and outliers are meaningful
- features are highly collinear and coefficients are used for “causal stories”
- leakage risk is high and validation discipline is weak
8. How I Judge Whether a Regression Is Actually Sound
Correct implementation heuristics:
- intercept handling is explicit
- residuals are examined, not ignored
- evaluation uses the correct split (time split when applicable)
- coefficients are stable across folds if interpretation is claimed
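A minimal cross-fold stability check, assuming scikit-learn; the synthetic data and the spread-of-coefficients summary are illustrative choices, not a fixed threshold:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)
# Refit on each fold's training portion and compare the coefficient vectors
coefs = [LinearRegression().fit(X[tr], y[tr]).coef_
         for tr, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
print("per-coefficient spread across folds:", np.std(coefs, axis=0))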
Misuse heuristics:
- using R² as the only validation
- reporting coefficients as “importance” without checking collinearity
- dropping the intercept “because it makes the math simpler”
- training on randomized splits for time-dependent data
Heart vs frills:
- heart: objective, invariants, residual reasoning, boundary correctness
- frills: which library, which solver, which wrapper API
9. What You Should Be Able to Do After This
If you read this article and still cannot complete the test below, it is a strong signal that your understanding is incomplete.
A Simple Test to Know If You Actually Understand This
Task:
- given 5 points in 1D, compute $(w, b)$ by hand using either:
- the two-point slope/intercept logic when appropriate, or
- the closed-form formula for the general case
- then verify your answer by computing predictions and SSE manually
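For reference, the 1D closed-form formulas (derivable by setting the derivatives of the squared-error objective to zero) are:
$$ w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - w\,\bar{x} $$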
If you cannot reproduce the coefficients without a library, you do not own the concept.
When It’s Reasonable to Stop Digging Deeper
You can stop once you can:
- state the assumptions clearly
- fit by hand on a tiny dataset
- diagnose nonlinearity using residual plots
- explain multicollinearity and why it breaks coefficient interpretation
- pick a better alternative when linear regression is the wrong tool
10. How This Is Meant to Be Used Inside Divami
Where this article should be used at Divami:
- as the default baseline reference for any ML modeling task
- during design reviews when someone proposes “a quick regression”
- during debugging when a model is “working” but behaving strangely
- during onboarding for any engineer touching analytics or ML components
Where it should not be used:
- as a justification for causal claims without proper causal design
- as a replacement for validation discipline
- as a “simple model” excuse when nonlinearity is clearly dominant
Ownership and update policy:
- Update when: new internal patterns of misuse appear, or new preferred baselines emerge
- Mark stale when: library defaults change in ways that affect intercept handling or regularization
Retrieval hooks:
- keywords: least squares, OLS, regression baseline, residual analysis, collinearity
- companion artifacts: a minimal notebook and a small repo used in internal trainings
Conclusion
I do not think linear regression is trivial. I think it is deceptively simple.
I still use it constantly. As a baseline, as a diagnostic, and as a way to sanity‑check more complex models. But every time I use it, I explicitly defend its assumptions, because I have seen how misleading it can be when treated casually.
It gives you a baseline, an interpretation surface, and a fast diagnostic tool. It also gives you failure modes that hide behind plausible outputs.
If you take one thing from this article, take this:
- coefficients are not truth
- residuals are the truth
- assumptions are the contract
Before You Use This in a Real Project
Before shipping a linear regression model:
- confirm intercept handling is correct
- plot residuals and look for structure
- verify splits match the real deployment scenario
- check coefficient stability if interpretation is claimed
- defend assumptions explicitly in writing
References
Foundational:
- Gauss, Legendre: least squares origins (historical foundation)
- Linear algebra texts covering normal equations and projections (core geometry)
Practical:
- Applied residual diagnostics writeups and regression checklists (debug discipline)
Libraries and syntax:
- NumPy linalg.lstsq, scikit-learn LinearRegression docs (implementation details)
Comparative:
- Ridge/Lasso references for stability under collinearity (tradeoff neighborhood)