EXAMPLE - Linear Regression, Done Properly
Note: This is an example article demonstrating how to write good technical articles, as described in the guide available here. This article is meant to be opened on the side while reading that guide.
1. Why I Am Writing This
I wrote this article because linear regression is the first model most of us reach for, and the first one most of us quietly misuse.
I have seen it treated as a checkbox baseline, a justification tool, and sometimes as a substitute for thinking. Coefficients get trusted too quickly, R² gets mistaken for validation, and a single library call gets confused with understanding. When these models fail in real projects, it is almost never because linear regression is weak. It is because the assumptions were never made explicit.
Inside Divami, linear regression shows up everywhere:
- as a sanity check baseline before a more complex model
- as a quick estimate for forecasting and capacity planning
- as an interpretable model when stakeholders ask “why”
- as a component inside larger systems (feature calibration, residual modeling, trend extraction)
I wrote this article because I want to make linear regression reusable without mythology. I will define what it guarantees, what it assumes, how to execute it by hand, how it fails, how to diagnose it, and when not to use it.
2. What This Article Covers (and What It Intentionally Does Not)
I want to be clear about what this article covers and what it does not.
This article is not about “machine learning in general.” It is not about neural networks. It is not about building production ML systems. It is a bounded explanation of linear regression as a technique.
What I Am Explicitly Not Doing Here
I will not cover:
- generalized linear models (logistic regression, Poisson regression)
- regularization variants (Ridge/Lasso/Elastic Net), except as alternatives in Section 7
- matrix calculus or full derivations beyond what is necessary
- statistical inference details (p-values, confidence intervals), except when relevant to misuse
What You Need to Already Be Comfortable With
I expect you to be able to:
- read basic Python and NumPy-style array operations
- reason about functions, errors, and gradients at a high level
- follow algebra involving sums and squares
- interpret simple plots (scatter plot, residual plot)
If you cannot follow a squared error objective and a “best fit line” explanation, stop. Learn that first.
How I Think About Linear Regression Before Touching Any Math
I think of linear regression as a transformation from observed data to a linear predictor.
Inputs:
- feature matrix $X$ with shape (n_samples, n_features)
- target vector $y$ with shape (n_samples,)
Outputs:
- coefficients $w$ with shape (n_features,)
- intercept $b$ (optional depending on formulation)
- predictions $\hat{y} = Xw + b$
Guarantee (under stated assumptions):
- the chosen parameters minimize the sum of squared residuals on the given dataset
Mermaid boundary view:
flowchart LR
A["Input: Features X<br/>n x d"] --> B["Linear Model<br/>ŷ = Xw + b"]
C["Input: Targets y<br/>n"] --> D["Objective<br/>min Σ(ŷ - y)^2"]
B --> D
D --> E["Output: Coefficients w,<br/>Intercept b"]
E --> F["Predictions ŷ"]
3. The Language I Will Use (So We Don’t Get Lost Later)
I use the following terms consistently:
- Feature ($x$): an input variable used to predict the target.
- Target ($y$): the value we want to predict.
- Prediction ($\hat{y}$): the model output for a given input.
- Residual ($e$): $e = y - \hat{y}$.
- Loss / Objective: the scalar we minimize; here, sum of squared residuals.
- Least Squares: the method of choosing parameters to minimize squared residuals.
Conventions in this article:
- Individual observations are lowercase scalars ($x_i, y_i$), vectors are plain lowercase ($w$, $y$), matrices are uppercase ($X$).
- I assume an intercept term exists unless explicitly stated otherwise.
- “Assumption” means a condition required for interpretation and stability, not for computing the line.
4. What Linear Regression Actually Guarantees
What I rely on here is the idea that linear regression is not “finding the true relationship.” It is “finding the best linear approximation under squared error.”
Core Concept
I choose parameters $(w, b)$ that minimize:
$$ \sum_i \left( (w \cdot x_i + b) - y_i \right)^2 $$
This is least squares.
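In matrix form, with a design matrix $X$ that includes the intercept column, this objective is minimized by any $\beta$ satisfying the normal equations:
$$ X^\top X \, \beta = X^\top y \quad\Rightarrow\quad \beta = (X^\top X)^{-1} X^\top y \ \text{ when } X^\top X \text{ is invertible.} $$
This is the form the code in Section 5 computes directly.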
Non-Negotiable Invariants
- Objective is squared error. Changing the objective changes the solution.
- Model is linear in parameters. The input can be transformed (polynomials, splines), but the model remains linear in $w$.
- The chosen solution minimizes training squared error. This is a guarantee about optimization on the provided dataset, not about generalization.
- Residuals define the fit. If residuals are structured, the model is missing signal.
Demonstration on Paper
Smallest non-trivial case: one feature, two points.
Points: ($x_1, y_1$) = (0, 1), ($x_2, y_2$) = (2, 5).
A line $y = wx + b$ passing through both points has:
- from $x=0$: $b = 1$
- from $x=2$: $5 = 2w + 1 \Rightarrow w = 2$
So $ \hat{y} = 2x + 1$.
No matrices. No libraries. This is what “fit” means.
Pause-and-Verify Checkpoint
Before computing $w$, I predict its sign:
- if higher $x$ corresponds to higher $y$, slope must be positive.
If my computed slope contradicts this, something is wrong in my setup.
Counter-Example That Looks Valid But Is Wrong
If I force $b = 0$ (no intercept) on the same points:
- I get a line that cannot represent the $x=0, y=1$ point
- the fit will distort the slope to compensate
- coefficients become “wrong” not because the math failed, but because I violated the boundary (missing intercept)
This is a common real-world misuse: people drop the intercept without understanding the consequence.
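A minimal sketch of that distortion, assuming NumPy is available (np.linalg.lstsq is used here only to avoid repeating the hand algebra):
import numpy as np
x = np.array([0.0, 2.0])
y = np.array([1.0, 5.0])
# With intercept: design matrix [1, x]
X_full = np.c_[np.ones_like(x), x]
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print("with intercept -> b, w =", beta_full)   # [1.0, 2.0], the exact fit
# Without intercept: design matrix [x] only, forcing the line through the origin
X_no_b = x.reshape(-1, 1)
beta_no_b, *_ = np.linalg.lstsq(X_no_b, y, rcond=None)
print("no intercept   -> w =", beta_no_b)      # slope distorted to 2.5 to compensate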
5. Making It Concrete Without Hiding Behind Libraries
I want to show that the invariants survive implementation, not just theory.
Minimal Reproducible Notebook Logic
Python sketch (runnable as-is with NumPy):
import numpy as np
# Toy data
X = np.array([[0.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 5.0, 7.0, 9.0])
# Add intercept column
X_design = np.c_[np.ones(len(X)), X] # [1, x]
# Closed-form least squares: beta = (X^T X)^{-1} (X^T y)
beta = np.linalg.inv(X_design.T @ X_design) @ (X_design.T @ y)
b, w = beta[0], beta[1]
y_hat = X_design @ beta
residuals = y - y_hat
print("b:", b, "w:", w)
print("SSE:", np.sum(residuals**2))
Mapping Back to Invariants
- I am explicitly minimizing squared error by using the least squares closed form.
- The model is linear in parameters ($b, w$).
- Residuals define fit quality, not coefficient aesthetics.
Signals and Diagnostics
Expected signals on clean linear data:
- SSE should be near zero if points lie on a line.
- Residuals should have no pattern when plotted against $x$.
Failure signals:
- residuals grow with $x$ (heteroscedasticity)
- residuals curve (nonlinearity)
- coefficients change drastically with small data perturbations (instability)
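A minimal residual-plot sketch, assuming matplotlib is available and reusing X and residuals from the snippet above; the only goal is to eyeball structure:
import matplotlib.pyplot as plt
# Residuals vs. the feature: a flat, patternless band around zero is the healthy signal
plt.scatter(X[:, 0], residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual (y - y_hat)")
plt.show()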
6. How This Usually Breaks in Real Projects
What usually goes wrong when I apply this is that linear regression fails silently. That is the real danger.
Failure Mode 1: Outliers Dominate
Symptom:
- one extreme point rotates the line
Diagnostic:
- residual plot shows one point with massive error
- coefficients shift drastically after removing that point
Fix options:
- robust regression (Huber, RANSAC)
- winsorization or outlier handling (only if justified)
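As one possible shape of the robust-regression fix, assuming scikit-learn is available (HuberRegressor shown; RANSACRegressor follows the same fit/predict pattern):
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=50)
y[-1] += 40.0                                   # one extreme point at the far right
ols = LinearRegression().fit(x, y)
robust = HuberRegressor().fit(x, y)
print("OLS slope:   ", ols.coef_[0])            # pulled toward the outlier
print("Huber slope: ", robust.coef_[0])         # stays close to the true 2.0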
Failure Mode 2: Multicollinearity
Symptom:
- coefficients have unstable magnitudes or signs
- model predictions remain similar, but interpretation becomes nonsense
Diagnostic:
- high correlation between features
- variance inflation factors (VIF) explode (see the sketch after the fix options below)
- coefficients swing wildly across folds
Fix options:
- drop redundant features
- use Ridge regression if prediction is the goal
- avoid coefficient interpretation under collinearity
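A minimal VIF sketch using only NumPy (VIF for a feature is $1/(1 - R^2)$ from regressing that feature on the others); the near-duplicate feature below is an assumed toy setup:
import numpy as np
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly a copy of x1
# VIF for x1: regress x1 on x2 (with intercept), then 1 / (1 - R^2)
A = np.c_[np.ones(n), x2]
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
resid = x1 - A @ coef
r2 = 1.0 - resid.var() / x1.var()
print("VIF(x1):", 1.0 / (1.0 - r2))            # explodes for near-duplicate features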
Failure Mode 3: Data Leakage
Symptom:
- extremely high R² on validation that disappears in production
Diagnostic:
- the model sees future information embedded in features
- time split performance collapses compared to random split
Fix:
- enforce time-aware validation
- audit features for leakage pathways
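One way to enforce time-aware validation, assuming scikit-learn is available; the point is simply that every validation fold sits strictly after its training fold:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
t = np.arange(12).reshape(-1, 1)                # 12 ordered observations, e.g. months
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(t):
    # Training indices always precede validation indices: no future leaks backwards
    print("train:", train_idx, "validate:", val_idx)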
Debugging Narrative: What Usually Happens
What I usually expect the first time is simple: I call linear regression, get coefficients, and move on.
What actually happens is more subtle. The coefficients look reasonable. The R² looks high. And then, weeks later, production drift shows up and the residuals start carrying structure. The mistake is assuming that a clean fit means a valid model. Every time this has bitten me, the fix was the same: restate the assumptions, plot the residuals, and check whether the invariants still held.
7. Variations, Alternatives, and Why I Still Reach for This First
I want to highlight some variations and alternatives to linear regression.
Variations
- Polynomial regression: still linear in parameters, but expands features ($x, x^2, x^3, \dots$). Helps with curvature, increases overfitting risk.
- Weighted least squares: changes objective to weight some errors more than others. Useful when variance changes across range.
- Regularized regression (Ridge/Lasso): changes objective by adding penalty terms. Improves stability, changes coefficient interpretation.
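A sketch of that stability tradeoff under collinearity, assuming scikit-learn and an arbitrary alpha of 1.0:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)      # near-duplicate feature
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)  # split between x1 and x2 is driven by noise, not signal
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)    # shrinks and spreads the shared signal more evenly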
Alternatives
- Decision trees: handle nonlinearities, less interpretable, can be unstable.
- k-NN regression: local averaging, sensitive to scaling, poor extrapolation.
- Gradient boosting: strong performance, low transparency, more operational complexity.
Tradeoffs
- Linear regression is cheap, interpretable, and fast.
- It is fragile under assumption violations and misleading when interpreted casually.
- If interpretability is the goal, assumptions must be defended.
- If prediction is the goal, regularization often dominates.
When Not to Use Linear Regression
- relationship is strongly nonlinear and cannot be made linear via feature engineering
- error distribution is heavy-tailed and outliers are meaningful
- features are highly collinear and coefficients are used for “causal stories”
- leakage risk is high and validation discipline is weak
8. How I Judge Whether a Regression Is Actually Sound
Correct implementation heuristics:
- intercept handling is explicit
- residuals are examined, not ignored
- evaluation uses the correct split (time split when applicable)
- coefficients are stable across folds if interpretation is claimed
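A minimal cross-fold stability check, assuming scikit-learn; the synthetic data and the spread-of-coefficients summary are illustrative choices, not a fixed threshold:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)
# Refit on each fold's training portion and compare the coefficient vectors
coefs = [LinearRegression().fit(X[tr], y[tr]).coef_
         for tr, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
print("per-coefficient spread across folds:", np.std(coefs, axis=0))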
Misuse heuristics:
- using R² as the only validation
- reporting coefficients as “importance” without checking collinearity
- dropping the intercept “because it makes the math simpler”
- training on randomized splits for time-dependent data
Heart vs frills:
- heart: objective, invariants, residual reasoning, boundary correctness
- frills: which library, which solver, which wrapper API
9. What You Should Be Able to Do After This
If you read this article and still cannot complete the test below, it is a strong signal that your understanding is incomplete.
A Simple Test to Know If You Actually Understand This
Task:
- given 5 points in 1D, compute $(w, b)$ by hand using either:
- the two-point slope/intercept logic when appropriate, or
- the closed-form formula for the general case
- then verify your answer by computing predictions and SSE manually
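For reference, the 1D closed-form formulas (derivable by setting the derivatives of the squared-error objective to zero) are:
$$ w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - w\,\bar{x} $$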
If you cannot reproduce the coefficients without a library, you do not own the concept.
When It’s Reasonable to Stop Digging Deeper
You can stop once you can:
- state the assumptions clearly
- fit by hand on a tiny dataset
- diagnose nonlinearity using residual plots
- explain multicollinearity and why it breaks coefficient interpretation
- pick a better alternative when linear regression is the wrong tool
10. How This Is Meant to Be Used Inside Divami
Where this article should be used at Divami:
- as the default baseline reference for any ML modeling task
- during design reviews when someone proposes “a quick regression”
- during debugging when a model is “working” but behaving strangely
- during onboarding for any engineer touching analytics or ML components
Where it should not be used:
- as a justification for causal claims without proper causal design
- as a replacement for validation discipline
- as a “simple model” excuse when nonlinearity is clearly dominant
Ownership and update policy:
- Update when: new internal patterns of misuse appear, or new preferred baselines emerge
- Mark stale when: library defaults change in ways that affect intercept handling or regularization
Retrieval hooks:
- keywords: least squares, OLS, regression baseline, residual analysis, collinearity
- companion artifacts: a minimal notebook and a small repo used in internal trainings
Conclusion
I do not think linear regression is trivial. I think it is deceptively simple.
I still use it constantly. As a baseline, as a diagnostic, and as a way to sanity‑check more complex models. But every time I use it, I explicitly defend its assumptions, because I have seen how misleading it can be when treated casually.
It gives you a baseline, an interpretation surface, and a fast diagnostic tool. It also gives you failure modes that hide behind plausible outputs.
If you take one thing from this article, take this:
- coefficients are not truth
- residuals are the truth
- assumptions are the contract
Before You Use This in a Real Project
Before shipping a linear regression model:
- confirm intercept handling is correct
- plot residuals and look for structure
- verify splits match the real deployment scenario
- check coefficient stability if interpretation is claimed
- defend assumptions explicitly in writing
References
Foundational:
- Gauss, Legendre: least squares origins (historical foundation)
- Linear algebra texts covering normal equations and projections (core geometry)
Practical:
- Applied residual diagnostics writeups and regression checklists (debug discipline)
Libraries and syntax:
- NumPy linalg.lstsq, scikit-learn LinearRegression docs (implementation details)
Comparative:
- Ridge/Lasso references for stability under collinearity (tradeoff neighborhood)