Loss Functions: Measuring Model Performance

Introduction

Loss functions are the compass that guides neural networks during training. They quantify how far your model's predictions are from the true values, providing a single numerical score that optimization algorithms use to improve the model. The choice of loss function can make or break your model's performance—use the wrong one, and your model might never converge or learn the wrong patterns entirely.

Every training iteration follows the same pattern: forward pass → compute loss → backward pass → update weights. The loss function sits at the heart of this cycle, translating the gap between predictions and reality into gradients that drive learning.

graph LR
    A[Input Data] --> B[Forward Pass]
    B --> C[Model Predictions]
    C --> D[Compute Loss]
    E[True Labels] --> D
    D --> F[Backward Pass]
    F --> G[Compute Gradients]
    G --> H[Update Weights]
    H --> B
    style D fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px,color:#fff

In this article, we'll explore the most commonly used loss functions, their mathematical foundations, when to use each one, and how to avoid common pitfalls that can derail training.

Loss Functions

What is a Loss Function?

A loss function (also called a cost function or objective function) is a mathematical function that measures the discrepancy between predicted values ($\hat{y}$) and true values ($y$). It outputs a single scalar value that represents how well (or poorly) your model is performing.

Mathematical Definition:

$$ \mathcal{L} = f(\hat{y}, y) $$

Where:

  • $\mathcal{L}$ is the loss value
  • $\hat{y}$ are the model's predictions
  • $y$ are the true target values
  • $f$ is the loss function

The Training Objective: Minimize the loss function

$$ \theta^* = \arg\min_{\theta} \mathcal{L}(f_\theta(x), y) $$

Where $\theta$ represents the model's parameters (weights and biases).
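
To make this concrete, here is a minimal sketch of a single optimization step on a toy linear model (the data and learning rate are made up purely for illustration):

import torch
import torch.nn as nn

# Hypothetical data: a tiny linear model f_theta(x) = w * x + b
x = torch.tensor([[1.0], [2.0], [3.0]])
y = torch.tensor([[2.0], [4.0], [6.0]])

model = nn.Linear(1, 1)          # theta = (w, b)
criterion = nn.MSELoss()         # L(f_theta(x), y)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

loss = criterion(model(x), y)    # forward pass + compute loss
loss.backward()                  # gradients of the loss w.r.t. theta
optimizer.step()                 # move theta toward lower loss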


Reduction Methods

Most loss functions compute a loss value for each sample in a batch. PyTorch provides three ways to aggregate these individual losses:

| Reduction | Formula | Use Case |
|---|---|---|
| mean (default) | $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell_i$ | Most common; provides stable gradients |
| sum | $\mathcal{L} = \sum_{i=1}^{N} \ell_i$ | When you want absolute loss magnitude; rare |
| none | $\mathcal{L} = [\ell_1, \ell_2, ..., \ell_N]$ | Custom aggregation or per-sample analysis |

import torch.nn as nn

# Different reduction modes
criterion_mean = nn.MSELoss(reduction='mean')   # Default
criterion_sum = nn.MSELoss(reduction='sum')
criterion_none = nn.MSELoss(reduction='none')   # Returns loss per sample

Loss functions fall into two broad families: regression losses for continuous predictions (e.g., prices) and classification losses for discrete labels (e.g., spam/not spam).


Regression Loss Functions

Regression tasks involve predicting continuous values (house prices, temperature, stock prices). The choice of loss function determines how the model penalizes prediction errors.

graph LR
    A[Regression Tasks] --> B[MSE]
    A --> C[MAE]
    A --> D[Smooth L1]
    B --> E[Quadratic Penalty<br/>Sensitive to outliers]
    C --> F[Linear Penalty<br/>Robust to outliers]
    D --> G[Hybrid Approach<br/>Best of both]
    style B fill:#fa5252,stroke:#c92a2a
    style C fill:#51cf66,stroke:#2f9e44
    style D fill:#ffd43b,stroke:#f59f00

1. Mean Squared Error (MSE) / L2 Loss

Mathematical Formula:

$$ \text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$

Derivative:

$$ \frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i) $$
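
In practice autograd computes this gradient for you, but a quick check with made-up numbers confirms the formula:

import torch
import torch.nn.functional as F

preds = torch.tensor([2.5, 3.0, 4.5], requires_grad=True)
targets = torch.tensor([2.0, 3.5, 4.0])

loss = F.mse_loss(preds, targets)   # mean reduction
loss.backward()

print(preds.grad)                              # autograd gradient
print(2.0 / 3 * (preds.detach() - targets))    # analytic 2/N * (y_hat - y)
# Both print tensor([ 0.3333, -0.3333,  0.3333])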

Key Characteristics:

  • Quadratic penalty: Errors are squared, so large errors are penalized much more heavily than small ones
  • Differentiable everywhere: Smooth gradients throughout
  • Scale-dependent: Sensitive to the scale of the target variable

Advantages:

  • Most commonly used regression loss
  • Unique optimal solution (convex for linear models)
  • Penalizes large errors heavily, encouraging accurate predictions
  • Well-understood mathematical properties
  • Differentiable everywhere

Disadvantages:

  • Very sensitive to outliers: A single outlier can dominate the loss
  • Can lead to exploding gradients if predictions are far from targets
  • Not robust to noisy data
  • Assumes errors are normally distributed

Best Use Cases:

  • Clean datasets without outliers
  • When large errors are significantly worse than small errors
  • Default choice for most regression problems
  • When you want to minimize variance in predictions

PyTorch Implementation:

import torch
import torch.nn as nn

# Basic usage
predictions = torch.tensor([2.5, 3.0, 4.5])
targets = torch.tensor([2.0, 3.5, 4.0])

mse_loss = nn.MSELoss()
loss = mse_loss(predictions, targets)
print(f"MSE Loss: {loss.item():.4f}")
# Output: MSE Loss: 0.2500
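
The value checks out by hand: the errors are $0.5$, $-0.5$, and $0.5$, so $\text{MSE} = \frac{0.5^2 + 0.5^2 + 0.5^2}{3} = 0.25$.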

Practical Example:

import torch
import torch.nn as nn

class RegressionModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)  # Single output for regression
        )
    
    def forward(self, x):
        return self.model(x)

# Training setup (X_train / y_train are placeholders for your data;
# random tensors are used here so the example runs end to end)
X_train = torch.randn(256, 10)
y_train = torch.randn(256, 1)

model = RegressionModel(input_size=10)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    predictions = model(X_train)
    loss = criterion(predictions, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

2. Mean Absolute Error (MAE) / L1 Loss

Mathematical Formula:

$$ \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| $$

Derivative:

$$ \frac{\partial L}{\partial \hat{y}_i} = \frac{1}{N} \cdot \text{sgn}(\hat{y}_i - y_i) $$

Key Characteristics:

  • Linear penalty: All errors weighted equally regardless of magnitude
  • Robust to outliers: Outliers don't dominate the loss
  • Constant gradient: Gradient magnitude is the same for all errors

Advantages:

  • Robust to outliers: Outliers have linear (not quadratic) impact
  • More interpretable: Loss value is in the same units as target variable
  • Stable gradients (constant magnitude)
  • Better for data with many outliers or heavy-tailed distributions

Disadvantages:

  • Gradient is constant, which can lead to instability near convergence (large updates even when close to optimum)
  • Not differentiable at zero (but subgradient works in practice)
  • May converge slower than MSE for clean data
  • Can oscillate around the minimum without proper learning rate decay

Best Use Cases:

  • Datasets with outliers or noisy labels
  • When all errors should be weighted equally
  • Predicting medians rather than means
  • Robust regression tasks

PyTorch Implementation:

import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 3.0, 10.0])  # 10.0 is an outlier
targets = torch.tensor([2.0, 3.5, 4.0])

# MSE is heavily affected by the outlier
mse_loss = nn.MSELoss()
print(f"MSE Loss: {mse_loss(predictions, targets).item():.4f}")
# Output: MSE Loss: 12.1667 (dominated by outlier)

# MAE is more robust
mae_loss = nn.L1Loss()
print(f"MAE Loss: {mae_loss(predictions, targets).item():.4f}")
# Output: MAE Loss: 2.3333 (less affected by outlier)

Comparison: MSE vs MAE with Outliers

import torch
import torch.nn as nn

# Predictions with one outlier
predictions = torch.tensor([2.1, 3.0, 2.9, 3.2, 15.0])  # 15.0 is outlier
targets = torch.tensor([2.0, 3.0, 3.0, 3.0, 3.0])

mse = nn.MSELoss()(predictions, targets)
mae = nn.L1Loss()(predictions, targets)

print(f"MSE: {mse:.2f} (heavily penalizes outlier)")
print(f"MAE: {mae:.2f} (more robust to outlier)")
# MSE: 28.81
# MAE: 2.48

Figure: MSE vs. MAE loss curves
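
This also illustrates why MAE is said to predict medians while MSE predicts means. A small sketch with made-up targets: for a single constant prediction, the value that minimizes MSE is the mean of the targets, while the value that minimizes MAE is the median.

import torch

targets = torch.tensor([1.0, 2.0, 3.0, 4.0, 100.0])   # one extreme value

c = torch.linspace(0.0, 100.0, 10001)                  # candidate constant predictions
mse = ((c.unsqueeze(1) - targets) ** 2).mean(dim=1)
mae = (c.unsqueeze(1) - targets).abs().mean(dim=1)

print(f"MSE minimized at c = {c[mse.argmin()].item():.2f}")  # ~22.00 (the mean)
print(f"MAE minimized at c = {c[mae.argmin()].item():.2f}")  # ~3.00  (the median)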

3. Smooth L1 Loss (Huber Loss)

Mathematical Formula:

$$ \text{SmoothL1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < \beta \\ \beta(|x| - 0.5\beta) & \text{if } |x| \geq \beta \end{cases} $$

Where $x = \hat{y} - y$ and $\beta$ is a threshold (default: 1.0 in PyTorch)
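
Note that PyTorch's nn.SmoothL1Loss uses $0.5x^2/\beta$ in the quadratic branch, which coincides with the formula above for the default $\beta = 1.0$. A small sketch comparing a manual implementation against the built-in loss (illustrative values):

import torch
import torch.nn as nn

def smooth_l1_manual(x, beta=1.0):
    # Quadratic for small errors, linear for large ones
    return torch.where(x.abs() < beta, 0.5 * x ** 2 / beta, x.abs() - 0.5 * beta)

pred = torch.tensor([2.5, 3.0, 10.0])
target = torch.tensor([2.0, 3.5, 4.0])

manual = smooth_l1_manual(pred - target).mean()
builtin = nn.SmoothL1Loss(beta=1.0)(pred, target)
print(manual.item(), builtin.item())  # Both ~1.9167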

Key Characteristics:

  • Hybrid approach: Combines MSE (small errors) and MAE (large errors)
  • Smooth transition: Differentiable everywhere
  • Less sensitive to outliers than MSE but converges faster than MAE

Advantages:

  • Best of both worlds: Fast convergence + outlier robustness
  • Smooth gradients (unlike MAE at zero)
  • Less sensitive to outliers than MSE
  • Controlled gradient magnitude for large errors
  • Used successfully in object detection (Faster R-CNN)

Disadvantages:

  • Requires tuning $\beta$ hyperparameter
  • More complex than MSE or MAE
  • Less interpretable
  • Slightly more computational overhead

Best Use Cases:

  • Object detection (bounding box regression)
  • When you expect some outliers but want faster convergence than MAE
  • Tasks requiring both precision (small errors) and robustness (large errors)
  • Default choice for many computer vision regression tasks

PyTorch Implementation:

import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 3.0, 10.0])
targets = torch.tensor([2.0, 3.5, 4.0])

smooth_l1 = nn.SmoothL1Loss(beta=1.0)
loss = smooth_l1(predictions, targets)
print(f"Smooth L1 Loss: {loss.item():.4f}")
# Output: Smooth L1 Loss: 1.9167

# Compare with MSE and MAE
print(f"MSE Loss: {nn.MSELoss()(predictions, targets).item():.4f}")
# Output: MSE Loss: 12.1667
print(f"MAE Loss: {nn.L1Loss()(predictions, targets).item():.4f}")
# Output: MAE Loss: 2.3333

Visualizing the Difference:

import torch
import matplotlib.pyplot as plt

errors = torch.linspace(-3, 3, 100)

# MSE: Quadratic everywhere
mse = 0.5 * errors ** 2

# MAE: Linear everywhere
mae = torch.abs(errors)

# Smooth L1: Quadratic near zero, linear for large errors
smooth_l1 = torch.where(
    torch.abs(errors) < 1,
    0.5 * errors ** 2,
    torch.abs(errors) - 0.5
)

# Plot all three: the Smooth L1 curve follows MSE near zero and MAE in the tails,
# combining the best of both
plt.plot(errors, mse, label='MSE')
plt.plot(errors, mae, label='MAE')
plt.plot(errors, smooth_l1, label='Smooth L1')
plt.xlabel('Prediction error')
plt.ylabel('Loss')
plt.legend()
plt.show()

Figure: Huber (Smooth L1) loss function


Classification Loss Functions

Classification tasks involve predicting discrete categories. The loss function must handle probability distributions and class labels effectively.

4. Binary Cross-Entropy with Logits (BCEWithLogitsLoss)

Mathematical Formula:

$$ \text{BCEWithLogits} = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\sigma(z_i)) + (1-y_i)\log(1-\sigma(z_i))] $$

Where:

  • $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function
  • $z$ are raw logits (model outputs)
  • $y_i \in \{0, 1\}$ are true labels

Key Characteristics:

This loss combines sigmoid and binary cross-entropy in a single numerically stable operation:

  • Numerical stability: Uses log-sum-exp trick to avoid overflow/underflow
  • Better gradients: Prevents gradient saturation
  • Simpler architecture: No sigmoid needed in model
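
As a quick sanity check with illustrative logits, the fused loss produces the same value as applying sigmoid followed by nn.BCELoss; it just computes it more stably:

import torch
import torch.nn as nn

logits = torch.tensor([[2.3], [-1.5], [3.0]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

fused = nn.BCEWithLogitsLoss()(logits, targets)
separate = nn.BCELoss()(torch.sigmoid(logits), targets)
print(fused.item(), separate.item())  # Same value (up to floating-point error)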

Advantages:

  • Numerically stable: Uses log-sum-exp trick internally
  • More stable gradients during training
  • No need for sigmoid in model (cleaner architecture)
  • Recommended over plain BCE in almost all cases
  • Can handle pos_weight for class imbalance

Best Use Cases:

  • Binary classification (2 classes)
  • Multi-label classification (multiple independent binary predictions)
  • When dealing with class imbalance (use pos_weight parameter)
  • Default choice for any binary classification task

PyTorch Implementation:

import torch
import torch.nn as nn

# Raw logits (NO sigmoid in model!)
logits = torch.tensor([[2.3], [-1.5], [3.0]])  # Raw model outputs
targets = torch.tensor([[1.0], [0.0], [1.0]])

bce_with_logits = nn.BCEWithLogitsLoss()
loss = bce_with_logits(logits, targets)
print(f"BCE with Logits: {loss.item():.4f}")

# Recommended model structure
class BinaryClassifierCorrect(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
            # NO sigmoid here!
        )
    
    def forward(self, x):
        return self.model(x)

# Use with BCEWithLogitsLoss
model = BinaryClassifierCorrect(10)
criterion = nn.BCEWithLogitsLoss()

Handling Class Imbalance:

# If positive class is rare (e.g., 1:10 ratio)
# Increase weight for positive class
pos_weight = torch.tensor([10.0])  # 10x weight for positive class
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
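
A common heuristic is to set pos_weight to the ratio of negative to positive samples in the training labels. A small sketch, assuming a tensor of 0/1 labels:

import torch
import torch.nn as nn

# Hypothetical binary labels (0 = negative, 1 = positive)
targets = torch.tensor([0., 0., 0., 0., 1.])

num_pos = targets.sum()
num_neg = targets.numel() - num_pos
pos_weight = (num_neg / num_pos).reshape(1)   # one weight per output unit; here 4.0

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)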

5. Cross-Entropy Loss

Mathematical Formula:

For multi-class classification with $C$ classes:

$$ \text{CrossEntropy} = -\frac{1}{N}\sum_{i=1}^{N} \log\left(\frac{e^{z_{i,y_i}}}{\sum_{j=1}^{C} e^{z_{i,j}}}\right) $$

Where:

  • $z$ are raw logits (model outputs)
  • $y_i$ is the true class index
  • The formula includes softmax internally

Expanded Form:

$$ \text{CrossEntropy} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) $$

Where $y_{i,c}$ is 1 if sample $i$ belongs to class $c$, else 0.

Key Characteristics:

  • Combines softmax and negative log-likelihood
  • Expects raw logits (no softmax in model!)
  • Target can be class indices or probabilities
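
The "softmax + negative log-likelihood" claim is easy to verify with a couple of arbitrary logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 1.0]])
targets = torch.tensor([0, 1])

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())  # Identical values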

Advantages:

  • Most common multi-class classification loss
  • Numerically stable (log-sum-exp trick)
  • Probabilistic interpretation
  • Works well for most classification tasks
  • Supports class weights for imbalance

Disadvantages:

  • Assumes classes are mutually exclusive
  • Can struggle with severe class imbalance
  • Treats all misclassifications equally
  • May not be ideal for hierarchical classification

Best Use Cases:

  • Multi-class classification (>2 classes)
  • Image classification
  • Text classification
  • Any task with mutually exclusive classes

PyTorch Implementation:

import torch
import torch.nn as nn

# Raw logits for 3 classes (NO softmax in model!)
logits = torch.tensor([
    [2.0, 1.0, 0.1],  # Sample 1
    [0.5, 2.5, 1.0],  # Sample 2
    [1.0, 0.5, 3.0]   # Sample 3
])

# Targets are class indices
targets = torch.tensor([0, 1, 2])  # Ground truth classes

ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# Complete model example
class MultiClassClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes)
            # NO softmax here! CrossEntropyLoss applies it internally
        )
    
    def forward(self, x):
        return self.model(x)

# Training setup (X_batch / y_batch are placeholders for a batch of your data;
# random tensors are used here so the snippet runs end to end)
X_batch = torch.randn(16, 20)
y_batch = torch.randint(0, 10, (16,))

model = MultiClassClassifier(input_size=20, num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Forward pass
outputs = model(X_batch)  # Raw logits, shape (batch_size, num_classes)
loss = criterion(outputs, y_batch)  # Targets are class indices, shape (batch_size,)

Handling Class Imbalance:

# If classes are imbalanced [100, 50, 10 samples]
# Compute class weights inversely proportional to frequency
class_counts = torch.tensor([100, 50, 10])
class_weights = 1.0 / class_counts
class_weights = class_weights / class_weights.sum()  # Normalize

criterion = nn.CrossEntropyLoss(weight=class_weights)

6. Focal Loss

Mathematical Formula:

$$ \text{FocalLoss} = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$

Where:

  • $p_t$ is the model's estimated probability for the true class
  • $\gamma$ is the focusing parameter ($\gamma \geq 0$)
  • $\alpha_t$ is a weighting factor (optional)

Key Characteristics:

  • Down-weights easy examples
  • Focuses training on hard, misclassified examples
  • $\gamma$ controls how much to down-weight easy examples

Advantages:

  • Excellent for severe class imbalance (e.g., object detection)
  • Focuses on hard-to-classify examples
  • Reduces impact of easy negatives
  • Used successfully in RetinaNet, YOLO variants

Disadvantages:

  • Requires tuning $\gamma$ and $\alpha$ hyperparameters
  • More complex than standard cross-entropy
  • Can be sensitive to hyperparameter choices
  • Not built into torch.nn (requires a custom implementation; torchvision.ops provides sigmoid_focal_loss for the binary/multi-label case)

Best Use Cases:

  • Severe class imbalance (1:1000+ ratios)
  • Object detection
  • When one class dominates the dataset
  • Medical imaging (rare disease detection)

PyTorch Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        """
        inputs: logits (raw model outputs)
        targets: class indices
        """
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        p_t = torch.exp(-ce_loss)  # Probability of true class
        focal_loss = self.alpha * (1 - p_t) ** self.gamma * ce_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

# Usage (MultiClassClassifier, X_batch, and y_batch come from the CrossEntropyLoss example above)
model = MultiClassClassifier(input_size=20, num_classes=10)
criterion = FocalLoss(alpha=0.25, gamma=2.0)

outputs = model(X_batch)
loss = criterion(outputs, y_batch)

Comparison: Cross-Entropy vs Focal Loss:

import torch
import torch.nn.functional as F

# FocalLoss is the custom class defined above

# Easy example (high confidence, correct prediction)
easy_logits = torch.tensor([[5.0, 0.1, 0.1]])  # Very confident in class 0
easy_target = torch.tensor([0])

ce_easy = F.cross_entropy(easy_logits, easy_target)
focal_easy = FocalLoss()(easy_logits, easy_target)

# Hard example (low confidence, correct prediction)
hard_logits = torch.tensor([[0.6, 0.5, 0.4]])  # Uncertain
hard_target = torch.tensor([0])

ce_hard = F.cross_entropy(hard_logits, hard_target)
focal_hard = FocalLoss()(hard_logits, hard_target)

print("Easy Example:")
print(f"  CE Loss: {ce_easy:.4f}, Focal Loss: {focal_easy:.4f}")
print("Hard Example:")
print(f"  CE Loss: {ce_hard:.4f}, Focal Loss: {focal_hard:.4f}")

# Focal loss down-weights easy examples, focuses on hard ones

Loss Function Comparison Table

| Loss Function | Task Type | Input Type | Key Feature | Best For |
|---|---|---|---|---|
| MSE | Regression | Continuous values | Penalizes large errors heavily | Clean data, no outliers |
| MAE | Regression | Continuous values | Robust to outliers | Noisy data, outliers present |
| Smooth L1 | Regression | Continuous values | Hybrid MSE+MAE | Object detection, robust regression |
| BCEWithLogits | Binary classification | Logits (raw) | Numerically stable | Binary/multi-label classification |
| CrossEntropy | Multi-class classification | Logits (raw) | Combines softmax+NLL | Image/text classification |
| Focal Loss | Classification (imbalanced) | Logits (raw) | Focuses on hard examples | Severe class imbalance |

Choosing the Right Loss Function

Decision Tree

graph TD
    A[What's your task?] --> B{Classification or Regression?}
    B -->|Regression| C{Data quality?}
    C -->|Clean, no outliers| D[MSE Loss]
    C -->|Has outliers| E[MAE or Smooth L1]
    C -->|Object detection| F[Smooth L1 Loss]
    
    B -->|Classification| G{How many classes?}
    G -->|Binary 2 classes| H{Class balance?}
    H -->|Balanced| I[BCEWithLogitsLoss]
    H -->|Imbalanced| J[BCEWithLogitsLoss<br/>+ pos_weight]
    
    G -->|Multi-class >2| K{Class balance?}
    K -->|Balanced| L[CrossEntropyLoss]
    K -->|Moderate imbalance| M[CrossEntropyLoss<br/>+ class weights]
    K -->|Severe imbalance<br/>1:100+| N[Focal Loss]
    
    style D fill:#4dabf7,stroke:#1971c2,stroke-width:2px
    style E fill:#4dabf7,stroke:#1971c2,stroke-width:2px
    style F fill:#4dabf7,stroke:#1971c2,stroke-width:2px
    style I fill:#51cf66,stroke:#2f9e44,stroke-width:2px
    style J fill:#51cf66,stroke:#2f9e44,stroke-width:2px
    style L fill:#51cf66,stroke:#2f9e44,stroke-width:2px
    style M fill:#51cf66,stroke:#2f9e44,stroke-width:2px
    style N fill:#ffd43b,stroke:#f59f00,stroke-width:2px

Quick Reference Guide

| Scenario | Recommended Loss | Alternative |
|---|---|---|
| Regression, clean data | MSE | Smooth L1 |
| Regression, outliers present | MAE | Huber Loss |
| Binary classification | BCEWithLogitsLoss | Focal Loss (imbalanced) |
| Multi-class classification | CrossEntropyLoss | Focal Loss (imbalanced) |
| Multi-label classification | BCEWithLogitsLoss | Focal Loss |
| Severe class imbalance (1:100+) | Focal Loss | Weighted CrossEntropy |
| Object detection | Smooth L1 (bbox), Focal (class) | - |
| Metric learning | Triplet Loss | Contrastive Loss |



Common Mistakes and How to Avoid Them

Mistake 1: Applying Softmax Before CrossEntropyLoss

# ❌ WRONG: CrossEntropyLoss applies softmax internally
class WrongModel(nn.Module):
    def forward(self, x):
        x = self.layers(x)
        return F.softmax(x, dim=1)  # DON'T DO THIS

criterion = nn.CrossEntropyLoss()

# ✅ CORRECT: Return raw logits
class CorrectModel(nn.Module):
    def forward(self, x):
        return self.layers(x)  # Raw logits

criterion = nn.CrossEntropyLoss()  # Applies softmax internally

Mistake 2: Wrong Target Format

# ❌ WRONG: CrossEntropyLoss expects class indices, not one-hot vectors
logits = torch.randn(2, 3)  # 2 samples, 3 classes
targets = torch.tensor([[1, 0, 0], [0, 1, 0]])  # One-hot encoded
loss = nn.CrossEntropyLoss()(logits, targets)  # ERROR!

# ✅ CORRECT: Use class indices
targets = torch.tensor([0, 1])  # Class indices
loss = nn.CrossEntropyLoss()(logits, targets)
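
One caveat: recent PyTorch releases (1.10+) also accept floating-point class probabilities as targets, provided they have the same shape as the logits; integer one-hot tensors remain an error. A brief sketch, reusing the logits above:

# Accepted in PyTorch >= 1.10: float class probabilities, same shape as logits
soft_targets = torch.tensor([[0.9, 0.05, 0.05],
                             [0.1, 0.80, 0.10]])
loss = nn.CrossEntropyLoss()(logits, soft_targets)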

Mistake 3: Ignoring Class Imbalance

# ❌ WRONG: Using standard loss with severe imbalance
criterion = nn.CrossEntropyLoss()  # Treats all classes equally

# ✅ CORRECT: Use class weights
class_weights = torch.tensor([0.1, 1.0, 5.0])  # Weight rare classes more
criterion = nn.CrossEntropyLoss(weight=class_weights)

# OR use Focal Loss for severe imbalance
criterion = FocalLoss(alpha=0.25, gamma=2.0)

Mistake 4: Wrong Reduction Mode

# ❌ WRONG: Using sum reduction with varying batch sizes
criterion = nn.MSELoss(reduction='sum')
# Loss magnitude changes with batch size!

# ✅ CORRECT: Use mean reduction (default)
criterion = nn.MSELoss(reduction='mean')
# Loss magnitude is consistent across batch sizes
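
A quick way to see this with random data: with reduction='mean' the loss stays on the same scale as the batch shrinks, while with reduction='sum' it scales with the number of samples.

import torch
import torch.nn as nn

preds, targets = torch.randn(64, 1), torch.randn(64, 1)

for reduction in ['mean', 'sum']:
    loss_full = nn.MSELoss(reduction=reduction)(preds, targets)
    loss_half = nn.MSELoss(reduction=reduction)(preds[:32], targets[:32])
    print(reduction, loss_full.item(), loss_half.item())
# 'mean': similar values; 'sum': the full batch is roughly 2x the half batch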

Handling Class Imbalance

Class imbalance is one of the most common challenges in real-world classification tasks. Here are effective strategies:

Strategy 1: Class Weights

import torch
import torch.nn as nn
from collections import Counter

# Calculate class weights inversely proportional to frequency
def compute_class_weights(targets):
    counts = Counter(targets.numpy())
    total = len(targets)
    weights = {cls: total / count for cls, count in counts.items()}
    
    # Normalize
    max_weight = max(weights.values())
    weights = {cls: w / max_weight for cls, w in weights.items()}
    
    return torch.tensor([weights[i] for i in range(len(counts))])

# Usage (train_targets is a 1-D tensor of integer class labels)
class_weights = compute_class_weights(train_targets)
criterion = nn.CrossEntropyLoss(weight=class_weights)

Strategy 2: Focal Loss

# Automatically focuses on hard examples
criterion = FocalLoss(alpha=0.25, gamma=2.0)

Strategy 3: Oversampling/Undersampling

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Oversample minority classes (train_targets is a 1-D tensor of integer class labels)
class_counts = torch.bincount(train_targets)
class_weights = 1.0 / class_counts
sample_weights = class_weights[train_targets]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)

train_loader = DataLoader(train_dataset, sampler=sampler, batch_size=32)

Custom Loss Functions

Sometimes you need a custom loss function tailored to your specific problem. Here's how to implement one:

Basic Custom Loss

import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, predictions, targets):
        # Your custom loss calculation
        loss = torch.mean((predictions - targets) ** 2)
        return loss

# Usage
criterion = CustomLoss()
loss = criterion(predictions, targets)

Advanced Custom Loss: Weighted MSE

class WeightedMSELoss(nn.Module):
    def __init__(self, sample_weights):
        super().__init__()
        self.sample_weights = sample_weights
    
    def forward(self, predictions, targets):
        squared_errors = (predictions - targets) ** 2
        weighted_errors = squared_errors * self.sample_weights.unsqueeze(1)
        return weighted_errors.mean()

# Usage
sample_weights = torch.tensor([1.0, 1.0, 2.0, 5.0])  # Weight important samples more
criterion = WeightedMSELoss(sample_weights)

Combined Loss Functions

class CombinedLoss(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.mse = nn.MSELoss()
        self.l1 = nn.L1Loss()
    
    def forward(self, predictions, targets):
        mse_loss = self.mse(predictions, targets)
        l1_loss = self.l1(predictions, targets)
        return self.alpha * mse_loss + (1 - self.alpha) * l1_loss

# Usage: Combine MSE (focus on large errors) and MAE (robustness)
criterion = CombinedLoss(alpha=0.7)

Practical Tips

1. Monitor Your Loss Curves

import matplotlib.pyplot as plt

# During training, track both losses
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)

# Plot
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')

What to Look For:

  • Both decreasing: Good, model is learning
  • Train decreasing, val increasing: Overfitting
  • Both high and flat: Learning rate too low or wrong loss function
  • Loss spiking: Learning rate too high or batch size too small (noisy gradients)

2. Normalize Your Targets

# For regression: Normalize targets to similar scale as predictions
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1))

# Train with scaled targets
model.train()
predictions = model(X_train)
loss = criterion(predictions, torch.tensor(y_train_scaled, dtype=torch.float32))

# Don't forget to inverse transform predictions!
predictions_original = scaler.inverse_transform(predictions.detach().numpy())

3. Use Appropriate Metrics

Loss is for training, but use task-specific metrics for evaluation:

from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error

# Classification
predictions = model(X_test).argmax(dim=1)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average='weighted')

# Regression
predictions = model(X_test)
mae = mean_absolute_error(y_test, predictions.detach().numpy())

Conclusion

Loss functions are the guiding force behind neural network training, translating the abstract goal of "better predictions" into concrete numerical objectives that optimization algorithms can minimize. This guide covered the six most essential loss functions that handle the vast majority of deep learning tasks:

For Regression:

  • MSE: Your default for clean data
  • MAE: When outliers are present
  • Smooth L1: The robust choice for computer vision tasks

For Classification:

  • BCEWithLogitsLoss: Binary classification made stable
  • CrossEntropyLoss: The multi-class workhorse
  • Focal Loss: Your weapon against severe class imbalance

Remember these key principles:

  • Match the loss to your task: Regression uses MSE/MAE, classification uses CrossEntropy/BCEWithLogits
  • Consider your data distribution: Outliers, class imbalance, and noise all influence loss selection
  • Avoid common mistakes: Never apply softmax before CrossEntropyLoss or sigmoid before BCEWithLogitsLoss; always return raw logits from your model
  • Monitor and adapt: Watch your loss curves, and don't be afraid to switch loss functions if training stalls

Start with the standard choices (MSE for regression, CrossEntropyLoss for classification), then refine based on your specific challenges. The right loss function can mean the difference between a model that struggles to converge and one that achieves state-of-the-art performance.