PyTorch Deep Learning Workflow: From Data to Deployment

Building a deep learning model can feel overwhelming with so many moving parts—datasets, dataloaders, loss functions, optimizers, and training loops. This guide breaks down the entire PyTorch workflow into 7 clear, actionable steps that take you from raw data to a deployed model.

Whether you're training your first neural network or refining your understanding of PyTorch fundamentals, this structured approach will help you build models systematically and avoid common pitfalls.

Hands-on Example: Google Colab Notebook - Image Classification


The 7-Step PyTorch Workflow

Every PyTorch project follows this systematic pipeline:

  1. DATA → Dataset → Dataloaders
  2. Model Definition → Neural network architecture
  3. Loss Function → Measure prediction error
  4. Optimizer → Update weights to minimize loss
  5. Training Loop → Iterate and improve
  6. Evaluation → Validate and test performance
  7. Deployment → Save and use the model
flowchart TD
    A[Problem Definition] --> B[Raw Data]
    B --> C[Dataset<br/>__len__ & __getitem__]
    C --> D[DataLoader<br/>Batching & Shuffling]
    D --> E[Model<br/>nn.Module]
    E --> F[Loss Function]
    F --> G[Optimizer]
    G --> H[Training Loop]
    H --> I[Validation]
    I --> J[Testing]
    J --> K[Saved Model]
    K --> L[Inference / Deployment]


A. DATA

  1. Dataset: PyTorch doesn't iterate through raw NumPy arrays or CSV files directly; it expects tensors exposed through a Dataset abstraction. Raw data might be images, tabular data, time series, audio, etc.

    Dataset must be able to:

    1. Give how many samples it has: __len__
    2. Return one sample at a time: __getitem__

    import torch
    from torch.utils.data import Dataset

    class MyDataset(Dataset):
        def __init__(self, X, Y):
            self.X = torch.tensor(X, dtype=torch.float32)
            self.Y = torch.tensor(Y, dtype=torch.float32)

        def __len__(self):
            return len(self.X)

        def __getitem__(self, index):
            return self.X[index], self.Y[index]
    
    
  2. DataLoaders: The DataLoader is responsible for feeding data efficiently to the model. PyTorch's DataLoader handles batching, which removes the need to manually loop through samples 'n' at a time. It also shuffles samples (for better training), loads data in parallel, and yields batches in a clean loop.

    Note:

    • shuffle=True is used only for training to introduce randomness and improve generalization
    • Validation and test loaders should use shuffle=False for stable, reproducible metrics
    • DataLoader returns CPU tensors; moving data to GPU is handled inside the training loop

    Dataset = how to fetch data

    DataLoader = how to feed data to the model efficiently

    # Conceptually, batching means taking 'n' samples at a time:
    # batch1 = dataset[0:32]   # first 32 samples
    # batch2 = dataset[32:64]
    # ...

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
    
flowchart LR
    A[Raw Data<br/>Files / DB / CSV] --> B[Dataset]
    B -->|One sample| C[DataLoader]
    C -->|Batch of samples| D[Training Loop]

    B:::dataset
    C:::loader

    classDef dataset fill:#e3f2fd,stroke:#1e88e5
    classDef loader fill:#f1f8e9,stroke:#7cb342


Step 1: Preparing Data with Dataset and DataLoader

PyTorch doesn't iterate through raw NumPy arrays or CSV files—it expects tensors wrapped in a Dataset abstraction. This standardized approach works for any data type: images, tabular data, time series, audio, or text.

1.1 Data Splitting: Train, Validation, Test

Before creating datasets, split your raw data into three subsets to prevent data leakage and get reliable performance metrics:

from sklearn.model_selection import train_test_split

# Split: 70% train, 15% validation, 15% test
train_data, temp_data, train_labels, temp_labels = train_test_split(
    X, y, test_size=0.3, random_state=42
)

val_data, test_data, val_labels, test_labels = train_test_split(
    temp_data, temp_labels, test_size=0.5, random_state=42
)

| Split      | Size   | Purpose                                   | When to Use                 |
|------------|--------|-------------------------------------------|-----------------------------|
| Training   | 60-80% | Update model weights                      | Every epoch during training |
| Validation | 10-20% | Monitor overfitting, tune hyperparameters | After each epoch            |
| Test       | 10-20% | Final performance evaluation              | Once at the very end        |

Critical Note: Never use test data during training or hyperparameter tuning—it must remain completely unseen.

1.2 Creating a Custom Dataset

A PyTorch Dataset must implement two essential methods:

  1. __len__ → Returns the total number of samples
  2. __getitem__ → Returns one sample at a given index

Basic Example (Tabular Data):

from torch.utils.data import Dataset
import torch

class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create datasets for each split
train_dataset = TabularDataset(train_data, train_labels)
val_dataset = TabularDataset(val_data, val_labels)
test_dataset = TabularDataset(test_data, test_labels)

Advanced Example (Images with Transforms):

from torchvision import transforms
from PIL import Image

class ImageDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        # Load image
        image = Image.open(self.image_paths[idx]).convert('RGB')
        label = self.labels[idx]
        
        # Apply transforms (resize, normalize, augment)
        if self.transform:
            image = self.transform(image)
        
        return image, label

# Define transforms
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # Data augmentation
    transforms.RandomRotation(10),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

train_dataset = ImageDataset(train_paths, train_labels, transform=train_transform)
val_dataset = ImageDataset(val_paths, val_labels, transform=val_transform)

Why Transforms?

  • Normalization: Speeds up training by scaling pixel values to a standard range
  • Augmentation: Creates variations (flips, rotations) to improve generalization
  • Consistency: Ensures all images have the same dimensions
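
For example, transforms.Normalize applies (x - mean) / std to each channel; here is the same arithmetic in plain NumPy using the ImageNet mean/std values from the transforms above (the fake image data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.uniform(0.0, 1.0, size=(3, 8, 8))     # fake CHW image in [0, 1]
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

normalized = (image - mean) / std                 # per-channel standardization

# The operation is invertible, so no information is lost:
assert np.allclose(normalized * std + mean, image)
```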

1.3 DataLoader: Efficient Batch Processing

The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading automatically.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=32,        # Number of samples per batch
    shuffle=True,         # Randomize order each epoch
    num_workers=4,        # Parallel data loading
    pin_memory=True       # Faster GPU transfer (if using CUDA)
)

val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

Key Parameters Explained:

| Parameter   | Typical Value     | Purpose                               | Notes                                                                  |
|-------------|-------------------|---------------------------------------|------------------------------------------------------------------------|
| batch_size  | 32, 64, 128       | Number of samples processed together  | Larger = more GPU memory, faster epochs, smoother gradients (but may generalize worse) |
| shuffle     | True (train only) | Randomize sample order                | Prevents model from learning order patterns                            |
| num_workers | 4-8               | Parallel worker processes for loading | Set to 0 for debugging; higher = faster loading                        |
| pin_memory  | True (if GPU)     | Speeds up CPU-to-GPU transfer         | Only use when training on GPU                                          |

How to Choose Batch Size:

# Rule of thumb: Start with 32, then increase until GPU memory is ~80% full
# Check GPU memory usage:
# nvidia-smi (in terminal)

# Too large → Out of memory error
# Too small → Slow training, unstable gradients
# Just right → Fast training with stable convergence

Understanding num_workers:

# num_workers=0  → Main process loads data (slower, but good for debugging)
# num_workers=4  → 4 parallel workers load data while GPU trains (faster)
# num_workers=8  → Even faster, but diminishing returns beyond CPU core count

1.4 Common Pitfalls and How to Avoid Them

Mistake 1: Not Converting to Tensors

# Wrong
def __getitem__(self, idx):
    return self.data[idx]  # Returns NumPy array or list

# Correct
def __getitem__(self, idx):
    return torch.tensor(self.data[idx], dtype=torch.float32)

Mistake 2: Data Leakage (Using Test Data During Training)

# Wrong: Using test data to normalize training data
scaler.fit(X_test)  # DON'T DO THIS

# Correct: Fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Mistake 3: Shuffling Validation/Test Data

# Wrong
val_loader = DataLoader(val_dataset, shuffle=True)  # Unnecessary and inconsistent

# Correct
val_loader = DataLoader(val_dataset, shuffle=False)  # Reproducible metrics

Mistake 4: Dimension Mismatches

# Wrong: image stored as (height, width, channels) but Conv2d expects (channels, height, width)
return self.images[idx]  # Shape: (224, 224, 3)

# Correct: Transpose to (channels, height, width)
image = self.images[idx].transpose(2, 0, 1)  # Shape: (3, 224, 224)
# Or use transforms.ToTensor() which handles this automatically

The Complete Data Pipeline

flowchart TB
    A[Raw Data] --> B[Train/Val/Test Split]
    B --> C1[Train Dataset]
    B --> C2[Val Dataset]
    B --> C3[Test Dataset]
    
    C1 --> D1[Train DataLoader<br/>batch_size=32<br/>shuffle=True<br/>num_workers=4]
    C2 --> D2[Val DataLoader<br/>batch_size=32<br/>shuffle=False]
    C3 --> D3[Test DataLoader<br/>batch_size=32<br/>shuffle=False]
    
    D1 --> E[Training Loop]
    D2 --> F[Validation Loop]
    D3 --> G[Final Testing]
    
    style C1 fill:#e3f2fd,stroke:#1e88e5
    style C2 fill:#e3f2fd,stroke:#1e88e5
    style C3 fill:#e3f2fd,stroke:#1e88e5
    style D1 fill:#f1f8e9,stroke:#7cb342
    style D2 fill:#f1f8e9,stroke:#7cb342
    style D3 fill:#f1f8e9,stroke:#7cb342

Remember: Dataset = how to fetch one sample | DataLoader = how to feed batches efficiently
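
Stripped of tensors and parallel workers, the batching behavior described above can be sketched in a few lines of plain Python (a conceptual illustration, not PyTorch's actual implementation):

```python
import random

def batches(samples, batch_size, shuffle=False, seed=0):
    """Conceptual sketch of a DataLoader: optionally shuffle the sample
    order, then yield fixed-size chunks (the last one may be smaller)."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

data = list(range(10))
train_batches = list(batches(data, batch_size=4, shuffle=True))   # like shuffle=True
eval_batches = list(batches(data, batch_size=4, shuffle=False))   # like shuffle=False
```

Every sample still appears exactly once per epoch; shuffling only changes the order in which batches see them.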


Step 2: Building the Model

class MyModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        return self.layers(x)

In PyTorch, all neural networks inherit from nn.Module. This base class handles parameter management, device movement, and training/evaluation modes automatically. Your job is to define the architecture in __init__ and specify how data flows through the model in forward().

2.1 Understanding nn.Module

Every PyTorch model follows this structure:

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # Initialize parent class (REQUIRED)
        # Define layers here
    
    def forward(self, x):
        # Define how data flows through layers
        return output

Key Concepts:

| Component          | Purpose                             | When It Runs                               |
|--------------------|-------------------------------------|--------------------------------------------|
| super().__init__() | Initializes nn.Module functionality | Once during model creation                 |
| __init__           | Defines what layers exist           | Once during model creation                 |
| forward()          | Defines computation graph           | Every time you pass data through the model |

Why inherit from nn.Module?

  • Automatic parameter tracking (.parameters())
  • Easy device management (.to(device))
  • Training/eval mode switching (.train(), .eval())
  • State saving/loading (.state_dict())
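
These conveniences can be seen on a toy module (the Tiny class below is a hypothetical model used only for illustration):

```python
import torch.nn as nn

class Tiny(nn.Module):          # hypothetical model for illustration
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

model = Tiny()
num_params = sum(p.numel() for p in model.parameters())  # 4*2 weights + 2 biases
state_keys = list(model.state_dict().keys())             # parameters tracked by name
model.eval()                                             # switches the training flag off
was_eval = not model.training
model.train()                                            # and back on
```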

2.2 Basic Model Architecture

Simple Feedforward Network (Tabular Data):

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.fc1(x)      # Input → Hidden layer
        x = self.relu(x)     # Apply activation
        x = self.fc2(x)      # Hidden → Output layer
        return x

# Instantiate model
model = SimpleNN(input_size=784, hidden_size=128, output_size=10)

# Check architecture
print(model)
# Output:
# SimpleNN(
#   (fc1): Linear(in_features=784, out_features=128, bias=True)
#   (relu): ReLU()
#   (fc2): Linear(in_features=128, out_features=10, bias=True)
# )

Using Sequential (More Compact):

class SimpleNNSequential(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),                      # Regularization
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )
    
    def forward(self, x):
        return self.model(x)

When to use Sequential vs. Explicit layers:

  • Sequential: Simple linear flow (A → B → C)
  • Explicit layers: Complex connections (skip connections, multiple branches)

2.3 Common Layer Types

| Layer                      | Purpose                 | Typical Use Case               | Input Shape Example        |
|----------------------------|-------------------------|--------------------------------|----------------------------|
| nn.Linear(in, out)         | Fully connected layer   | Tabular data, final classifier | (batch, features)          |
| nn.Conv2d(in, out, kernel) | 2D convolution          | Image feature extraction       | (batch, channels, H, W)    |
| nn.LSTM(in, hidden)        | Recurrent layer         | Sequences, time series         | (seq_len, batch, features) |
| nn.Dropout(p)              | Random neuron disabling | Prevent overfitting            | Any shape                  |
| nn.BatchNorm2d(channels)   | Normalize activations   | Stable training                | (batch, C, H, W)           |
| nn.Embedding(vocab, dim)   | Word vectors            | NLP tasks                      | (batch, seq_len)           |

2.4 Activation Functions

Activation functions introduce non-linearity, allowing networks to learn complex patterns. Without them, stacking layers would be equivalent to a single linear transformation.

📚 Want to dive deeper into activation functions? Read our comprehensive guide: Activation Functions: The Neurons' Decision Makers covering mathematical formulas, advantages/disadvantages, weight initialization strategies, activation-loss pairings, and practical PyTorch examples.
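
That claim is easy to verify numerically: composing two linear (affine) layers without an activation collapses into a single equivalent layer (a NumPy sketch, independent of PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # like Linear(4, 3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # like Linear(3, 2)
x = rng.normal(size=4)

stacked = W2 @ (W1 @ x + b1) + b2                      # two layers, no activation
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)             # one equivalent layer

assert np.allclose(stacked, collapsed)                 # identical outputs
```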

# Common activation functions
activations = {
    'ReLU': nn.ReLU(),           # Most common: max(0, x)
    'LeakyReLU': nn.LeakyReLU(), # Fixes dying ReLU: max(0.01x, x)
    'Tanh': nn.Tanh(),           # Range: [-1, 1]
    'Sigmoid': nn.Sigmoid(),     # Range: [0, 1]
    'Softmax': nn.Softmax(dim=1) # Probability distribution
}

| Activation | Range          | Best For                      | Downsides                                                |
|------------|----------------|-------------------------------|----------------------------------------------------------|
| ReLU       | [0, ∞)         | Hidden layers (most tasks)    | Dying ReLU (neurons get stuck at 0)                      |
| LeakyReLU  | (-∞, ∞)        | When ReLU causes dead neurons | Slightly more computation                                |
| Tanh       | [-1, 1]        | RNNs, normalized outputs      | Vanishing gradient                                       |
| Sigmoid    | [0, 1]         | Binary classification output  | Vanishing gradient                                       |
| Softmax    | [0, 1] (sum=1) | Multi-class output layer      | Don't apply before CrossEntropyLoss (it does so internally) |

Dying ReLU Problem:

# If ReLU input is always negative, gradient becomes 0
# Neuron never updates → "dead neuron"

# Solution: Use LeakyReLU or ensure proper initialization
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.LeakyReLU(0.01),  # Allows small negative gradient
    nn.Linear(50, 10)
)

2.5 Model Architectures for Different Tasks

Convolutional Neural Network (Image Classification):

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Convolutional feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # 3 RGB channels → 32 filters
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # Downsample by 2
            
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        
        # Fully connected classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),  # 128 channels × 4×4 spatial map (assumes 32×32 input)
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

LSTM for Sequence Data (Text, Time Series):

class SequenceModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)        # (batch, seq_len, embedding_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        output = self.fc(hidden[-1])        # Use final hidden state
        return output

Transfer Learning (Using Pretrained Models):

from torchvision import models

class TransferLearningModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        
        # Load pretrained ResNet (the 'weights' API replaces the deprecated pretrained=True)
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        
        # Freeze early layers
        for param in self.backbone.parameters():
            param.requires_grad = False
        
        # Replace final layer
        num_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Linear(num_features, num_classes)
    
    def forward(self, x):
        return self.backbone(x)

2.6 Device Management (CPU vs GPU)

Always move your model to the appropriate device before training:

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move model to device
model = SimpleNN(784, 128, 10)
model = model.to(device)

# Data must also be moved to the same device during training
# This happens in the training loop:
for X_batch, y_batch in train_loader:
    X_batch = X_batch.to(device)
    y_batch = y_batch.to(device)
    # ... training code

Multi-GPU Training:

if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)  # For serious multi-GPU work, prefer DistributedDataParallel
model = model.to(device)

2.7 Model Inspection & Debugging

Count Parameters:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_params = count_parameters(model)
print(f"Total trainable parameters: {total_params:,}")

Check Output Shape:

# Create dummy input to verify dimensions
dummy_input = torch.randn(1, 3, 224, 224).to(device)  # (batch=1, channels=3, H=224, W=224)
output = model(dummy_input)
print(f"Output shape: {output.shape}")  # Should match expected output

Layer-by-Layer Inspection:

# Print each layer's output shape
def print_shapes(model, input_size):
    x = torch.randn(input_size)
    for name, layer in model.named_children():
        x = layer(x)
        print(f"{name}: {x.shape}")

print_shapes(model, (1, 3, 224, 224))

2.8 Common Pitfalls and Solutions

Mistake 1: Forgetting super().__init__()

# Wrong
class MyModel(nn.Module):
    def __init__(self):
        self.fc = nn.Linear(10, 5)  # Missing super().__init__()

# Correct
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # REQUIRED
        self.fc = nn.Linear(10, 5)

Mistake 2: Dimension Mismatches

# Wrong: Output of fc1 (128) doesn't match input of fc2 (256)
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(256, 10)  # Expects 256 inputs!

# Correct: Dimensions must align
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 10)  # Matches fc1 output

Mistake 3: Not Matching Model Input with Data

# Data shape: (batch, 28, 28) - 2D image
# Model expects: (batch, 784) - flattened

# Solution: Flatten in forward()
def forward(self, x):
    x = x.view(x.size(0), -1)  # Flatten: (batch, 28, 28) → (batch, 784)
    x = self.fc1(x)
    return x

Mistake 4: Using Softmax with CrossEntropyLoss

# Wrong: CrossEntropyLoss applies softmax internally
def forward(self, x):
    x = self.fc(x)
    return F.softmax(x, dim=1)  # DON'T DO THIS

# Correct: Return raw logits
def forward(self, x):
    return self.fc(x)  # Let CrossEntropyLoss handle softmax

Mistake 5: Training Mode in Evaluation

# Wrong: Dropout and BatchNorm behave differently in train vs eval mode
predictions = model(test_data)  # May still be in training mode

# Correct: Set to evaluation mode
model.eval()
with torch.no_grad():
    predictions = model(test_data)

Architecture Design Guidelines

flowchart LR
    A[Input Layer] --> B[Hidden Layers]
    B --> C[Output Layer]
    
    B --> D[Gradually Decrease<br/>Width]
    B --> E[Add Dropout<br/>for Regularization]
    B --> F[Use BatchNorm<br/>for Stability]
    
    style A fill:#e3f2fd,stroke:#1e88e5
    style C fill:#e3f2fd,stroke:#1e88e5
    style B fill:#f1f8e9,stroke:#7cb342

Common Patterns:

| Pattern        | Architecture                  | Use Case                          |
|----------------|-------------------------------|-----------------------------------|
| Pyramid        | 784 → 512 → 256 → 128 → 10    | Gradually compress information    |
| Hourglass      | 100 → 50 → 25 → 50 → 100      | Autoencoders, compression         |
| Residual       | Skip connections around blocks | Very deep networks (ResNet)      |
| Wide & Shallow | 100 → 500 → 10                | Simple patterns, fast training    |
| Narrow & Deep  | 100 → 80 → 60 → 40 → 20 → 10  | Complex patterns, risk of overfitting |

Remember: Start simple, add complexity only if needed. A smaller model that generalizes well beats a massive overfitted model.


Step 3: Loss Function

Understanding Loss: Measuring Model Performance

A loss function quantifies the gap between your model's predictions and the actual targets. It converts this error into a single numerical value that the optimizer can minimize.

Core Principle: Lower loss = Better predictions | Higher loss = Worse predictions

📚 Want to dive deeper into loss functions? Read our comprehensive guide: Loss Functions: Measuring Model Performance covering mathematical foundations, detailed explanations of regression and classification losses, handling class imbalance, and implementing custom loss functions.

Common Loss Functions

| Task Type                  | Loss Function                | Purpose                             | Example Use Case                              |
|----------------------------|------------------------------|-------------------------------------|-----------------------------------------------|
| Regression                 | MSELoss (Mean Squared Error) | Penalizes large errors heavily      | House price prediction, temperature forecasting |
| Multi-class Classification | CrossEntropyLoss             | Measures confidence in correct class | Image classification (dog/cat/bird)          |
| Binary Classification      | BCEWithLogitsLoss            | Optimizes binary decisions          | Spam detection, disease diagnosis             |
| Object Detection           | SmoothL1Loss                 | Robust to outliers                  | Bounding box regression                       |

Example Implementation:

# For multi-class classification
criterion = nn.CrossEntropyLoss()

# For regression tasks
criterion = nn.MSELoss()

# For binary classification
criterion = nn.BCEWithLogitsLoss()
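
What CrossEntropyLoss computes for a single sample can be reproduced in a few lines of plain Python: softmax over the raw logits, then the negative log of the probability assigned to the correct class. The logit values below are illustrative:

```python
import math

def cross_entropy(logits, target):
    """Per-sample cross-entropy: softmax over raw logits, then the
    negative log of the probability of the correct class."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])

confident = cross_entropy([4.0, 0.5, 0.1], target=0)   # right class has a high logit
uncertain = cross_entropy([1.0, 1.0, 1.0], target=0)   # no preference at all
```

A confident correct prediction yields a small loss; a uniform guess over three classes yields exactly log(3).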

Step 4: Optimizer

How Models Learn: Weight Updates

The optimizer adjusts model parameters (weights and biases) to minimize the loss function. It determines how the model learns from its mistakes.

Key Relationship:

  • Loss function → "How wrong am I?"
  • Optimizer → "How should I fix it?"

Popular Optimizers

| Optimizer | Best For                           | Pros                                      | Cons                                     |
|-----------|------------------------------------|-------------------------------------------|------------------------------------------|
| SGD       | Large datasets, simple problems    | Memory efficient, well understood         | Can be slow, needs careful tuning        |
| Adam      | Most deep learning tasks           | Adaptive learning rates, fast convergence | More memory, can overfit small datasets  |
| AdamW     | Transformers, modern architectures | Better regularization than Adam           | Slightly more complex                    |
| RMSprop   | Recurrent networks                 | Good for non-stationary objectives        | Less popular than Adam                   |

Example Setup:

import torch.optim as optim

# Adam optimizer (most common starting point)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# SGD with momentum (for large-scale training)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# AdamW (for transformer models)
optimizer = optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)

Understanding Learning Rate

The learning rate (lr) controls how much the model's weights change with each update. It's one of the most critical hyperparameters.

Weight Update Formula

new_weight = old_weight - lr × gradient

Learning Rate Guidelines

| Learning Rate | Typical Range                           | Training Behavior        | Visual Symptoms                          | When to Use                         |
|---------------|-----------------------------------------|--------------------------|------------------------------------------|-------------------------------------|
| Too High      | 0.1+ (Adam), 1.0+ (SGD)                 | Chaotic, unstable jumps  | Loss spikes, NaN values, zigzag pattern  | Never (unless using LR schedulers)  |
| Optimal       | 1e-3 to 1e-4 (Adam), 1e-2 to 1e-1 (SGD) | Smooth, steady improvement | Consistently decreasing loss curve     | Start here, adjust based on results |
| Too Low       | 1e-6 (Adam), 1e-5 (SGD)                 | Glacially slow progress  | Flat loss curve, minimal change per epoch | When fine-tuning pretrained models |

Visual Analogy: Walking Down a Hill

Think of training as finding the lowest point in a valley:

  • High learning rate → Taking giant leaps, might jump over the valley
  • Optimal learning rate → Steady, purposeful steps downward
  • Low learning rate → Tiny baby steps, takes forever to reach the bottom
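
The update rule new_weight = old_weight - lr × gradient, and the effect of the learning rate, can be seen on a toy one-dimensional problem in plain Python (the loss (w - 3)² is an illustrative choice, no PyTorch needed):

```python
# Toy loss: loss(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# A well-chosen learning rate steps steadily toward the minimum at w = 3:
w, lr = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)
    w = w - lr * grad              # new_weight = old_weight - lr * gradient

# An oversized learning rate overshoots further on every step (divergence):
w_bad, lr_bad = 0.0, 1.5
for _ in range(10):
    w_bad = w_bad - lr_bad * 2 * (w_bad - 3)
```

After 50 small steps, w sits right next to the minimum; after only 10 oversized steps, w_bad has leapt thousands of units away from it.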

Step 5: Training Loop

flowchart LR
    A[Input Batch] --> B[Forward Pass]
    B --> C[Compute Loss]
    C --> D[Zero Gradients]
    D --> E[Backward Pass]
    E --> F[Update Weights]
    F --> G{More Batches?}
    G -->|Yes| A
    G -->|No| H[Validation]

Complete Training Loop

def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()  # Set model to training mode
    total_loss = 0
    
    for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
        # 1. Move data to device (GPU/CPU)
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        # 2. Zero gradients from previous iteration
        optimizer.zero_grad()
        
        # 3. Forward pass: compute predictions
        predictions = model(X_batch)
        
        # 4. Compute loss
        loss = criterion(predictions, y_batch)
        
        # 5. Backward pass: compute gradients
        loss.backward()
        
        # 6. Update weights using gradients
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)

Key Operations:

| Operation             | Purpose              | Why It's Needed                                                          |
|-----------------------|----------------------|--------------------------------------------------------------------------|
| optimizer.zero_grad() | Clear old gradients  | Gradients accumulate by default; we need fresh gradients each iteration  |
| loss.backward()       | Compute gradients    | Calculates how each weight contributed to the loss (via chain rule)      |
| optimizer.step()      | Update weights       | Applies the computed gradients to adjust model parameters                |
| model.to(device)      | Move to GPU          | Accelerates computation 10-100× on compatible hardware                   |
| model.train()         | Enable training mode | Activates dropout and batch norm training behavior                       |
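
The need for optimizer.zero_grad() can be demonstrated directly: autograd adds each new gradient to the stored one until it is cleared (a minimal sketch on a single scalar):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
(x * 3).backward()
first = x.grad.item()      # d(3x)/dx = 3
(x * 3).backward()         # without zeroing, the new gradient is ADDED
second = x.grad.item()     # 3 + 3 = 6, not 3
x.grad.zero_()             # what optimizer.zero_grad() does for each parameter
```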

Step 6: Validation and Testing

Three Distinct Phases

flowchart TD
    A[Full Dataset] --> B[Training Set 70%]
    A --> C[Validation Set 15%]
    A --> D[Test Set 15%]
    
    B --> E[Update Weights]
    C --> F[Tune Hyperparameters<br/>Monitor Overfitting]
    D --> G[Final Evaluation<br/>Report Results]
    
    E --> H{Epoch Complete?}
    H -->|Yes| F
    H -->|No| E
    F --> I{Training Complete?}
    I -->|No| E
    I -->|Yes| G

Validation During Training

Purpose: Catch overfitting early and guide hyperparameter tuning

def validate(model, val_loader, criterion, device):
    model.eval()  # Disable dropout, fix batch norm
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():  # Save memory, speed up computation
        for X_val, y_val in val_loader:
            X_val, y_val = X_val.to(device), y_val.to(device)
            
            predictions = model(X_val)
            loss = criterion(predictions, y_val)
            
            total_loss += loss.item()
            
            # For classification: compute accuracy
            _, predicted = torch.max(predictions, 1)
            total += y_val.size(0)
            correct += (predicted == y_val).sum().item()
    
    accuracy = 100 * correct / total
    avg_loss = total_loss / len(val_loader)
    
    return avg_loss, accuracy

Testing After Training

Purpose: Unbiased evaluation of final model performance

def test(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    all_predictions = []
    all_targets = []
    
    with torch.no_grad():
        for X_test, y_test in test_loader:
            X_test, y_test = X_test.to(device), y_test.to(device)
            
            predictions = model(X_test)
            _, predicted = torch.max(predictions, 1)
            
            all_predictions.extend(predicted.cpu().numpy())
            all_targets.extend(y_test.cpu().numpy())
            
            total += y_test.size(0)
            correct += (predicted == y_test).sum().item()
    
    # Compute comprehensive metrics
    from sklearn.metrics import classification_report
    print(classification_report(all_targets, all_predictions))
    
    return 100 * correct / total

| Phase      | When             | Purpose                                   | Affects Training?                                      |
|------------|------------------|-------------------------------------------|--------------------------------------------------------|
| Training   | Every epoch      | Update weights to minimize loss           | Yes                                                    |
| Validation | After each epoch | Monitor overfitting, tune hyperparameters | Indirectly (early stopping, learning rate adjustment)  |
| Testing    | Once, at the end | Report final performance                  | No                                                     |

flowchart LR
    D[Dataset] --> T[Training Loader<br/>shuffle=True]
    D --> V[Validation Loader<br/>shuffle=False]
    D --> TS[Test Loader<br/>shuffle=False]

    T --> M[Model Training]
    V --> E[Monitor Performance<br/>During Training]
    TS --> F[Final Evaluation<br/>After Training]


Step 7: Saving the Model and Inference

After training, we can save the model's weights so we can reload the model later, avoid retraining, and deploy it.

# ✓ RECOMMENDED: Save only weights (portable, version-safe)
torch.save(model.state_dict(), 'model_weights.pth')

# Load weights later
model = YourModelClass()  # Must define architecture first
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

# ✓ GOOD: Save checkpoint with training state
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    'accuracy': accuracy
}
torch.save(checkpoint, 'checkpoint.pth')

# ✗ NOT RECOMMENDED: Save entire model (fragile, version-dependent)
torch.save(model, 'entire_model.pth')  # Breaks if PyTorch version changes
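
To resume training later from the checkpoint format above, rebuild the model and optimizer, then restore both state dicts (a sketch using a stand-in nn.Linear model and an in-memory buffer in place of 'checkpoint.pth'):

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in model/optimizer for illustration
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

checkpoint = {
    'epoch': 5,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}
buffer = io.BytesIO()               # stands in for a file on disk
torch.save(checkpoint, buffer)
buffer.seek(0)

# Later: rebuild the same architecture, then restore both states
model2 = nn.Linear(4, 2)
optimizer2 = optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load(buffer)
model2.load_state_dict(ckpt['model_state_dict'])
optimizer2.load_state_dict(ckpt['optimizer_state_dict'])
start_epoch = ckpt['epoch'] + 1     # resume where training left off
```

Restoring the optimizer state matters for optimizers like Adam, whose per-parameter moment estimates would otherwise restart from zero.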

Model Saving Strategies

flowchart TD
    A[Training Complete] --> B{What to Save?}
    
    B -->|state_dict only| C[✓ Portable<br/>✓ Version-safe<br/>✓ Recommended]
    B -->|Entire model| D[✗ Version-dependent<br/>✗ Fragile<br/>✗ Not recommended]
    B -->|Full checkpoint| E[✓ Resume training<br/>✓ Track progress<br/>✓ Best for experiments]
    
    C --> F[Production Deployment]
    E --> G[Research & Development]
    D --> H[Legacy Code Only]

Inference Mode

# Prepare model for production
model.eval()  # Disable dropout, batch norm training mode

# Make predictions
with torch.no_grad():  # Disable gradient computation
    input_tensor = preprocess(input_data)
    input_tensor = input_tensor.to(device)
    
    output = model(input_tensor)
    prediction = torch.argmax(output, dim=1)

print(f"Predicted class: {prediction.item()}")

Quick Reference: Full Training Script

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YourModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 50
best_val_loss = float('inf')

for epoch in range(num_epochs):
    # Training phase
    train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    
    # Validation phase
    val_loss, val_acc = validate(model, val_loader, criterion, device)
    
    print(f'Epoch {epoch+1}/{num_epochs}')
    print(f'Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%')
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')

# Final testing
test_acc = test(model, test_loader, device)
print(f'Test Accuracy: {test_acc:.2f}%')