PyTorch Deep Learning Workflow: From Data to Deployment
Building a deep learning model can feel overwhelming with so many moving parts—datasets, dataloaders, loss functions, optimizers, and training loops. This guide breaks down the entire PyTorch workflow into 7 clear, actionable steps that take you from raw data to a deployed model.
Whether you're training your first neural network or refining your understanding of PyTorch fundamentals, this structured approach will help you build models systematically and avoid common pitfalls.
Hands-on Example: Google Colab Notebook - Image Classification
The 7-Step PyTorch Workflow
Every PyTorch project follows this systematic pipeline:
1. Data → Dataset → DataLoaders
2. Model Definition → Neural network architecture
3. Loss Function → Measure prediction error
4. Optimizer → Update weights to minimize loss
5. Training Loop → Iterate and improve
6. Evaluation → Validate and test performance
7. Deployment → Save and use the model
flowchart TD
A[Problem Definition] --> B[Raw Data]
B --> C[Dataset<br/>__len__ & __getitem__]
C --> D[DataLoader<br/>Batching & Shuffling]
D --> E[Model<br/>nn.Module]
E --> F[Loss Function]
F --> G[Optimizer]
G --> H[Training Loop]
H --> I[Validation]
I --> J[Testing]
J --> K[Saved Model]
K --> L[Inference / Deployment]
Step 1: Preparing Data with Dataset and DataLoader
PyTorch doesn't iterate through raw NumPy arrays or CSV files—it expects tensors wrapped in a Dataset abstraction. This standardized approach works for any data type: images, tabular data, time series, audio, or text.
1.1 Data Splitting: Train, Validation, Test
Before creating datasets, split your raw data into three subsets to prevent data leakage and get reliable performance metrics:
from sklearn.model_selection import train_test_split
# Split: 70% train, 15% validation, 15% test
train_data, temp_data, train_labels, temp_labels = train_test_split(
X, y, test_size=0.3, random_state=42
)
val_data, test_data, val_labels, test_labels = train_test_split(
temp_data, temp_labels, test_size=0.5, random_state=42
)
| Split | Size | Purpose | When to Use |
|---|---|---|---|
| Training | 60-80% | Update model weights | Every epoch during training |
| Validation | 10-20% | Monitor overfitting, tune hyperparameters | After each epoch |
| Test | 10-20% | Final performance evaluation | Once at the very end |
Critical Note: Never use test data during training or hyperparameter tuning—it must remain completely unseen.
1.2 Creating a Custom Dataset
A PyTorch Dataset must implement two essential methods:
- `__len__` → Returns the total number of samples
- `__getitem__` → Returns one sample at a given index
Basic Example (Tabular Data):
from torch.utils.data import Dataset
import torch
class TabularDataset(Dataset):
def __init__(self, X, y):
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.float32)
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
# Create datasets for each split
train_dataset = TabularDataset(train_data, train_labels)
val_dataset = TabularDataset(val_data, val_labels)
test_dataset = TabularDataset(test_data, test_labels)
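A quick sanity check confirms the two required methods behave as expected. This is a minimal sketch using tiny hand-made lists in place of the real splits:

```python
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# three samples with two features each (toy data for illustration)
ds = TabularDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 1.0, 0.0])
print(len(ds))       # 3  (__len__)
x0, y0 = ds[0]
print(x0.shape, y0)  # torch.Size([2]) tensor(0.)  (__getitem__)
```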
Advanced Example (Images with Transforms):
from torchvision import transforms
from PIL import Image
class ImageDataset(Dataset):
def __init__(self, image_paths, labels, transform=None):
self.image_paths = image_paths
self.labels = labels
self.transform = transform
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
# Load image
image = Image.open(self.image_paths[idx]).convert('RGB')
label = self.labels[idx]
# Apply transforms (resize, normalize, augment)
if self.transform:
image = self.transform(image)
return image, label
# Define transforms
train_transform = transforms.Compose([
transforms.RandomHorizontalFlip(), # Data augmentation
transforms.RandomRotation(10),
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
val_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
train_dataset = ImageDataset(train_paths, train_labels, transform=train_transform)
val_dataset = ImageDataset(val_paths, val_labels, transform=val_transform)
Why Transforms?
- Normalization: Speeds up training by scaling pixel values to a standard range
- Augmentation: Creates variations (flips, rotations) to improve generalization
- Consistency: Ensures all images have the same dimensions
1.3 DataLoader: Efficient Batch Processing
The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading automatically.
from torch.utils.data import DataLoader
train_loader = DataLoader(
train_dataset,
batch_size=32, # Number of samples per batch
shuffle=True, # Randomize order each epoch
num_workers=4, # Parallel data loading
pin_memory=True # Faster GPU transfer (if using CUDA)
)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
Key Parameters Explained:
| Parameter | Typical Value | Purpose | Notes |
|---|---|---|---|
| `batch_size` | 32, 64, 128 | Number of samples processed together | Larger = more GPU memory, faster epochs, may generalize worse |
| `shuffle` | `True` (train only) | Randomize sample order | Prevents model from learning order patterns |
| `num_workers` | 4-8 | Parallel worker processes for loading | Set to 0 for debugging; higher = faster loading, up to CPU core count |
| `pin_memory` | `True` (if GPU) | Speeds up CPU-to-GPU transfer | Only use when training on GPU |
How to Choose Batch Size:
# Rule of thumb: Start with 32, then increase until GPU memory is ~80% full
# Check GPU memory usage:
# nvidia-smi (in terminal)
# Too large → Out of memory error
# Too small → Slow training, unstable gradients
# Just right → Fast training with stable convergence
Understanding num_workers:
# num_workers=0 → Main process loads data (slower, but good for debugging)
# num_workers=4 → 4 parallel workers load data while GPU trains (faster)
# num_workers=8 → Even faster, but diminishing returns beyond CPU core count
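To see batching in action, the sketch below wraps synthetic tensors (100 samples, 8 features, shapes chosen only for illustration) in a built-in `TensorDataset` and iterates one batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic data: 100 samples × 8 features, binary labels
X = torch.randn(100, 8)
y = torch.randint(0, 2, (100,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([32, 8]) torch.Size([32])
print(len(loader))         # 4 batches: 32 + 32 + 32 + 4
```

Note the last batch is smaller (4 samples) because 100 is not divisible by 32; pass `drop_last=True` if the model requires fixed-size batches.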
1.4 Common Pitfalls and How to Avoid Them
Mistake 1: Not Converting to Tensors
# Wrong
def __getitem__(self, idx):
return self.data[idx] # Returns NumPy array or list
# Correct
def __getitem__(self, idx):
return torch.tensor(self.data[idx], dtype=torch.float32)
Mistake 2: Data Leakage (Using Test Data During Training)
# Wrong: Using test data to normalize training data
scaler.fit(X_test) # DON'T DO THIS
# Correct: Fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
Mistake 3: Shuffling Validation/Test Data
# Wrong
val_loader = DataLoader(val_dataset, shuffle=True) # Unnecessary and inconsistent
# Correct
val_loader = DataLoader(val_dataset, shuffle=False) # Reproducible metrics
Mistake 4: Dimension Mismatches
# Wrong: Shape (batch, features) but model expects (batch, channels, height, width)
return self.images[idx] # Shape: (224, 224, 3)
# Correct: Transpose to (channels, height, width)
image = self.images[idx].transpose(2, 0, 1) # Shape: (3, 224, 224)
# Or use transforms.ToTensor() which handles this automatically
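For tensors (rather than NumPy arrays), the equivalent of the transpose above is `permute`, a minimal sketch with a random image-shaped tensor:

```python
import torch

# HWC layout (common when loading images) → CHW layout (what PyTorch conv layers expect)
img_hwc = torch.rand(224, 224, 3)
img_chw = img_hwc.permute(2, 0, 1)
print(img_chw.shape)  # torch.Size([3, 224, 224])
```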
The Complete Data Pipeline
flowchart TB
A[Raw Data] --> B[Train/Val/Test Split]
B --> C1[Train Dataset]
B --> C2[Val Dataset]
B --> C3[Test Dataset]
C1 --> D1[Train DataLoader<br/>batch_size=32<br/>shuffle=True<br/>num_workers=4]
C2 --> D2[Val DataLoader<br/>batch_size=32<br/>shuffle=False]
C3 --> D3[Test DataLoader<br/>batch_size=32<br/>shuffle=False]
D1 --> E[Training Loop]
D2 --> F[Validation Loop]
D3 --> G[Final Testing]
style C1 fill:#e3f2fd,stroke:#1e88e5
style C2 fill:#e3f2fd,stroke:#1e88e5
style C3 fill:#e3f2fd,stroke:#1e88e5
style D1 fill:#f1f8e9,stroke:#7cb342
style D2 fill:#f1f8e9,stroke:#7cb342
style D3 fill:#f1f8e9,stroke:#7cb342
Remember: Dataset = how to fetch one sample | DataLoader = how to feed batches efficiently
Step 2: Building the Model
class MyModel(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.layers(x)
In PyTorch, all neural networks inherit from nn.Module. This base class handles parameter management, device movement, and training/evaluation modes automatically. Your job is to define the architecture in __init__ and specify how data flows through the model in forward().
2.1 Understanding nn.Module
Every PyTorch model follows this structure:
import torch.nn as nn
class MyModel(nn.Module):
def __init__(self):
super().__init__() # Initialize parent class (REQUIRED)
# Define layers here
def forward(self, x):
# Define how data flows through layers
return output
Key Concepts:
| Component | Purpose | When It Runs |
|---|---|---|
| `super().__init__()` | Initializes `nn.Module` functionality | Once, during model creation |
| `__init__` | Defines what layers exist | Once, during model creation |
| `forward()` | Defines the computation graph | Every time you pass data through the model |
Why inherit from nn.Module?
- Automatic parameter tracking (`.parameters()`)
- Easy device management (`.to(device)`)
- Training/eval mode switching (`.train()`, `.eval()`)
- State saving/loading (`.state_dict()`)
2.2 Basic Model Architecture
Simple Feedforward Network (Tabular Data):
class SimpleNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.fc1 = nn.Linear(input_size, hidden_size)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_size, output_size)
def forward(self, x):
x = self.fc1(x) # Input → Hidden layer
x = self.relu(x) # Apply activation
x = self.fc2(x) # Hidden → Output layer
return x
# Instantiate model
model = SimpleNN(input_size=784, hidden_size=128, output_size=10)
# Check architecture
print(model)
# Output:
# SimpleNN(
# (fc1): Linear(in_features=784, out_features=128, bias=True)
# (relu): ReLU()
# (fc2): Linear(in_features=128, out_features=10, bias=True)
# )
Using Sequential (More Compact):
class SimpleNNSequential(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.model = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Dropout(0.2), # Regularization
nn.Linear(hidden_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.model(x)
When to use Sequential vs. Explicit layers:
- Sequential: Simple linear flow (A → B → C)
- Explicit layers: Complex connections (skip connections, multiple branches)
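As an example of a flow that `Sequential` cannot express, here is a minimal sketch of a block with a skip connection (the `ResidualBlock` name and dimensions are illustrative, not from the article):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.fc1(x))
        out = self.fc2(out)
        # the skip connection (out + x) needs an explicit forward();
        # Sequential can only chain A → B → C
        return self.relu(out + x)

block = ResidualBlock(16)
print(block(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```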
2.3 Common Layer Types
| Layer | Purpose | Typical Use Case | Input Shape Example |
|---|---|---|---|
| `nn.Linear(in, out)` | Fully connected layer | Tabular data, final classifier | (batch, features) |
| `nn.Conv2d(in, out, kernel)` | 2D convolution | Image feature extraction | (batch, channels, H, W) |
| `nn.LSTM(in, hidden)` | Recurrent layer | Sequences, time series | (seq_len, batch, features) |
| `nn.Dropout(p)` | Random neuron disabling | Prevent overfitting | Any shape |
| `nn.BatchNorm2d(channels)` | Normalize activations | Stable training | (batch, C, H, W) |
| `nn.Embedding(vocab, dim)` | Word vectors | NLP tasks | (batch, seq_len) |
2.4 Activation Functions
Activation functions introduce non-linearity, allowing networks to learn complex patterns. Without them, stacking layers would be equivalent to a single linear transformation.
📚 Want to dive deeper into activation functions? Read our comprehensive guide: Activation Functions: The Neurons' Decision Makers covering mathematical formulas, advantages/disadvantages, weight initialization strategies, activation-loss pairings, and practical PyTorch examples.
# Common activation functions
activations = {
'ReLU': nn.ReLU(), # Most common: max(0, x)
'LeakyReLU': nn.LeakyReLU(), # Fixes dying ReLU: max(0.01x, x)
'Tanh': nn.Tanh(), # Range: [-1, 1]
'Sigmoid': nn.Sigmoid(), # Range: [0, 1]
'Softmax': nn.Softmax(dim=1) # Probability distribution
}
| Activation | Range | Best For | Downsides |
|---|---|---|---|
| ReLU | [0, ∞) |
Hidden layers (most tasks) | Dying ReLU (neurons get stuck at 0) |
| LeakyReLU | (-∞, ∞) |
When ReLU causes dead neurons | Slightly more computation |
| Tanh | [-1, 1] |
RNNs, normalized outputs | Vanishing gradient |
| Sigmoid | [0, 1] |
Binary classification output | Vanishing gradient |
| Softmax | [0, 1] (sum=1) |
Multi-class output layer | Use with CrossEntropyLoss |
Dying ReLU Problem:
# If ReLU input is always negative, gradient becomes 0
# Neuron never updates → "dead neuron"
# Solution: Use LeakyReLU or ensure proper initialization
model = nn.Sequential(
nn.Linear(100, 50),
nn.LeakyReLU(0.01), # Allows small negative gradient
nn.Linear(50, 10)
)
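The difference between the two activations on negative inputs is easy to verify directly; LeakyReLU keeps a small gradient path alive instead of flattening everything below zero:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(nn.ReLU()(x))           # tensor([0., 0., 0., 1.])
print(nn.LeakyReLU(0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.0000])
```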
2.5 Model Architectures for Different Tasks
Convolutional Neural Network (Image Classification):
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
# Convolutional feature extractor
self.features = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1), # 3 RGB channels → 32 filters
nn.ReLU(),
nn.MaxPool2d(2, 2), # Downsample by 2
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# Fully connected classifier
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 4 * 4, 512), # Depends on input size
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
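The `128 * 4 * 4` in the classifier follows from the input size. Assuming 32×32 RGB inputs (e.g. CIFAR-10), each `MaxPool2d(2)` halves the spatial size: 32 → 16 → 8 → 4. A minimal sketch of just the feature extractor confirms this:

```python
import torch
import torch.nn as nn

# the feature extractor above, condensed; assumes 32×32 RGB input (e.g. CIFAR-10)
features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
out = features(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 128, 4, 4]) → Flatten yields 128 * 4 * 4 = 2048 features
```

For a different input size, run this check first and adjust the first `nn.Linear` accordingly.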
LSTM for Sequence Data (Text, Time Series):
class SequenceModel(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
lstm_out, (hidden, cell) = self.lstm(embedded)
output = self.fc(hidden[-1]) # Use final hidden state
return output
Transfer Learning (Using Pretrained Models):
from torchvision import models
class TransferLearningModel(nn.Module):
def __init__(self, num_classes):
super().__init__()
        # Load pretrained ResNet (torchvision >= 0.13; replaces the deprecated pretrained=True)
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Freeze early layers
for param in self.backbone.parameters():
param.requires_grad = False
# Replace final layer
num_features = self.backbone.fc.in_features
self.backbone.fc = nn.Linear(num_features, num_classes)
def forward(self, x):
return self.backbone(x)
2.6 Device Management (CPU vs GPU)
Always move your model to the appropriate device before training:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Move model to device
model = SimpleNN(784, 128, 10)
model = model.to(device)
# Data must also be moved to the same device during training
# This happens in the training loop:
for X_batch, y_batch in train_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
# ... training code
Multi-GPU Training:
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)  # simple; DistributedDataParallel is preferred for serious multi-GPU work
model = model.to(device)
2.7 Model Inspection & Debugging
Count Parameters:
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = count_parameters(model)
print(f"Total trainable parameters: {total_params:,}")
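On a single `nn.Linear(10, 5)` the count is easy to verify by hand: a 10×5 weight matrix plus a bias of 5 gives 55 parameters.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

tiny = nn.Linear(10, 5)        # weights: 10 × 5 = 50, bias: 5
print(count_parameters(tiny))  # 55
```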
Check Output Shape:
# Create dummy input to verify dimensions
dummy_input = torch.randn(1, 3, 224, 224).to(device) # (batch=1, channels=3, H=224, W=224)
output = model(dummy_input)
print(f"Output shape: {output.shape}") # Should match expected output
Layer-by-Layer Inspection:
# Print each layer's output shape
def print_shapes(model, input_size):
x = torch.randn(input_size)
for name, layer in model.named_children():
x = layer(x)
print(f"{name}: {x.shape}")
print_shapes(model, (1, 3, 224, 224))
2.8 Common Pitfalls and Solutions
Mistake 1: Forgetting super().__init__()
# Wrong
class MyModel(nn.Module):
def __init__(self):
self.fc = nn.Linear(10, 5) # Missing super().__init__()
# Correct
class MyModel(nn.Module):
def __init__(self):
super().__init__() # REQUIRED
self.fc = nn.Linear(10, 5)
Mistake 2: Dimension Mismatches
# Wrong: Output of fc1 (128) doesn't match input of fc2 (256)
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(256, 10) # Expects 256 inputs!
# Correct: Dimensions must align
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 10) # Matches fc1 output
Mistake 3: Not Matching Model Input with Data
# Data shape: (batch, 28, 28) - 2D image
# Model expects: (batch, 784) - flattened
# Solution: Flatten in forward()
def forward(self, x):
x = x.view(x.size(0), -1) # Flatten: (batch, 28, 28) → (batch, 784)
x = self.fc1(x)
return x
Mistake 4: Using Softmax with CrossEntropyLoss
# Wrong: CrossEntropyLoss applies softmax internally
def forward(self, x):
x = self.fc(x)
return F.softmax(x, dim=1) # DON'T DO THIS
# Correct: Return raw logits
def forward(self, x):
return self.fc(x) # Let CrossEntropyLoss handle softmax
Mistake 5: Training Mode in Evaluation
# Wrong: Dropout and BatchNorm behave differently in train vs eval mode
predictions = model(test_data) # May still be in training mode
# Correct: Set to evaluation mode
model.eval()
with torch.no_grad():
predictions = model(test_data)
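The mode difference is visible with dropout alone, a minimal sketch on a tensor of ones:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # random: roughly half the entries zeroed, survivors scaled to 1/(1-p) = 2.0

drop.eval()
print(drop(x))  # identity: all ones — dropout is disabled in eval mode
```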
Architecture Design Guidelines
flowchart LR
A[Input Layer] --> B[Hidden Layers]
B --> C[Output Layer]
B --> D[Gradually Decrease<br/>Width]
B --> E[Add Dropout<br/>for Regularization]
B --> F[Use BatchNorm<br/>for Stability]
style A fill:#e3f2fd,stroke:#1e88e5
style C fill:#e3f2fd,stroke:#1e88e5
style B fill:#f1f8e9,stroke:#7cb342
Common Patterns:
| Pattern | Architecture | Use Case |
|---|---|---|
| Pyramid | 784 → 512 → 256 → 128 → 10 | Gradually compress information |
| Hourglass | 100 → 50 → 25 → 50 → 100 | Autoencoders, compression |
| Residual | Skip connections around blocks | Very deep networks (ResNet) |
| Wide & Shallow | 100 → 500 → 10 | Simple patterns, fast training |
| Narrow & Deep | 100 → 80 → 60 → 40 → 20 → 10 | Complex patterns, risk overfitting |
Remember: Start simple, add complexity only if needed. A smaller model that generalizes well beats a massive overfitted model.
Step 3: Loss Function
Understanding Loss: Measuring Model Performance
A loss function quantifies the gap between your model's predictions and the actual targets. It converts this error into a single numerical value that the optimizer can minimize.
Core Principle: Lower loss = Better predictions | Higher loss = Worse predictions
📚 Want to dive deeper into loss functions? Read our comprehensive guide: Loss Functions: Measuring Model Performance covering mathematical foundations, detailed explanations of regression and classification losses, handling class imbalance, and implementing custom loss functions.
Common Loss Functions
| Task Type | Loss Function | Purpose | Example Use Case |
|---|---|---|---|
| Regression | `MSELoss` (Mean Squared Error) | Penalizes large errors heavily | House price prediction, temperature forecasting |
| Multi-class Classification | `CrossEntropyLoss` | Measures confidence in correct class | Image classification (dog/cat/bird) |
| Binary Classification | `BCEWithLogitsLoss` | Optimizes binary decisions | Spam detection, disease diagnosis |
| Object Detection | `SmoothL1Loss` | Robust to outliers | Bounding box regression |
Example Implementation:
# For multi-class classification
criterion = nn.CrossEntropyLoss()
# For regression tasks
criterion = nn.MSELoss()
# For binary classification
criterion = nn.BCEWithLogitsLoss()
Step 4: Optimizer
How Models Learn: Weight Updates
The optimizer adjusts model parameters (weights and biases) to minimize the loss function. It determines how the model learns from its mistakes.
Key Relationship:
- Loss function → "How wrong am I?"
- Optimizer → "How should I fix it?"
Popular Optimizers
| Optimizer | Best For | Pros | Cons |
|---|---|---|---|
| SGD | Large datasets, simple problems | Memory efficient, well-understood | Can be slow, needs careful tuning |
| Adam | Most deep learning tasks | Adaptive learning rates, fast convergence | More memory, can overfit small datasets |
| AdamW | Transformers, modern architectures | Better regularization than Adam | Slightly more complex |
| RMSprop | Recurrent networks | Good for non-stationary objectives | Less popular than Adam |
Example Setup:
import torch.optim as optim
# Adam optimizer (most common starting point)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# SGD with momentum (for large-scale training)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# AdamW (for transformer models)
optimizer = optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)
Understanding Learning Rate
The learning rate (lr) controls how much the model's weights change with each update. It's one of the most critical hyperparameters.
Weight Update Formula
new_weight = old_weight - lr × gradient
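The formula can be executed by hand on a toy loss, here (w − 3)², assuming lr = 0.1:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 3) ** 2
loss.backward()            # dloss/dw = 2(w - 3) = -6
with torch.no_grad():
    w -= 0.1 * w.grad      # new_weight = old_weight - lr × gradient
print(w.item())            # 0.6 — one step toward the minimum at w = 3
```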
Learning Rate Guidelines
| Learning Rate | Typical Range | Training Behavior | Visual Symptoms | When to Use |
|---|---|---|---|---|
| Too High | 0.1+ (Adam), 1.0+ (SGD) | Chaotic, unstable jumps | Loss spikes, NaN values, zigzag pattern | Never (unless using lr schedulers) |
| Optimal | 1e-3 to 1e-4 (Adam), 1e-2 to 1e-1 (SGD) | Smooth, steady improvement | Consistently decreasing loss curve | Start here, adjust based on results |
| Too Low | 1e-6 (Adam), 1e-5 (SGD) | Glacially slow progress | Flat loss curve, minimal change per epoch | When fine-tuning pretrained models |
Visual Analogy: Walking Down a Hill
Think of training as finding the lowest point in a valley:
- High learning rate → Taking giant leaps, might jump over the valley
- Optimal learning rate → Steady, purposeful steps downward
- Low learning rate → Tiny baby steps, takes forever to reach the bottom
Step 5: Training Loop
flowchart LR
A[Input Batch] --> B[Forward Pass]
B --> C[Compute Loss]
C --> D[Zero Gradients]
D --> E[Backward Pass]
E --> F[Update Weights]
F --> G{More Batches?}
G -->|Yes| A
G -->|No| H[Validation]
Complete Training Loop
def train_epoch(model, train_loader, criterion, optimizer, device):
model.train() # Set model to training mode
total_loss = 0
for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
# 1. Move data to device (GPU/CPU)
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
# 2. Zero gradients from previous iteration
optimizer.zero_grad()
# 3. Forward pass: compute predictions
predictions = model(X_batch)
# 4. Compute loss
loss = criterion(predictions, y_batch)
# 5. Backward pass: compute gradients
loss.backward()
# 6. Update weights using gradients
optimizer.step()
total_loss += loss.item()
return total_loss / len(train_loader)
Key Operations:
| Operation | Purpose | Why It's Needed |
|---|---|---|
| `optimizer.zero_grad()` | Clear old gradients | Gradients accumulate by default; we need fresh gradients each iteration |
| `loss.backward()` | Compute gradients | Calculates how each weight contributed to the loss (via the chain rule) |
| `optimizer.step()` | Update weights | Applies the computed gradients to adjust model parameters |
| `model.to(device)` | Move to GPU | Accelerates computation 10-100× on compatible hardware |
| `model.train()` | Enable training mode | Activates dropout and batch norm training behavior |
Step 6: Validation and Testing
Three Distinct Phases
flowchart TD
A[Full Dataset] --> B[Training Set 70%]
A --> C[Validation Set 15%]
A --> D[Test Set 15%]
B --> E[Update Weights]
C --> F[Tune Hyperparameters<br/>Monitor Overfitting]
D --> G[Final Evaluation<br/>Report Results]
E --> H{Epoch Complete?}
H -->|Yes| F
H -->|No| E
F --> I{Training Complete?}
I -->|No| E
I -->|Yes| G
Validation During Training
Purpose: Catch overfitting early and guide hyperparameter tuning
def validate(model, val_loader, criterion, device):
model.eval() # Disable dropout, fix batch norm
total_loss = 0
correct = 0
total = 0
with torch.no_grad(): # Save memory, speed up computation
for X_val, y_val in val_loader:
X_val, y_val = X_val.to(device), y_val.to(device)
predictions = model(X_val)
loss = criterion(predictions, y_val)
total_loss += loss.item()
# For classification: compute accuracy
_, predicted = torch.max(predictions, 1)
total += y_val.size(0)
correct += (predicted == y_val).sum().item()
accuracy = 100 * correct / total
avg_loss = total_loss / len(val_loader)
return avg_loss, accuracy
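The accuracy bookkeeping inside `validate` relies on `torch.max` over the class dimension; a toy batch makes the mechanics explicit:

```python
import torch

# three samples, three classes (illustrative logits)
logits = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.2, 0.4, 3.0]])
_, predicted = torch.max(logits, 1)   # index of the highest score per row
targets = torch.tensor([1, 0, 0])
correct = (predicted == targets).sum().item()
print(predicted.tolist(), correct)    # [1, 0, 2] 2 → 2 of 3 correct
```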
Testing After Training
Purpose: Unbiased evaluation of final model performance
def test(model, test_loader, device):
model.eval()
correct = 0
total = 0
all_predictions = []
all_targets = []
with torch.no_grad():
for X_test, y_test in test_loader:
X_test, y_test = X_test.to(device), y_test.to(device)
predictions = model(X_test)
_, predicted = torch.max(predictions, 1)
all_predictions.extend(predicted.cpu().numpy())
all_targets.extend(y_test.cpu().numpy())
total += y_test.size(0)
correct += (predicted == y_test).sum().item()
# Compute comprehensive metrics
from sklearn.metrics import classification_report
print(classification_report(all_targets, all_predictions))
return 100 * correct / total
| Phase | When | Purpose | Affects Training? |
|---|---|---|---|
| Training | Every epoch | Update weights to minimize loss | Yes |
| Validation | After each epoch | Monitor overfitting, tune hyperparameters | Indirectly (early stopping, learning rate adjustment) |
| Testing | Once, at the end | Report final performance | No |
flowchart LR
D[Dataset] --> T[Training Loader<br/>shuffle=True]
D --> V[Validation Loader<br/>shuffle=False]
D --> TS[Test Loader<br/>shuffle=False]
T --> M[Model Training]
V --> E[Monitor Performance<br/>During Training]
TS --> F[Final Evaluation<br/>After Training]
Step 7: Saving the Model and Inference
After training, save the model's weights so you can reload the model later, avoid retraining, and deploy it.
# ✓ RECOMMENDED: Save only weights (portable, version-safe)
torch.save(model.state_dict(), 'model_weights.pth')
# Load weights later
model = YourModelClass() # Must define architecture first
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
# ✓ GOOD: Save checkpoint with training state
checkpoint = {
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'accuracy': accuracy
}
torch.save(checkpoint, 'checkpoint.pth')
# ✗ NOT RECOMMENDED: Save entire model (fragile, version-dependent)
torch.save(model, 'entire_model.pth') # Breaks if PyTorch version changes
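A round-trip sanity check (sketched here with a bare `nn.Linear` and a temp file in place of a real model and path) confirms that a reloaded `state_dict` reproduces the original outputs exactly:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'model_weights.pth')
torch.save(model.state_dict(), path)

reloaded = nn.Linear(4, 2)                  # must construct the same architecture first
reloaded.load_state_dict(torch.load(path))
reloaded.eval()

x = torch.randn(1, 4)
with torch.no_grad():
    print(torch.equal(model(x), reloaded(x)))  # True: identical weights → identical outputs
```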
Model Saving Strategies
flowchart TD
A[Training Complete] --> B{What to Save?}
B -->|state_dict only| C[✓ Portable<br/>✓ Version-safe<br/>✓ Recommended]
B -->|Entire model| D[✗ Version-dependent<br/>✗ Fragile<br/>✗ Not recommended]
B -->|Full checkpoint| E[✓ Resume training<br/>✓ Track progress<br/>✓ Best for experiments]
C --> F[Production Deployment]
E --> G[Research & Development]
D --> H[Legacy Code Only]
Inference Mode
# Prepare model for production
model.eval() # Disable dropout, batch norm training mode
# Make predictions
with torch.no_grad(): # Disable gradient computation
input_tensor = preprocess(input_data)
input_tensor = input_tensor.to(device)
output = model(input_tensor)
prediction = torch.argmax(output, dim=1)
print(f"Predicted class: {prediction.item()}")
Quick Reference: Full Training Script
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YourModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 50
best_val_loss = float('inf')
for epoch in range(num_epochs):
# Training phase
train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
# Validation phase
val_loss, val_acc = validate(model, val_loader, criterion, device)
print(f'Epoch {epoch+1}/{num_epochs}')
print(f'Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%')
# Save best model
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save(model.state_dict(), 'best_model.pth')
# Final testing
test_acc = test(model, test_loader, device)
print(f'Test Accuracy: {test_acc:.2f}%')