PyTorch Deep Learning Workflow: From Data to Deployment
Building a deep learning model can feel overwhelming with so many moving parts—datasets, dataloaders, loss functions, optimizers, and training loops. This guide breaks down the entire PyTorch workflow into 7 clear, actionable steps that take you from raw data to a deployed model.
Whether you're training your first neural network or refining your understanding of PyTorch fundamentals, this structured approach will help you build models systematically and avoid common pitfalls.
Hands-on Example: Google Colab Notebook - Image Classification
The 7-Step PyTorch Workflow
Every PyTorch project follows this systematic pipeline:
1. Data → Dataset → DataLoaders
2. Model Definition → Neural network architecture
3. Loss Function → Measure prediction error
4. Optimizer → Update weights to minimize loss
5. Training Loop → Iterate and improve
6. Evaluation → Validate and test performance
7. Deployment → Save and use the model
flowchart TD
A[Problem Definition] --> B[Raw Data]
B --> C[Dataset<br/>__len__ & __getitem__]
C --> D[DataLoader<br/>Batching & Shuffling]
D --> E[Model<br/>nn.Module]
E --> F[Loss Function]
F --> G[Optimizer]
G --> H[Training Loop]
H --> I[Validation]
I --> J[Testing]
J --> K[Saved Model]
K --> L[Inference / Deployment]
Step 1: Preparing Data with Dataset and DataLoader
PyTorch doesn't iterate through raw NumPy arrays or CSV files—it expects tensors wrapped in a Dataset abstraction. This standardized approach works for any data type: images, tabular data, time series, audio, or text.
1.1 Data Splitting: Train, Validation, Test
Before creating datasets, split your raw data into three subsets to prevent data leakage and get reliable performance metrics:
from sklearn.model_selection import train_test_split
# Split: 70% train, 15% validation, 15% test
train_data, temp_data, train_labels, temp_labels = train_test_split(
X, y, test_size=0.3, random_state=42
)
val_data, test_data, val_labels, test_labels = train_test_split(
temp_data, temp_labels, test_size=0.5, random_state=42
)
| Split | Size | Purpose | When to Use |
|---|---|---|---|
| Training | 60-80% | Update model weights | Every epoch during training |
| Validation | 10-20% | Monitor overfitting, tune hyperparameters | After each epoch |
| Test | 10-20% | Final performance evaluation | Once at the very end |
Critical Note: Never use test data during training or hyperparameter tuning—it must remain completely unseen.
1.2 Creating a Custom Dataset
A PyTorch Dataset must implement two essential methods:
- `__len__` → Returns the total number of samples
- `__getitem__` → Returns one sample at a given index
Basic Example (Tabular Data):
from torch.utils.data import Dataset
import torch
class TabularDataset(Dataset):
def __init__(self, X, y):
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.float32)
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
# Create datasets for each split
train_dataset = TabularDataset(train_data, train_labels)
val_dataset = TabularDataset(val_data, val_labels)
test_dataset = TabularDataset(test_data, test_labels)
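A quick sanity check confirms the two required methods behave as expected. This is a minimal sketch using tiny hand-made lists in place of the real splits:

```python
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# three samples with two features each (toy data for illustration)
ds = TabularDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 1.0, 0.0])
print(len(ds))       # 3  (__len__)
x0, y0 = ds[0]
print(x0.shape, y0)  # torch.Size([2]) tensor(0.)  (__getitem__)
```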
Advanced Example (Images with Transforms):
from torchvision import transforms
from PIL import Image
class ImageDataset(Dataset):
def __init__(self, image_paths, labels, transform=None):
self.image_paths = image_paths
self.labels = labels
self.transform = transform
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
# Load image
image = Image.open(self.image_paths[idx]).convert('RGB')
label = self.labels[idx]
# Apply transforms (resize, normalize, augment)
if self.transform:
image = self.transform(image)
return image, label
# Define transforms
train_transform = transforms.Compose([
transforms.RandomHorizontalFlip(), # Data augmentation
transforms.RandomRotation(10),
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
val_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
train_dataset = ImageDataset(train_paths, train_labels, transform=train_transform)
val_dataset = ImageDataset(val_paths, val_labels, transform=val_transform)
Why Transforms?
- Normalization: Speeds up training by scaling pixel values to a standard range
- Augmentation: Creates variations (flips, rotations) to improve generalization
- Consistency: Ensures all images have the same dimensions
1.3 DataLoader: Efficient Batch Processing
The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading automatically.
from torch.utils.data import DataLoader
train_loader = DataLoader(
train_dataset,
batch_size=32, # Number of samples per batch
shuffle=True, # Randomize order each epoch
num_workers=4, # Parallel data loading
pin_memory=True # Faster GPU transfer (if using CUDA)
)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
Key Parameters Explained:
| Parameter | Typical Value | Purpose | Notes |
|---|---|---|---|
| `batch_size` | 32, 64, 128 | Number of samples processed together | Larger = more GPU memory, faster epochs, may generalize worse |
| `shuffle` | `True` (train only) | Randomize sample order | Prevents model from learning order patterns |
| `num_workers` | 4-8 | Parallel worker processes for loading | Set to 0 for debugging; higher = faster loading, up to CPU core count |
| `pin_memory` | `True` (if GPU) | Speeds up CPU-to-GPU transfer | Only use when training on GPU |
How to Choose Batch Size:
# Rule of thumb: Start with 32, then increase until GPU memory is ~80% full
# Check GPU memory usage:
# nvidia-smi (in terminal)
# Too large → Out of memory error
# Too small → Slow training, unstable gradients
# Just right → Fast training with stable convergence
Understanding num_workers:
# num_workers=0 → Main process loads data (slower, but good for debugging)
# num_workers=4 → 4 parallel workers load data while GPU trains (faster)
# num_workers=8 → Even faster, but diminishing returns beyond CPU core count
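To see batching in action, the sketch below wraps synthetic tensors (100 samples, 8 features, shapes chosen only for illustration) in a built-in `TensorDataset` and iterates one batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic data: 100 samples × 8 features, binary labels
X = torch.randn(100, 8)
y = torch.randint(0, 2, (100,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([32, 8]) torch.Size([32])
print(len(loader))         # 4 batches: 32 + 32 + 32 + 4
```

Note the last batch is smaller (4 samples) because 100 is not divisible by 32; pass `drop_last=True` if the model requires fixed-size batches.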
1.4 Common Pitfalls and How to Avoid Them
Mistake 1: Not Converting to Tensors
# Wrong
def __getitem__(self, idx):
return self.data[idx] # Returns NumPy array or list
# Correct
def __getitem__(self, idx):
return torch.tensor(self.data[idx], dtype=torch.float32)
Mistake 2: Data Leakage (Using Test Data During Training)
# Wrong: Using test data to normalize training data
scaler.fit(X_test) # DON'T DO THIS
# Correct: Fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
Mistake 3: Shuffling Validation/Test Data
# Wrong
val_loader = DataLoader(val_dataset, shuffle=True) # Unnecessary and inconsistent
# Correct
val_loader = DataLoader(val_dataset, shuffle=False) # Reproducible metrics
Mistake 4: Dimension Mismatches
# Wrong: Shape (batch, features) but model expects (batch, channels, height, width)
return self.images[idx] # Shape: (224, 224, 3)
# Correct: Transpose to (channels, height, width)
image = self.images[idx].transpose(2, 0, 1) # Shape: (3, 224, 224)
# Or use transforms.ToTensor() which handles this automatically
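For tensors (rather than NumPy arrays), the equivalent of the transpose above is `permute`, a minimal sketch with a random image-shaped tensor:

```python
import torch

# HWC layout (common when loading images) → CHW layout (what PyTorch conv layers expect)
img_hwc = torch.rand(224, 224, 3)
img_chw = img_hwc.permute(2, 0, 1)
print(img_chw.shape)  # torch.Size([3, 224, 224])
```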
The Complete Data Pipeline
flowchart TB
A[Raw Data] --> B[Train/Val/Test Split]
B --> C1[Train Dataset]
B --> C2[Val Dataset]
B --> C3[Test Dataset]
C1 --> D1[Train DataLoader<br/>batch_size=32<br/>shuffle=True<br/>num_workers=4]
C2 --> D2[Val DataLoader<br/>batch_size=32<br/>shuffle=False]
C3 --> D3[Test DataLoader<br/>batch_size=32<br/>shuffle=False]
D1 --> E[Training Loop]
D2 --> F[Validation Loop]
D3 --> G[Final Testing]
style C1 fill:#e3f2fd,stroke:#1e88e5
style C2 fill:#e3f2fd,stroke:#1e88e5
style C3 fill:#e3f2fd,stroke:#1e88e5
style D1 fill:#f1f8e9,stroke:#7cb342
style D2 fill:#f1f8e9,stroke:#7cb342
style D3 fill:#f1f8e9,stroke:#7cb342
Remember: Dataset = how to fetch one sample | DataLoader = how to feed batches efficiently
Step 2: Building the Model
class MyModel(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.layers(x)
In PyTorch, all neural networks inherit from nn.Module. This base class handles parameter management, device movement, and training/evaluation modes automatically. Your job is to define the architecture in __init__ and specify how data flows through the model in forward().
2.1 Understanding nn.Module
Every PyTorch model follows this structure:
import torch.nn as nn
class MyModel(nn.Module):
def __init__(self):
super().__init__() # Initialize parent class (REQUIRED)
# Define layers here
def forward(self, x):
# Define how data flows through layers
return output
Key Concepts:
| Component | Purpose | When It Runs |
|---|---|---|
| `super().__init__()` | Initializes `nn.Module` functionality | Once, during model creation |
| `__init__` | Defines what layers exist | Once, during model creation |
| `forward()` | Defines the computation graph | Every time you pass data through the model |
Why inherit from nn.Module?
- Automatic parameter tracking (`.parameters()`)
- Easy device management (`.to(device)`)
- Training/eval mode switching (`.train()`, `.eval()`)
- State saving/loading (`.state_dict()`)
2.2 Basic Model Architecture
Simple Feedforward Network (Tabular Data):
class SimpleNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.fc1 = nn.Linear(input_size, hidden_size)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_size, output_size)
def forward(self, x):
x = self.fc1(x) # Input → Hidden layer
x = self.relu(x) # Apply activation
x = self.fc2(x) # Hidden → Output layer
return x
# Instantiate model
model = SimpleNN(input_size=784, hidden_size=128, output_size=10)
# Check architecture
print(model)
# Output:
# SimpleNN(
# (fc1): Linear(in_features=784, out_features=128, bias=True)
# (relu): ReLU()
# (fc2): Linear(in_features=128, out_features=10, bias=True)
# )
Using Sequential (More Compact):
class SimpleNNSequential(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.model = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Dropout(0.2), # Regularization
nn.Linear(hidden_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.model(x)
When to use Sequential vs. Explicit layers:
- Sequential: Simple linear flow (A → B → C)
- Explicit layers: Complex connections (skip connections, multiple branches)
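As an example of a flow that `Sequential` cannot express, here is a minimal sketch of a block with a skip connection (the `ResidualBlock` name and dimensions are illustrative, not from the article):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.fc1(x))
        out = self.fc2(out)
        # the skip connection (out + x) needs an explicit forward();
        # Sequential can only chain A → B → C
        return self.relu(out + x)

block = ResidualBlock(16)
print(block(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```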
2.3 Common Layer Types
| Layer | Purpose | Typical Use Case | Input Shape Example |
|---|---|---|---|
| `nn.Linear(in, out)` | Fully connected layer | Tabular data, final classifier | (batch, features) |
| `nn.Conv2d(in, out, kernel)` | 2D convolution | Image feature extraction | (batch, channels, H, W) |
| `nn.LSTM(in, hidden)` | Recurrent layer | Sequences, time series | (seq_len, batch, features) |
| `nn.Dropout(p)` | Random neuron disabling | Prevent overfitting | Any shape |
| `nn.BatchNorm2d(channels)` | Normalize activations | Stable training | (batch, C, H, W) |
| `nn.Embedding(vocab, dim)` | Word vectors | NLP tasks | (batch, seq_len) |
2.4 Activation Functions
Activation functions introduce non-linearity, allowing networks to learn complex patterns. Without them, stacking layers would be equivalent to a single linear transformation.
📚 Want to dive deeper into activation functions? Read our comprehensive guide: Activation Functions: The Neurons' Decision Makers covering mathematical formulas, advantages/disadvantages, weight initialization strategies, activation-loss pairings, and practical PyTorch examples.
# Common activation functions
activations = {
'ReLU': nn.ReLU(), # Most common: max(0, x)
'LeakyReLU': nn.LeakyReLU(), # Fixes dying ReLU: max(0.01x, x)
'Tanh': nn.Tanh(), # Range: [-1, 1]
'Sigmoid': nn.Sigmoid(), # Range: [0, 1]
'Softmax': nn.Softmax(dim=1) # Probability distribution
}
| Activation | Range | Best For | Downsides |
|---|---|---|---|
| ReLU | [0, ∞) |
Hidden layers (most tasks) | Dying ReLU (neurons get stuck at 0) |
| LeakyReLU | (-∞, ∞) |
When ReLU causes dead neurons | Slightly more computation |
| Tanh | [-1, 1] |
RNNs, normalized outputs | Vanishing gradient |
| Sigmoid | [0, 1] |
Binary classification output | Vanishing gradient |
| Softmax | [0, 1] (sum=1) |
Multi-class output layer | Use with CrossEntropyLoss |
Dying ReLU Problem:
# If ReLU input is always negative, gradient becomes 0
# Neuron never updates → "dead neuron"
# Solution: Use LeakyReLU or ensure proper initialization
model = nn.Sequential(
nn.Linear(100, 50),
nn.LeakyReLU(0.01), # Allows small negative gradient
nn.Linear(50, 10)
)
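The difference between the two activations on negative inputs is easy to verify directly; LeakyReLU keeps a small gradient path alive instead of flattening everything below zero:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(nn.ReLU()(x))           # tensor([0., 0., 0., 1.])
print(nn.LeakyReLU(0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.0000])
```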
2.5 Model Architectures for Different Tasks
Convolutional Neural Network (Image Classification):
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
# Convolutional feature extractor
self.features = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1), # 3 RGB channels → 32 filters
nn.ReLU(),
nn.MaxPool2d(2, 2), # Downsample by 2
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# Fully connected classifier
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 4 * 4, 512), # Depends on input size
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
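The `128 * 4 * 4` in the classifier follows from the input size. Assuming 32×32 RGB inputs (e.g. CIFAR-10), each `MaxPool2d(2)` halves the spatial size: 32 → 16 → 8 → 4. A minimal sketch of just the feature extractor confirms this:

```python
import torch
import torch.nn as nn

# the feature extractor above, condensed; assumes 32×32 RGB input (e.g. CIFAR-10)
features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
out = features(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 128, 4, 4]) → Flatten yields 128 * 4 * 4 = 2048 features
```

For a different input size, run this check first and adjust the first `nn.Linear` accordingly.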
LSTM for Sequence Data (Text, Time Series):
class SequenceModel(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
lstm_out, (hidden, cell) = self.lstm(embedded)
output = self.fc(hidden[-1]) # Use final hidden state
return output
Transfer Learning (Using Pretrained Models):
from torchvision import models
class TransferLearningModel(nn.Module):
def __init__(self, num_classes):
super().__init__()
        # Load pretrained ResNet (torchvision >= 0.13; replaces the deprecated pretrained=True)
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Freeze early layers
for param in self.backbone.parameters():
param.requires_grad = False
# Replace final layer
num_features = self.backbone.fc.in_features
self.backbone.fc = nn.Linear(num_features, num_classes)
def forward(self, x):
return self.backbone(x)
2.6 Device Management (CPU vs GPU)
Always move your model to the appropriate device before training:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Move model to device
model = SimpleNN(784, 128, 10)
model = model.to(device)
# Data must also be moved to the same device during training
# This happens in the training loop:
for X_batch, y_batch in train_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
# ... training code
Multi-GPU Training:
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)  # simple; DistributedDataParallel is preferred for serious multi-GPU work
model = model.to(device)
2.7 Model Inspection & Debugging
Count Parameters:
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = count_parameters(model)
print(f"Total trainable parameters: {total_params:,}")
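On a single `nn.Linear(10, 5)` the count is easy to verify by hand: a 10×5 weight matrix plus a bias of 5 gives 55 parameters.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

tiny = nn.Linear(10, 5)        # weights: 10 × 5 = 50, bias: 5
print(count_parameters(tiny))  # 55
```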
Check Output Shape:
# Create dummy input to verify dimensions
dummy_input = torch.randn(1, 3, 224, 224).to(device) # (batch=1, channels=3, H=224, W=224)
output = model(dummy_input)
print(f"Output shape: {output.shape}") # Should match expected output
Layer-by-Layer Inspection:
# Print each layer's output shape
def print_shapes(model, input_size):
x = torch.randn(input_size)
for name, layer in model.named_children():
x = layer(x)
print(f"{name}: {x.shape}")
print_shapes(model, (1, 3, 224, 224))
2.8 Common Pitfalls and Solutions
Mistake 1: Forgetting super().__init__()
# Wrong
class MyModel(nn.Module):
def __init__(self):
self.fc = nn.Linear(10, 5) # Missing super().__init__()
# Correct
class MyModel(nn.Module):
def __init__(self):
super().__init__() # REQUIRED
self.fc = nn.Linear(10, 5)
Mistake 2: Dimension Mismatches
# Wrong: Output of fc1 (128) doesn't match input of fc2 (256)
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(256, 10) # Expects 256 inputs!
# Correct: Dimensions must align
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 10) # Matches fc1 output
Mistake 3: Not Matching Model Input with Data
# Data shape: (batch, 28, 28) - 2D image
# Model expects: (batch, 784) - flattened
# Solution: Flatten in forward()
def forward(self, x):
x = x.view(x.size(0), -1) # Flatten: (batch, 28, 28) → (batch, 784)
x = self.fc1(x)
return x
Mistake 4: Using Softmax with CrossEntropyLoss
# Wrong: CrossEntropyLoss applies softmax internally
def forward(self, x):
x = self.fc(x)
return F.softmax(x, dim=1) # DON'T DO THIS
# Correct: Return raw logits
def forward(self, x):
return self.fc(x) # Let CrossEntropyLoss handle softmax
Mistake 5: Training Mode in Evaluation
# Wrong: Dropout and BatchNorm behave differently in train vs eval mode
predictions = model(test_data) # May still be in training mode
# Correct: Set to evaluation mode
model.eval()
with torch.no_grad():
predictions = model(test_data)
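The mode difference is visible with dropout alone, a minimal sketch on a tensor of ones:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # random: roughly half the entries zeroed, survivors scaled to 1/(1-p) = 2.0

drop.eval()
print(drop(x))  # identity: all ones — dropout is disabled in eval mode
```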
Architecture Design Guidelines
flowchart LR
A[Input Layer] --> B[Hidden Layers]
B --> C[Output Layer]
B --> D[Gradually Decrease<br/>Width]
B --> E[Add Dropout<br/>for Regularization]
B --> F[Use BatchNorm<br/>for Stability]
style A fill:#e3f2fd,stroke:#1e88e5
style C fill:#e3f2fd,stroke:#1e88e5
style B fill:#f1f8e9,stroke:#7cb342
Common Patterns:
| Pattern | Architecture | Use Case |
|---|---|---|
| Pyramid | 784 → 512 → 256 → 128 → 10 | Gradually compress information |
| Hourglass | 100 → 50 → 25 → 50 → 100 | Autoencoders, compression |
| Residual | Skip connections around blocks | Very deep networks (ResNet) |
| Wide & Shallow | 100 → 500 → 10 | Simple patterns, fast training |
| Narrow & Deep | 100 → 80 → 60 → 40 → 20 → 10 | Complex patterns, risk overfitting |
Remember: Start simple, add complexity only if needed. A smaller model that generalizes well beats a massive overfitted model.
Step 3: Loss Function
Understanding Loss: Measuring Model Performance
A loss function quantifies the gap between your model's predictions and the actual targets. It converts this error into a single numerical value that the optimizer can minimize.
Core Principle: Lower loss = Better predictions | Higher loss = Worse predictions
📚 Want to dive deeper into loss functions? Read our comprehensive guide: Loss Functions: Measuring Model Performance covering mathematical foundations, detailed explanations of regression and classification losses, handling class imbalance, and implementing custom loss functions.
Common Loss Functions
| Task Type | Loss Function | Purpose | Example Use Case |
|---|---|---|---|
| Regression | `MSELoss` (Mean Squared Error) | Penalizes large errors heavily | House price prediction, temperature forecasting |
| Multi-class Classification | `CrossEntropyLoss` | Measures confidence in correct class | Image classification (dog/cat/bird) |
| Binary Classification | `BCEWithLogitsLoss` | Optimizes binary decisions | Spam detection, disease diagnosis |
| Object Detection | `SmoothL1Loss` | Robust to outliers | Bounding box regression |
Example Implementation:
# For multi-class classification
criterion = nn.CrossEntropyLoss()
# For regression tasks
criterion = nn.MSELoss()
# For binary classification
criterion = nn.BCEWithLogitsLoss()
Step 4: Optimizer
How Models Learn: Weight Updates
The optimizer adjusts model parameters (weights and biases) to minimize the loss function. It determines how the model learns from its mistakes.
Key Relationship:
- Loss function → "How wrong am I?"
- Optimizer → "How should I fix it?"
Popular Optimizers
| Optimizer | Best For | Pros | Cons |
|---|---|---|---|
| SGD | Large datasets, simple problems | Memory efficient, well-understood | Can be slow, needs careful tuning |
| Adam | Most deep learning tasks | Adaptive learning rates, fast convergence | More memory, can overfit small datasets |
| AdamW | Transformers, modern architectures | Better regularization than Adam | Slightly more complex |
| RMSprop | Recurrent networks | Good for non-stationary objectives | Less popular than Adam |
Example Setup:
import torch.optim as optim
# Adam optimizer (most common starting point)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# SGD with momentum (for large-scale training)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# AdamW (for transformer models)
optimizer = optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)
Understanding Learning Rate
The learning rate (lr) controls how much the model's weights change with each update. It's one of the most critical hyperparameters.
Weight Update Formula
new_weight = old_weight - lr × gradient
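The formula can be executed by hand on a toy loss, here (w − 3)², assuming lr = 0.1:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 3) ** 2
loss.backward()            # dloss/dw = 2(w - 3) = -6
with torch.no_grad():
    w -= 0.1 * w.grad      # new_weight = old_weight - lr × gradient
print(w.item())            # 0.6 — one step toward the minimum at w = 3
```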
Learning Rate Guidelines
| Learning Rate | Typical Range | Training Behavior | Visual Symptoms | When to Use |
|---|---|---|---|---|
| Too High | 0.1+ (Adam), 1.0+ (SGD) | Chaotic, unstable jumps | Loss spikes, NaN values, zigzag pattern | Never (unless using lr schedulers) |
| Optimal | 1e-3 to 1e-4 (Adam), 1e-2 to 1e-1 (SGD) | Smooth, steady improvement | Consistently decreasing loss curve | Start here, adjust based on results |
| Too Low | 1e-6 (Adam), 1e-5 (SGD) | Glacially slow progress | Flat loss curve, minimal change per epoch | When fine-tuning pretrained models |
Visual Analogy: Walking Down a Hill
Think of training as finding the lowest point in a valley:
- High learning rate → Taking giant leaps, might jump over the valley
- Optimal learning rate → Steady, purposeful steps downward
- Low learning rate → Tiny baby steps, takes forever to reach the bottom
Step 5: Training Loop
flowchart LR
A[Input Batch] --> B[Forward Pass]
B --> C[Compute Loss]
C --> D[Zero Gradients]
D --> E[Backward Pass]
E --> F[Update Weights]
F --> G{More Batches?}
G -->|Yes| A
G -->|No| H[Validation]
Complete Training Loop
def train_epoch(model, train_loader, criterion, optimizer, device):
model.train() # Set model to training mode
total_loss = 0
for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
# 1. Move data to device (GPU/CPU)
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
# 2. Zero gradients from previous iteration
optimizer.zero_grad()
# 3. Forward pass: compute predictions
predictions = model(X_batch)
# 4. Compute loss
loss = criterion(predictions, y_batch)
# 5. Backward pass: compute gradients
loss.backward()
# 6. Update weights using gradients
optimizer.step()
total_loss += loss.item()
return total_loss / len(train_loader)
Key Operations:
| Operation | Purpose | Why It's Needed |
|---|---|---|
| `optimizer.zero_grad()` | Clear old gradients | Gradients accumulate by default; we need fresh gradients each iteration |
| `loss.backward()` | Compute gradients | Calculates how each weight contributed to the loss (via the chain rule) |
| `optimizer.step()` | Update weights | Applies the computed gradients to adjust model parameters |
| `model.to(device)` | Move to GPU | Accelerates computation 10-100× on compatible hardware |
| `model.train()` | Enable training mode | Activates dropout and batch norm training behavior |
Step 6: Validation and Testing
Three Distinct Phases
flowchart TD
A[Full Dataset] --> B[Training Set 70%]
A --> C[Validation Set 15%]
A --> D[Test Set 15%]
B --> E[Update Weights]
C --> F[Tune Hyperparameters<br/>Monitor Overfitting]
D --> G[Final Evaluation<br/>Report Results]
E --> H{Epoch Complete?}
H -->|Yes| F
H -->|No| E
F --> I{Training Complete?}
I -->|No| E
I -->|Yes| G
Validation During Training
Purpose: Catch overfitting early and guide hyperparameter tuning
def validate(model, val_loader, criterion, device):
model.eval() # Disable dropout, fix batch norm
total_loss = 0
correct = 0
total = 0
with torch.no_grad(): # Save memory, speed up computation
for X_val, y_val in val_loader:
X_val, y_val = X_val.to(device), y_val.to(device)
predictions = model(X_val)
loss = criterion(predictions, y_val)
total_loss += loss.item()
# For classification: compute accuracy
_, predicted = torch.max(predictions, 1)
total += y_val.size(0)
correct += (predicted == y_val).sum().item()
accuracy = 100 * correct / total
avg_loss = total_loss / len(val_loader)
return avg_loss, accuracy
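The accuracy bookkeeping inside `validate` relies on `torch.max` over the class dimension; a toy batch makes the mechanics explicit:

```python
import torch

# three samples, three classes (illustrative logits)
logits = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.2, 0.4, 3.0]])
_, predicted = torch.max(logits, 1)   # index of the highest score per row
targets = torch.tensor([1, 0, 0])
correct = (predicted == targets).sum().item()
print(predicted.tolist(), correct)    # [1, 0, 2] 2 → 2 of 3 correct
```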
Testing After Training
Purpose: Unbiased evaluation of final model performance
def test(model, test_loader, device):
model.eval()
correct = 0
total = 0
all_predictions = []
all_targets = []
with torch.no_grad():
for X_test, y_test in test_loader:
X_test, y_test = X_test.to(device), y_test.to(device)
predictions = model(X_test)
_, predicted = torch.max(predictions, 1)
all_predictions.extend(predicted.cpu().numpy())
all_targets.extend(y_test.cpu().numpy())
total += y_test.size(0)
correct += (predicted == y_test).sum().item()
# Compute comprehensive metrics
from sklearn.metrics import classification_report
print(classification_report(all_targets, all_predictions))
return 100 * correct / total
| Phase | When | Purpose | Affects Training? |
|---|---|---|---|
| Training | Every epoch | Update weights to minimize loss | Yes |
| Validation | After each epoch | Monitor overfitting, tune hyperparameters | Indirectly (early stopping, learning rate adjustment) |
| Testing | Once, at the end | Report final performance | No |
flowchart LR
D[Dataset] --> T[Training Loader<br/>shuffle=True]
D --> V[Validation Loader<br/>shuffle=False]
D --> TS[Test Loader<br/>shuffle=False]
T --> M[Model Training]
V --> E[Monitor Performance<br/>During Training]
TS --> F[Final Evaluation<br/>After Training]
Step 7: Saving the Model and Inference
After training, save the model's weights so you can reload the model later, avoid retraining, and deploy it.
# ✓ RECOMMENDED: Save only weights (portable, version-safe)
torch.save(model.state_dict(), 'model_weights.pth')
# Load weights later
model = YourModelClass() # Must define architecture first
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
# ✓ GOOD: Save checkpoint with training state
checkpoint = {
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'accuracy': accuracy
}
torch.save(checkpoint, 'checkpoint.pth')
# ✗ NOT RECOMMENDED: Save entire model (fragile, version-dependent)
torch.save(model, 'entire_model.pth') # Breaks if PyTorch version changes
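A round-trip sanity check (sketched here with a bare `nn.Linear` and a temp file in place of a real model and path) confirms that a reloaded `state_dict` reproduces the original outputs exactly:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'model_weights.pth')
torch.save(model.state_dict(), path)

reloaded = nn.Linear(4, 2)                  # must construct the same architecture first
reloaded.load_state_dict(torch.load(path))
reloaded.eval()

x = torch.randn(1, 4)
with torch.no_grad():
    print(torch.equal(model(x), reloaded(x)))  # True: identical weights → identical outputs
```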
Model Saving Strategies
flowchart TD
A[Training Complete] --> B{What to Save?}
B -->|state_dict only| C[✓ Portable<br/>✓ Version-safe<br/>✓ Recommended]
B -->|Entire model| D[✗ Version-dependent<br/>✗ Fragile<br/>✗ Not recommended]
B -->|Full checkpoint| E[✓ Resume training<br/>✓ Track progress<br/>✓ Best for experiments]
C --> F[Production Deployment]
E --> G[Research & Development]
D --> H[Legacy Code Only]
Inference Mode
# Prepare model for production
model.eval() # Disable dropout, batch norm training mode
# Make predictions
with torch.no_grad(): # Disable gradient computation
input_tensor = preprocess(input_data)
input_tensor = input_tensor.to(device)
output = model(input_tensor)
prediction = torch.argmax(output, dim=1)
print(f"Predicted class: {prediction.item()}")
Quick Reference: Full Training Script
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YourModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 50
best_val_loss = float('inf')
for epoch in range(num_epochs):
# Training phase
train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
# Validation phase
val_loss, val_acc = validate(model, val_loader, criterion, device)
print(f'Epoch {epoch+1}/{num_epochs}')
print(f'Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%')
# Save best model
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save(model.state_dict(), 'best_model.pth')
# Final testing
test_acc = test(model, test_loader, device)
print(f'Test Accuracy: {test_acc:.2f}%')