Part 2: Autograd and Automatic Differentiation

Part of the PyTorch 101 Series

The Night I Understood Gradients

I spent three days debugging a neural network. The loss wouldn't decrease. I'd made a sign error in the gradient calculation - a minus instead of a plus on line 47 of my 200-line hand-written backward pass.

That's when I tried PyTorch's autograd. One line: loss.backward(). All gradients computed correctly, automatically.

That bug taught me: never manually calculate gradients again.

Autograd is PyTorch's superpower. Let me show you how it works.

What is Automatic Differentiation?

Automatic differentiation (autograd) computes derivatives for you by tracking the operations performed on tensors.

Why it matters:

  • Neural networks learn by computing gradients

  • Manual gradient calculation is error-prone

  • Autograd handles complex architectures automatically

The magic: Every operation on tensors with requires_grad=True is tracked in a computational graph.

Basic Autograd

Simple Example
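A minimal sketch of what this could look like (the function matches the manual check below):

    import torch

    # Leaf tensor with gradient tracking enabled
    x = torch.tensor(2.0, requires_grad=True)

    # y = x^2 + 3x + 1
    y = x**2 + 3*x + 1

    # Populates x.grad with dy/dx
    y.backward()

    print(x.grad)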

Output:
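    tensor(7.)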

Verify manually: y = x² + 3x + 1, dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓

Multiple Operations
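Here's a sketch that lines up with the trace below (the intermediate names a, b, c are my own):

    import torch

    x = torch.tensor(3.0, requires_grad=True)

    a = x * 2      # 6
    b = a + 5      # 11
    c = b ** 2     # 121
    y = c          # 121

    y.backward()
    print(x.grad)  # tensor(44.)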

Under the hood:

  • x = 3

  • a = 3 * 2 = 6

  • b = 6 + 5 = 11

  • c = 11² = 121

  • y = 121

Gradient (chain rule):

  • dy/dc = 1

  • dc/db = 2b = 22

  • db/da = 1

  • da/dx = 2

  • dy/dx = 1 * 22 * 1 * 2 = 44

Perfect!

Computational Graphs

When you perform operations on tensors, PyTorch builds a computational graph.

Graph structure:
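For the multiple-operations example above, it's a simple chain (a rough sketch; each non-leaf tensor stores a grad_fn pointing at the operation that created it):

    x ──mul──> a ──add──> b ──pow──> c (= y)

    x.grad_fn   # None (leaf)
    a.grad_fn   # <MulBackward0>
    b.grad_fn   # <AddBackward0>
    c.grad_fn   # <PowBackward0>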

Gradients flow backward through this graph using chain rule.

Leaf vs Non-Leaf Tensors

Leaf tensors: created directly by the user (with requires_grad=True).
Non-leaf tensors: results of operations on other tensors.
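A quick way to tell them apart (a minimal sketch):

    import torch

    x = torch.tensor(2.0, requires_grad=True)   # leaf: created by the user
    y = x * 3                                    # non-leaf: result of an operation

    print(x.is_leaf, x.grad_fn)   # True None
    print(y.is_leaf, y.grad_fn)   # False <MulBackward0 object at 0x...>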

Gradient Accumulation

By default, gradients accumulate in .grad - each backward() call adds to whatever is already stored rather than overwriting it.
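For example (a minimal sketch) - calling backward() twice on the same computation adds into x.grad:

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    y = x ** 2
    y.backward()
    print(x.grad)   # tensor(4.)

    y = x ** 2      # rebuild the graph
    y.backward()
    print(x.grad)   # tensor(8.) - accumulated, not overwritten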

In training loops, always zero gradients:
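Typically via the optimizer (a minimal sketch; for a bare tensor you can call x.grad.zero_() instead):

    optimizer.zero_grad()   # clear gradients from the previous iteration
    loss.backward()         # compute fresh gradients
    optimizer.step()        # update parameters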

Controlling Gradient Computation

Detach from Graph

Use case: When you want to use a value but not backpropagate through it.
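A minimal sketch of detach() - z shares y's value but is treated as a constant by autograd:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2

    z = y.detach()
    print(z.requires_grad)   # False

    loss = y * z             # gradients flow through y, but z is a constant
    loss.backward()
    print(x.grad)            # tensor(16.) = z * dy/dx = 4 * 4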

No Grad Context

I use no_grad for inference - it saves memory and speeds up computation.
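A minimal sketch:

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    with torch.no_grad():
        y = x * 2            # no graph is recorded inside this block

    print(y.requires_grad)   # False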

Inference Mode

Difference: inference_mode disables more of the autograd machinery than no_grad - it's faster, but tensors created inside it can't be used later in computations that need gradients.
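Usage looks the same as no_grad (a minimal sketch):

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    with torch.inference_mode():
        y = x * 2

    print(y.requires_grad)   # False
    # Unlike no_grad, tensors created here can't be used later in
    # computations that need gradients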

Gradient for Non-Scalar Outputs

When the output is not a scalar, backward() needs an explicit gradient argument: a tensor of the same shape as the output that weights each element.
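A minimal sketch of passing an explicit gradient:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = x ** 2                                # vector output

    y.backward(gradient=torch.ones_like(y))   # weight every element equally
    print(x.grad)                             # tensor([2., 4., 6.])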

Common pattern: Sum output to scalar
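Continuing the snippet above:

    x.grad = None        # reset the gradient from the previous example
    y = x ** 2
    y.sum().backward()   # the sum is a scalar, so no gradient argument is needed
    print(x.grad)        # tensor([2., 4., 6.]) - same result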

Real Example: Custom Loss Function

I built a custom loss function for a recommendation system that needed to balance accuracy and diversity.
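The production code isn't reproduced here, but the shape of it was roughly this - the function name and the variance-based diversity term are illustrative, not the real system:

    import torch
    import torch.nn.functional as F

    def accuracy_diversity_loss(scores, targets, diversity_weight=0.1):
        # Accuracy: how close predicted scores are to the targets
        accuracy_term = F.mse_loss(scores, targets)
        # Diversity: penalize recommendations whose scores all look the same
        diversity_term = -scores.var()
        return accuracy_term + diversity_weight * diversity_term

    scores = torch.rand(8, requires_grad=True)
    targets = torch.rand(8)

    loss = accuracy_diversity_loss(scores, targets)
    loss.backward()              # autograd handles the whole expression
    print(scores.grad.shape)     # torch.Size([8])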

That's the power of autograd - complex custom losses just work.

Higher-Order Gradients

Computing gradients of gradients (second derivatives):
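A minimal sketch using torch.autograd.grad with create_graph=True:

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    y = x ** 3                   # dy/dx = 3x^2, d2y/dx2 = 6x

    # First derivative - keep the graph so we can differentiate again
    (first,) = torch.autograd.grad(y, x, create_graph=True)
    print(first)                 # 27.0 at x = 3

    # Second derivative
    (second,) = torch.autograd.grad(first, x)
    print(second)                # 18.0 at x = 3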

When I use this: Implementing physics-informed neural networks (PINNs) that enforce differential equations.

Custom Autograd Functions

For operations PyTorch doesn't support or for special backward passes:
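A minimal sketch of the torch.autograd.Function pattern - the custom exp here is just a stand-in for whatever operation you need:

    import torch

    class MyExp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            result = torch.exp(x)
            ctx.save_for_backward(result)    # stash what backward() will need
            return result

        @staticmethod
        def backward(ctx, grad_output):
            (result,) = ctx.saved_tensors
            return grad_output * result      # d/dx exp(x) = exp(x)

    x = torch.tensor(1.0, requires_grad=True)
    y = MyExp.apply(x)
    y.backward()
    print(x.grad)   # tensor(2.7183) = e^1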

Gradient Checking

Verify custom gradients with numerical approximation:
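A minimal sketch with torch.autograd.gradcheck (double precision keeps the numerical comparison stable):

    import torch

    def f(x):
        return (x ** 2).sum()

    x = torch.randn(5, dtype=torch.double, requires_grad=True)

    # Compares analytical gradients against finite-difference estimates
    print(torch.autograd.gradcheck(f, (x,)))   # True if they agree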

I use gradient checking when implementing custom layers to verify correctness.

Common Patterns

Training Loop with Autograd
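A minimal sketch (the model, loss, and data here are stand-ins; dataloader is assumed to yield (inputs, targets) batches):

    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.MSELoss()

    for inputs, targets in dataloader:
        optimizer.zero_grad()                       # 1. clear old gradients
        loss = criterion(model(inputs), targets)
        loss.backward()                             # 2. compute gradients
        optimizer.step()                            # 3. update parameters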

Three critical steps:

  1. zero_grad() - Clear gradients

  2. backward() - Compute gradients

  3. step() - Update parameters

Gradient Clipping

Prevent exploding gradients:
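It goes between backward() and step() (a sketch, reusing the training-loop placeholders):

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()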

I use gradient clipping for RNNs and transformers to stabilize training.

Autograd Profiler

Find performance bottlenecks:
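A minimal sketch with torch.autograd.profiler (the model and input sizes are arbitrary):

    import torch

    model = torch.nn.Linear(100, 10)
    inputs = torch.randn(32, 100)

    with torch.autograd.profiler.profile() as prof:
        model(inputs).sum().backward()

    # Most expensive operations first
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))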

Best Practices

From my experience:

1. Always zero gradients before backward pass:
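For example (optimizer as in the training loop above):

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()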

2. Use torch.no_grad() for inference:
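For example (model and test_inputs are placeholders):

    with torch.no_grad():
        predictions = model(test_inputs)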

3. Detach when needed:
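For example (encoder_output is a placeholder for any intermediate result you want to treat as a constant):

    target_value = encoder_output.detach()   # use the value, skip the backprop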

4. Check for NaN gradients:
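One way to do it (a minimal sketch):

    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")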

5. Use gradient accumulation for large batches:
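A common pattern (a sketch, reusing the training-loop placeholders):

    accumulation_steps = 4   # effective batch = 4 x the dataloader batch size

    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(dataloader):
        loss = criterion(model(inputs), targets) / accumulation_steps
        loss.backward()                          # gradients accumulate
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()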

6. Save memory with checkpointing:
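For example (expensive_block is a placeholder for any module or function):

    from torch.utils.checkpoint import checkpoint

    # Activations inside the block are recomputed during backward
    # instead of being stored, trading compute for memory
    out = checkpoint(expensive_block, x)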

Common Issues

Issue 1: "RuntimeError: Trying to backward through the graph a second time"

Cause: The computational graph is freed after the first backward() call.

Solution: Use retain_graph=True
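For example:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2

    y.backward(retain_graph=True)   # keep the graph alive for another pass
    y.backward()                    # second call works; x.grad accumulates to 8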

Or rebuild the graph:
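Continuing the snippet above - recompute the forward pass whenever you need another backward:

    for _ in range(2):
        y = x ** 2      # fresh graph each iteration
        y.backward()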

Issue 2: "RuntimeError: element 0 of tensors does not require grad"

Cause: Tensor doesn't have requires_grad=True.

Solution:
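Set it when the tensor is created:

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)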

Or:
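Enable it in place on an existing tensor:

    x.requires_grad_(True)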

Issue 3: Gradients are None

Cause: The tensor is not a leaf (only leaf tensors get .grad populated by default), or an operation in the chain isn't differentiable.

Solution: Check is_leaf and ensure operations are differentiable.
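A quick diagnostic (a minimal sketch):

    print(x.is_leaf)         # .grad is only populated on leaf tensors by default
    print(x.requires_grad)   # must be True for autograd to track the tensor
    # For a non-leaf tensor, call x.retain_grad() before backward() if you need x.grad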

What's Next?

You now understand autograd - the foundation of training neural networks. In Part 3, we'll use this to build actual neural networks with torch.nn.

Next: Part 3 - Building Neural Networks with torch.nn


Previous: Part 1 - Introduction to PyTorch and Tensors

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
