Part 2: Autograd and Automatic Differentiation

Part of the PyTorch 101 Series

The Night I Understood Gradients

I spent three days debugging a neural network. The loss wouldn't decrease. I'd made a sign error in the gradient calculation - a minus instead of a plus on line 47 of my 200-line hand-written backward pass.

That's when I tried PyTorch's autograd. One line: loss.backward(). All gradients computed correctly, automatically.

That bug taught me: never manually calculate gradients again.

Autograd is PyTorch's superpower. Let me show you how it works.

What is Automatic Differentiation?

Automatic differentiation (autograd) computes derivatives for you by tracking the operations performed on tensors.

Why it matters:

  • Neural networks learn by computing gradients

  • Manual gradient calculation is error-prone

  • Autograd handles complex architectures automatically

The magic: Every operation on tensors with requires_grad=True is tracked in a computational graph.

Basic Autograd

Simple Example
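A minimal sketch of what this could look like (the function matches the manual check below):

    import torch

    # Leaf tensor with gradient tracking enabled
    x = torch.tensor(2.0, requires_grad=True)

    # y = x^2 + 3x + 1
    y = x**2 + 3*x + 1

    # Populates x.grad with dy/dx
    y.backward()

    print(x.grad)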

Output:
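    tensor(7.)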

Verify manually: y = x² + 3x + 1, dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓

Multiple Operations
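Here's a sketch that lines up with the trace below (the intermediate names a, b, c are my own):

    import torch

    x = torch.tensor(3.0, requires_grad=True)

    a = x * 2      # 6
    b = a + 5      # 11
    c = b ** 2     # 121
    y = c          # 121

    y.backward()
    print(x.grad)  # tensor(44.)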

Under the hood:

  • x = 3

  • a = 3 * 2 = 6

  • b = 6 + 5 = 11

  • c = 11² = 121

  • y = 121

Gradient (chain rule):

  • dy/dc = 1

  • dc/db = 2b = 22

  • db/da = 1

  • da/dx = 2

  • dy/dx = 1 * 22 * 1 * 2 = 44

Perfect!

Computational Graphs

When you perform operations on tensors, PyTorch builds a computational graph.

Graph structure:
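For the multiple-operations example above, it's a simple chain (a rough sketch; each non-leaf tensor stores a grad_fn pointing at the operation that created it):

    x ──mul──> a ──add──> b ──pow──> c (= y)

    x.grad_fn   # None (leaf)
    a.grad_fn   # <MulBackward0>
    b.grad_fn   # <AddBackward0>
    c.grad_fn   # <PowBackward0>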

Gradients flow backward through this graph using chain rule.

Leaf vs Non-Leaf Tensors

Leaf tensors: created directly by the user (with requires_grad=True).
Non-leaf tensors: results of operations on other tensors.
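A quick way to tell them apart (a minimal sketch):

    import torch

    x = torch.tensor(2.0, requires_grad=True)   # leaf: created by the user
    y = x * 3                                    # non-leaf: result of an operation

    print(x.is_leaf, x.grad_fn)   # True None
    print(y.is_leaf, y.grad_fn)   # False <MulBackward0 object at 0x...>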

Gradient Accumulation

By default, gradients accumulate in .grad - each backward() call adds to whatever is already stored rather than overwriting it.
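For example (a minimal sketch) - calling backward() twice on the same computation adds into x.grad:

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    y = x ** 2
    y.backward()
    print(x.grad)   # tensor(4.)

    y = x ** 2      # rebuild the graph
    y.backward()
    print(x.grad)   # tensor(8.) - accumulated, not overwritten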

In training loops, always zero gradients:
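Typically via the optimizer (a minimal sketch; for a bare tensor you can call x.grad.zero_() instead):

    optimizer.zero_grad()   # clear gradients from the previous iteration
    loss.backward()         # compute fresh gradients
    optimizer.step()        # update parameters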

Controlling Gradient Computation

Detach from Graph

Use case: When you want to use a value but not backpropagate through it.
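A minimal sketch of detach() - z shares y's value but is treated as a constant by autograd:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2

    z = y.detach()
    print(z.requires_grad)   # False

    loss = y * z             # gradients flow through y, but z is a constant
    loss.backward()
    print(x.grad)            # tensor(16.) = z * dy/dx = 4 * 4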

No Grad Context

I use no_grad for inference - it saves memory and speeds up computation.
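A minimal sketch:

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    with torch.no_grad():
        y = x * 2            # no graph is recorded inside this block

    print(y.requires_grad)   # False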

Inference Mode

Difference: inference_mode disables more of the autograd machinery than no_grad - it's faster, but tensors created inside it can't be used later in computations that need gradients.
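Usage looks the same as no_grad (a minimal sketch):

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    with torch.inference_mode():
        y = x * 2

    print(y.requires_grad)   # False
    # Unlike no_grad, tensors created here can't be used later in
    # computations that need gradients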

Gradient for Non-Scalar Outputs

When the output is not a scalar, backward() needs an explicit gradient argument: a tensor of the same shape as the output that weights each element.
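A minimal sketch of passing an explicit gradient:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = x ** 2                                # vector output

    y.backward(gradient=torch.ones_like(y))   # weight every element equally
    print(x.grad)                             # tensor([2., 4., 6.])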

Common pattern: Sum output to scalar
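Continuing the snippet above:

    x.grad = None        # reset the gradient from the previous example
    y = x ** 2
    y.sum().backward()   # the sum is a scalar, so no gradient argument is needed
    print(x.grad)        # tensor([2., 4., 6.]) - same result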

Real Example: Custom Loss Function

I built a custom loss function for a recommendation system that needed to balance accuracy and diversity.
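The production code isn't reproduced here, but the shape of it was roughly this - the function name and the variance-based diversity term are illustrative, not the real system:

    import torch
    import torch.nn.functional as F

    def accuracy_diversity_loss(scores, targets, diversity_weight=0.1):
        # Accuracy: how close predicted scores are to the targets
        accuracy_term = F.mse_loss(scores, targets)
        # Diversity: penalize recommendations whose scores all look the same
        diversity_term = -scores.var()
        return accuracy_term + diversity_weight * diversity_term

    scores = torch.rand(8, requires_grad=True)
    targets = torch.rand(8)

    loss = accuracy_diversity_loss(scores, targets)
    loss.backward()              # autograd handles the whole expression
    print(scores.grad.shape)     # torch.Size([8])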

That's the power of autograd - complex custom losses just work.

Higher-Order Gradients

Computing gradients of gradients (second derivatives):
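A minimal sketch using torch.autograd.grad with create_graph=True:

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    y = x ** 3                   # dy/dx = 3x^2, d2y/dx2 = 6x

    # First derivative - keep the graph so we can differentiate again
    (first,) = torch.autograd.grad(y, x, create_graph=True)
    print(first)                 # 27.0 at x = 3

    # Second derivative
    (second,) = torch.autograd.grad(first, x)
    print(second)                # 18.0 at x = 3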

When I use this: Implementing physics-informed neural networks (PINNs) that enforce differential equations.

Custom Autograd Functions

For operations PyTorch doesn't support or for special backward passes:
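A minimal sketch of the torch.autograd.Function pattern - the custom exp here is just a stand-in for whatever operation you need:

    import torch

    class MyExp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            result = torch.exp(x)
            ctx.save_for_backward(result)    # stash what backward() will need
            return result

        @staticmethod
        def backward(ctx, grad_output):
            (result,) = ctx.saved_tensors
            return grad_output * result      # d/dx exp(x) = exp(x)

    x = torch.tensor(1.0, requires_grad=True)
    y = MyExp.apply(x)
    y.backward()
    print(x.grad)   # tensor(2.7183) = e^1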

Gradient Checking

Verify custom gradients with numerical approximation:
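A minimal sketch with torch.autograd.gradcheck (double precision keeps the numerical comparison stable):

    import torch

    def f(x):
        return (x ** 2).sum()

    x = torch.randn(5, dtype=torch.double, requires_grad=True)

    # Compares analytical gradients against finite-difference estimates
    print(torch.autograd.gradcheck(f, (x,)))   # True if they agree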

I use gradient checking when implementing custom layers to verify correctness.

Common Patterns

Training Loop with Autograd
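A minimal sketch (the model, loss, and data here are stand-ins; dataloader is assumed to yield (inputs, targets) batches):

    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.MSELoss()

    for inputs, targets in dataloader:
        optimizer.zero_grad()                       # 1. clear old gradients
        loss = criterion(model(inputs), targets)
        loss.backward()                             # 2. compute gradients
        optimizer.step()                            # 3. update parameters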

Three critical steps:

  1. zero_grad() - Clear gradients

  2. backward() - Compute gradients

  3. step() - Update parameters

Gradient Clipping

Prevent exploding gradients:
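It goes between backward() and step() (a sketch, reusing the training-loop placeholders):

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()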

I use gradient clipping for RNNs and transformers to stabilize training.

Autograd Profiler

Find performance bottlenecks:
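A minimal sketch with torch.autograd.profiler (the model and input sizes are arbitrary):

    import torch

    model = torch.nn.Linear(100, 10)
    inputs = torch.randn(32, 100)

    with torch.autograd.profiler.profile() as prof:
        model(inputs).sum().backward()

    # Most expensive operations first
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))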

Best Practices

From my experience:

1. Always zero gradients before backward pass:
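For example (optimizer as in the training loop above):

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()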

2. Use torch.no_grad() for inference:
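For example (model and test_inputs are placeholders):

    with torch.no_grad():
        predictions = model(test_inputs)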

3. Detach when needed:
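For example (encoder_output is a placeholder for any intermediate result you want to treat as a constant):

    target_value = encoder_output.detach()   # use the value, skip the backprop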

4. Check for NaN gradients:
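One way to do it (a minimal sketch):

    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")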

5. Use gradient accumulation for large batches:
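A common pattern (a sketch, reusing the training-loop placeholders):

    accumulation_steps = 4   # effective batch = 4 x the dataloader batch size

    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(dataloader):
        loss = criterion(model(inputs), targets) / accumulation_steps
        loss.backward()                          # gradients accumulate
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()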

6. Save memory with checkpointing:
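For example (expensive_block is a placeholder for any module or function):

    from torch.utils.checkpoint import checkpoint

    # Activations inside the block are recomputed during backward
    # instead of being stored, trading compute for memory
    out = checkpoint(expensive_block, x)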

Common Issues

Issue 1: "RuntimeError: Trying to backward through the graph a second time"

Cause: The computational graph is freed after the first backward() call.

Solution: Use retain_graph=True
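For example:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2

    y.backward(retain_graph=True)   # keep the graph alive for another pass
    y.backward()                    # second call works; x.grad accumulates to 8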

Or rebuild the graph:
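Continuing the snippet above - recompute the forward pass whenever you need another backward:

    for _ in range(2):
        y = x ** 2      # fresh graph each iteration
        y.backward()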

Issue 2: "RuntimeError: element 0 of tensors does not require grad"

Cause: Tensor doesn't have requires_grad=True.

Solution:
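Set it when the tensor is created:

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)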

Or:
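Enable it in place on an existing tensor:

    x.requires_grad_(True)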

Issue 3: Gradients are None

Cause: The tensor is not a leaf (only leaf tensors get .grad populated by default), or an operation in the chain isn't differentiable.

Solution: Check is_leaf and ensure operations are differentiable.
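A quick diagnostic (a minimal sketch):

    print(x.is_leaf)         # .grad is only populated on leaf tensors by default
    print(x.requires_grad)   # must be True for autograd to track the tensor
    # For a non-leaf tensor, call x.retain_grad() before backward() if you need x.grad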

What's Next?

You now understand autograd - the foundation of training neural networks. In Part 3, we'll use this to build actual neural networks with torch.nn.

Next: Part 3 - Building Neural Networks with torch.nn


Previous: Part 1 - Introduction to PyTorch and Tensors

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
