Part 4: Training and Optimization

Part of the PyTorch 101 Series

My First Training Loop

I trained my first model on a Tuesday afternoon. Left it running, checked the next morning: 89% accuracy!

Felt like magic.

Then on real data: 23% accuracy. Worse than random guessing (25% for 4 classes).

What went wrong? Everything about training:

  • Wrong learning rate (too high)

  • Wrong optimizer (basic SGD)

  • No learning rate scheduling

  • Data loading bottleneck

  • No validation split

Spent a week debugging. Now I know training is 80% of the work. The model architecture is easy - training it well is hard.

Let me share what I learned.

The Training Loop

Every PyTorch training loop follows this pattern:
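
A minimal sketch of that pattern; model, train_loader, criterion, and num_epochs are assumed to be defined elsewhere:

    import torch

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()                # 1. clear old gradients
            outputs = model(inputs)              # 2. forward pass
            loss = criterion(outputs, targets)   # 3. compute the loss
            loss.backward()                      # 4. backward pass
            optimizer.step()                     # 5. update the weights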

This is the foundation. Everything else builds on it.

Complete Training Example

Here's my actual training code (simplified):
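
A condensed sketch along those lines; the hyperparameters and the assumption of a classification loss are placeholders, not a fixed recipe:

    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, num_epochs=10, lr=1e-3, device="cuda"):
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        for epoch in range(num_epochs):
            # --- training ---
            model.train()
            train_loss = 0.0
            for inputs, targets in train_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                loss = criterion(model(inputs), targets)
                loss.backward()
                optimizer.step()
                train_loss += loss.item()

            # --- validation ---
            model.eval()
            val_loss, correct, total = 0.0, 0, 0
            with torch.no_grad():
                for inputs, targets in val_loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = model(inputs)
                    val_loss += criterion(outputs, targets).item()
                    correct += (outputs.argmax(dim=1) == targets).sum().item()
                    total += targets.size(0)

            print(f"epoch {epoch + 1}: "
                  f"train_loss={train_loss / len(train_loader):.4f} "
                  f"val_loss={val_loss / len(val_loader):.4f} "
                  f"val_acc={correct / total:.4f}")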

This is production-ready code I use in all my projects.

Data Loading

Critical for training speed. Bad data loading = GPU sitting idle.

Creating a Dataset
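
A Dataset only needs __len__ and __getitem__. A minimal example wrapping two in-memory tensors (the class name and the random data are just for illustration):

    import torch
    from torch.utils.data import Dataset

    class TensorPairDataset(Dataset):
        """Wraps feature and label tensors so a DataLoader can index them."""

        def __init__(self, features, labels):
            self.features = features
            self.labels = labels

        def __len__(self):
            return len(self.features)

        def __getitem__(self, idx):
            return self.features[idx], self.labels[idx]

    dataset = TensorPairDataset(torch.randn(1000, 20), torch.randint(0, 4, (1000,)))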

DataLoader
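
A typical setup; the batch size and worker count are starting points, not magic numbers:

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,       # reshuffle every epoch for training
        num_workers=4,      # parallel workers keep the GPU fed
        pin_memory=True,    # faster host-to-GPU transfers
    )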

I learned the hard way: num_workers=0 (single process) made my training 5x slower.

Real Image Dataset
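
A sketch of a folder-based image dataset using PIL and torchvision transforms; the directory layout (one subfolder per class, JPEG files) and the transform choices are assumptions:

    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms

    class ImageFolderDataset(Dataset):
        """Loads images from class-named subfolders, e.g. data/cat/001.jpg."""

        def __init__(self, root, transform=None):
            self.paths = sorted(Path(root).glob("*/*.jpg"))
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
            self.transform = transform or transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ])

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            path = self.paths[idx]
            image = self.transform(Image.open(path).convert("RGB"))
            label = self.class_to_idx[path.parent.name]
            return image, label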

Optimizers

The optimizer updates model weights. Choice matters!

SGD (Stochastic Gradient Descent)

Basic but effective with momentum.
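
Typical settings (model is whatever network you're training; the exact values depend on it):

    import torch

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.9,        # momentum smooths updates and speeds convergence
        weight_decay=1e-4,   # L2 regularization
    )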

Adam

My default choice. Adaptive learning rates, works well most of the time.
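
Usually I start with the defaults:

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # betas default to (0.9, 0.999)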

AdamW

Decouples weight decay from the gradient update, so regularization behaves more predictably than Adam's L2 penalty. I use this for transformers.
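
Typical settings; the learning rate and decay value here are common starting points, not requirements:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        weight_decay=0.01,   # decoupled decay, applied directly to the weights
    )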

Comparing Optimizers

Rule of thumb:

  • CNNs: SGD with momentum or Adam

  • Transformers: AdamW

  • RNNs: Adam or RMSprop

  • When in doubt: Adam

Learning Rate Scheduling

Fixed learning rate rarely works best. I use schedulers in every project.

Step Decay

Reduces the LR by a factor of gamma every step_size epochs.
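
For example, dropping the LR by 10x every 10 epochs:

    from torch.optim.lr_scheduler import StepLR

    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

    for epoch in range(num_epochs):
        # ... train one epoch ...
        scheduler.step()   # advance the schedule once per epoch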

Cosine Annealing

My favorite for long training. Smooth decay, good results.
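
One line to set up; step it once per epoch like any other scheduler:

    from torch.optim.lr_scheduler import CosineAnnealingLR

    # LR decays smoothly from its initial value down to eta_min over T_max epochs
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)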

Reduce on Plateau

Reduces LR when validation loss stops improving. I use this when I don't know the optimal schedule.
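
The scheduler watches a metric you pass in, so it needs the validation loss each epoch:

    from torch.optim.lr_scheduler import ReduceLROnPlateau

    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)

    for epoch in range(num_epochs):
        # ... train, then compute val_loss ...
        scheduler.step(val_loss)   # pass the monitored metric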

One Cycle Policy

Fast training with super-convergence. Got me 30% faster training on image classification.
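
Unlike the others, OneCycleLR is stepped every batch, so it needs to know the number of steps per epoch:

    from torch.optim.lr_scheduler import OneCycleLR

    scheduler = OneCycleLR(
        optimizer,
        max_lr=0.01,
        epochs=num_epochs,
        steps_per_epoch=len(train_loader),
    )

    for epoch in range(num_epochs):
        for inputs, targets in train_loader:
            # ... forward, backward, optimizer.step() ...
            scheduler.step()   # step per batch, not per epoch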

Warm-up Schedule

I use warm-up for large models; it prevents instability early in training.
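
One simple way to do it is a LambdaLR that ramps the LR linearly over the first few epochs; the 5-epoch ramp here is an assumption:

    from torch.optim.lr_scheduler import LambdaLR

    warmup_epochs = 5

    def warmup(epoch):
        # linear ramp from 1/warmup_epochs up to 1.0, then hold
        return min(1.0, (epoch + 1) / warmup_epochs)

    scheduler = LambdaLR(optimizer, lr_lambda=warmup)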

Handling Common Training Issues

1. Exploding Gradients

Gradient clipping is essential for RNNs/LSTMs.
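
Clip between backward() and step():

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # cap the gradient norm
    optimizer.step()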

2. Class Imbalance
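
Two common remedies are a class-weighted loss and a weighted sampler; the class counts below are placeholders, and labels is assumed to be a tensor of targets for the whole dataset:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # Option 1: weight the loss by inverse class frequency
    class_counts = torch.tensor([900.0, 60.0, 25.0, 15.0])
    class_weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    # Option 2: oversample rare classes with a weighted sampler
    sample_weights = class_weights[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
    train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)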

3. Overfitting
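
The usual levers are weight decay, dropout, data augmentation, and early stopping. Two of the cheapest ones:

    import torch
    import torch.nn as nn

    # weight decay in the optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

    # dropout inside the model
    classifier = nn.Sequential(
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),   # randomly zero half the activations during training
        nn.Linear(256, 4),
    )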

4. Slow Training
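
Common fixes are more DataLoader workers, larger batches, and mixed precision. A sketch of the mixed precision route, assuming a CUDA GPU:

    import torch

    scaler = torch.cuda.amp.GradScaler()

    for inputs, targets in train_loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():        # run the forward pass in float16 where safe
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()          # scale the loss to avoid float16 underflow
        scaler.step(optimizer)
        scaler.update()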

Got me 2-3x speedup on my image classifier with minimal code changes.

Production Training Pipeline

Here's my complete production setup:
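
A condensed sketch of that setup; the function name, hyperparameters, and 80/20 split are placeholders, not a fixed API:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, random_split

    def train_model(model, dataset, num_epochs=30, batch_size=64, lr=1e-3,
                    device="cuda", checkpoint_path="best_model.pt"):
        # 80/20 train/val split
        val_size = int(0.2 * len(dataset))
        train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])
        train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                                  num_workers=4, pin_memory=True)
        val_loader = DataLoader(val_set, batch_size=batch_size,
                                num_workers=4, pin_memory=True)

        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
        scaler = torch.cuda.amp.GradScaler()
        best_val_loss = float("inf")

        for epoch in range(num_epochs):
            # training: mixed precision + gradient clipping
            model.train()
            for inputs, targets in train_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                with torch.cuda.amp.autocast():
                    loss = criterion(model(inputs), targets)
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)   # unscale so clipping sees real gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                scaler.step(optimizer)
                scaler.update()
            scheduler.step()

            # validation
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for inputs, targets in val_loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    val_loss += criterion(model(inputs), targets).item()
            val_loss /= len(val_loader)

            # keep the best checkpoint
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save({"epoch": epoch,
                            "model_state_dict": model.state_dict(),
                            "optimizer_state_dict": optimizer.state_dict(),
                            "val_loss": val_loss}, checkpoint_path)
            print(f"epoch {epoch + 1}: val_loss={val_loss:.4f}")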

This is what I use. Handles everything:

  • Mixed precision training

  • Gradient clipping

  • Learning rate scheduling

  • Checkpointing

  • Proper train/val split

Monitoring Training

TensorBoard
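
Log scalars with SummaryWriter; the tags and log directory are up to you:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/experiment1")

    for epoch in range(num_epochs):
        # ... training ...
        writer.add_scalar("Loss/train", train_loss, epoch)
        writer.add_scalar("Loss/val", val_loss, epoch)
        writer.add_scalar("LR", optimizer.param_groups[0]["lr"], epoch)

    writer.close()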

Run TensorBoard from the terminal with tensorboard --logdir runs, then open http://localhost:6006 in a browser.

Weights & Biases

I use Weights & Biases for all experiments. Great for tracking hyperparameters and comparing runs.
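
The core of it is init, log, and finish; the project name and config values here are placeholders:

    import wandb

    wandb.init(project="pytorch-101",
               config={"lr": 1e-3, "batch_size": 64, "optimizer": "AdamW"})

    for epoch in range(num_epochs):
        # ... training ...
        wandb.log({"train_loss": train_loss, "val_loss": val_loss, "epoch": epoch})

    wandb.finish()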

Best Practices

From training hundreds of models:

1. Always use validation set:
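
For example, holding out 20% with random_split:

    from torch.utils.data import random_split

    val_size = int(0.2 * len(dataset))
    train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])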

2. Monitor multiple metrics:
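
Loss alone can hide problems. Accuracy shown here; the same pattern works for F1, AUC, and so on:

    import torch

    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    print(f"val_acc={correct / total:.4f}  val_loss={val_loss:.4f}")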

3. Save checkpoints:
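
Save enough state to resume, not just the weights:

    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "val_loss": val_loss,
    }, "checkpoint.pt")

    # resume later
    checkpoint = torch.load("checkpoint.pt")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])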

4. Use reproducible seeds:
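
A small helper along these lines, called once at the start of a run, covers the common sources of randomness (full determinism needs more, e.g. cuDNN flags):

    import random

    import numpy as np
    import torch

    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    set_seed(42)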

5. Profile training:
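
torch.profiler shows where the time actually goes; a sketch profiling a single batch (assumes the model is on the GPU):

    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        inputs, targets = next(iter(train_loader))
        loss = criterion(model(inputs.cuda()), targets.cuda())
        loss.backward()

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))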

What I Learned

Training is an iterative process:

  1. Start with reasonable defaults (Adam, lr=0.001)

  2. Train a few epochs

  3. Check for issues (overfitting, underfitting, slow convergence)

  4. Adjust (learning rate, regularization, data augmentation)

  5. Repeat

No single recipe works for everything. But the patterns above solve 90% of problems.

What's Next?

You now know how to train models effectively. In Part 5, we'll learn how to deploy these models to production.

Next: Part 5 - Production Deployment and Best Practices


Previous: Part 3 - Building Neural Networks with torch.nn

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
