Part 3: Calculus and Optimization

Part of the Mathematics for Programming 101 Series

The Debugging Session That Taught Me Calculus

I was training a neural network for image classification. Loss was decreasing, then suddenly spiking to infinity.

Epoch 1: loss = 2.305
Epoch 2: loss = 1.823
Epoch 3: loss = 1.456
Epoch 4: loss = 1.203
Epoch 5: loss = inf

Stack Overflow said: "Lower your learning rate."

I tried 0.01, 0.001, 0.0001. Still exploded. Sometimes earlier, sometimes later, but always the same infinity.

Then I actually looked at the gradients. They were growing exponentially through the layers. The problem wasn't the learning rate; it was gradient explosion in deep networks.

The solution: Gradient clipping, better initialization, and residual connections.

But here's the key: I only understood the problem when I understood the calculus. Derivatives weren't just theory; they were the diagnostic tool I needed.

What is Calculus in Programming?

Calculus is the mathematics of change. In programming:

  • Derivatives = Rate of change = Slope = Gradient

  • Integrals = Accumulation = Total change over time

You use calculus when:

  • Training machine learning models

  • Optimizing any function

  • Understanding algorithm behavior

  • Simulating physics or motion

  • Analyzing trends and patterns

Derivatives: Measuring Change

The Intuition

A derivative measures: "If I change the input slightly, how much does the output change?"

Key insight: The derivative at a point is the slope of the tangent line at that point.
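
You can see this numerically by shrinking the step between two nearby points. A minimal sketch (the function, step size, and names are my own illustrative choices):

def numerical_derivative(f, x, h=1e-5):
    # Slope of a tiny secant line centered at x (central difference)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2                    # f(x) = x^2, so f'(x) = 2x
print(numerical_derivative(f, 3.0))     # ~6.0: the slope of the tangent at x = 3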

Common Derivatives
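
These come up constantly in ML code (standard results, listed for reference):

d/dx [c]            = 0
d/dx [x^n]          = n · x^(n-1)        (power rule)
d/dx [e^x]          = e^x
d/dx [ln x]         = 1/x
d/dx [sin x]        = cos x
d/dx [cos x]        = -sin x
d/dx [c · f(x)]     = c · f'(x)
d/dx [f(x) + g(x)]  = f'(x) + g'(x)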

Chain Rule: The Key to Deep Learning

The chain rule is how backpropagation works.

If you have composed functions, h(x) = f(g(x)), then:

h'(x) = f'(g(x)) · g'(x)

In words: the derivative of the outer function evaluated at the inner function, times the derivative of the inner function.

Why this matters for neural networks: a deep network is a composition of many functions, one per layer, so backpropagation computes the gradient of the loss with respect to early-layer weights as a product of per-layer derivatives. If those factors are consistently larger than 1, the product explodes (the gradient explosion from the story above); if they're consistently smaller than 1, it shrinks toward zero (vanishing gradients).
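
Here's a small sketch of that product effect (the layer count and derivative values are made up for illustration):

# Pretend each of 20 layers contributes a local derivative of 1.5 (or 0.5).
# The chain rule multiplies them all together.
def gradient_through_layers(local_derivative, n_layers=20):
    grad = 1.0
    for _ in range(n_layers):
        grad *= local_derivative
    return grad

print(gradient_through_layers(1.5))   # ~3325: exploding
print(gradient_through_layers(0.5))   # ~9.5e-07: vanishing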

Partial Derivatives and Gradients

When a function has multiple inputs, you need partial derivatives: the rate of change with respect to one input while the others are held fixed. Collect all of them into a vector and you have the gradient.

Gradient = direction of steepest ascent
Negative gradient = direction of steepest descent (the direction you follow when minimizing)
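
A sketch of a numerical gradient for a two-input function (the function and step size are illustrative):

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Partial derivative for each coordinate: nudge one input, hold the rest fixed
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

f = lambda v: v[0] ** 2 + 3 * v[1] ** 2               # f(x, y) = x^2 + 3y^2
print(numerical_gradient(f, np.array([1.0, 2.0])))    # ~[2. 12.]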

Gradient Descent: Optimizing Functions

The algorithm that powers ML training:
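
At its core it's one line repeated: step against the gradient, scaled by a learning rate. A minimal sketch with an illustrative function and hyperparameters:

def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)   # step downhill
    return x

# Minimize f(x) = (x - 3)^2, whose derivative is 2(x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # ~3.0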

Multidimensional Gradient Descent
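
The same loop works unchanged in any number of dimensions once the parameters and gradient are vectors. A sketch on an illustrative quadratic bowl:

import numpy as np

def gradient_descent_nd(grad_f, x0, learning_rate=0.1, steps=200):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= learning_rate * grad_f(x)   # vector update: every coordinate at once
    return x

# Minimize f(x, y) = x^2 + 3y^2; gradient is [2x, 6y], minimum at (0, 0)
grad_f = lambda v: np.array([2 * v[0], 6 * v[1]])
print(gradient_descent_nd(grad_f, [4.0, -2.0]))   # ~[0, 0]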

Real Example: Linear Regression from Scratch
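
Here's one way it can look: fit y ≈ w·x + b by descending the gradient of the mean squared error. The synthetic data, learning rate, and iteration count below are illustrative choices:

import numpy as np

# Synthetic data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 0.5, size=100)

w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    # Gradients of mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # close to the true slope 2 and intercept 1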

Backpropagation: Chain Rule in Action

Building a simple neural network with backpropagation:
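
A minimal sketch: one hidden layer with sigmoid activations, trained on XOR with a squared-error loss. The layer sizes, learning rate, and data are illustrative choices, not from any particular framework:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# XOR: not linearly separable, so we need the hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (2, 8))   # input -> hidden
b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1))   # hidden -> output
b2 = np.zeros((1, 1))
lr = 1.0

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule, layer by layer
    d_out = (out - y) * out * (1 - out)    # dLoss/d(output pre-activation)
    d_h = (d_out @ W2.T) * h * (1 - h)     # pushed back through W2 and the sigmoid

    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # typically converges toward [0, 1, 1, 0]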

Optimization Algorithms

Momentum

Problem: Gradient descent oscillates in ravines. Solution: Accumulate velocity from past gradients.
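
A sketch of the update (the beta value and test function are illustrative):

import numpy as np

def gradient_descent_momentum(grad_f, x0, learning_rate=0.05, beta=0.9, steps=200):
    x = np.array(x0, dtype=float)
    velocity = np.zeros_like(x)
    for _ in range(steps):
        velocity = beta * velocity + grad_f(x)   # accumulate past gradients
        x -= learning_rate * velocity            # step along the smoothed direction
    return x

# A "ravine": very steep in y, shallow in x
grad_f = lambda v: np.array([0.2 * v[0], 20 * v[1]])
print(gradient_descent_momentum(grad_f, [5.0, 5.0]))   # ~[0, 0]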

Adam Optimizer

Combines momentum and adaptive learning rates:
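
A sketch of the update; the defaults below are the commonly published ones (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), and the test function is illustrative:

import numpy as np

def adam(grad_f, x0, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)   # first moment: running mean of gradients (momentum)
    v = np.zeros_like(x)   # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad_f = lambda p: np.array([0.2 * p[0], 20 * p[1]])   # same ravine as above
print(adam(grad_f, [5.0, 5.0]))   # approaches the minimum at (0, 0)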

Debugging with Calculus

Gradient Checking

Verify backpropagation implementation:
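
A sketch of the check: compare each analytic partial derivative against a centered finite difference and look at the relative error. The thresholds in the comment are rules of thumb, not hard guarantees:

import numpy as np

def gradient_check(f, analytic_grad, x, h=1e-5):
    numeric = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        numeric[i] = (f(x + step) - f(x - step)) / (2 * h)
    analytic = analytic_grad(x)
    # Relative error: larger than ~1e-4 usually means a bug in the backward pass
    return np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))

f = lambda v: v[0] ** 2 + np.sin(v[1])
grad = lambda v: np.array([2 * v[0], np.cos(v[1])])
print(gradient_check(f, grad, np.array([1.5, 0.7])))   # tiny (~1e-10): implementations agree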

Diagnosing Training Issues
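
A few symptoms map directly onto calculus diagnoses: loss shooting to inf or NaN usually means exploding gradients (as in the story at the top); loss frozen near its initial value often points to vanishing gradients or a learning rate that's too small; loss bouncing around without settling suggests a learning rate that's too large. Below is a sketch of the kind of per-layer gradient monitoring and clipping that reveals and contains explosions; the layer structure and numbers are illustrative:

import numpy as np

def report_gradient_norms(gradients):
    # One norm per layer: growth from layer to layer = exploding gradients,
    # decay toward zero = vanishing gradients
    for i, g in enumerate(gradients):
        print(f"layer {i}: grad norm = {np.linalg.norm(g):.3e}")

def clip_gradients(gradients, max_norm=1.0):
    # Rescale the whole gradient if its global norm exceeds max_norm
    total = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in gradients]

# Illustrative gradients that grow by 10x per layer
grads = [np.ones(4) * 10.0 ** i for i in range(5)]
report_gradient_norms(grads)
print([np.linalg.norm(g) for g in clip_gradients(grads)])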

Key Takeaways

  • Derivatives measure rates of change, the foundation of optimization

  • Chain rule enables backpropagation in neural networks

  • Gradients point in direction of steepest ascent (negative for descent)

  • Gradient descent iteratively minimizes functions

  • Advanced optimizers (momentum, Adam) converge faster and more reliably

  • Gradient checking verifies backpropagation implementations

  • Understanding calculus helps debug training issues

What's Next

In the next article, we'll explore probability and statistics: the mathematics behind uncertainty, decision-making, A/B testing, and recommendation systems.

You'll learn:

  • Probability distributions in practice

  • Bayesian thinking for better decisions

  • Statistical testing that actually works

  • Building anomaly detection systems

Continue to Part 4: Probability and Statistics →

