Build Track · B4

Gradient descent visualiser

Build a 2D loss surface, derive its gradient by hand, and write the gradient-descent update rule yourself. Then watch the optimiser walk: drop a starting point on the surface, run the loop, and plot the trajectory over a contour map. Sweep the learning rate to see crawling, converging, and diverging, and toggle momentum to see it smooth the path through a valley.

Phase: 2, after L19
Time: ~2 to 2.5 hours
Tier: 1 (any laptop, CPU)
Tooling: numpy and matplotlib
Status: optional (depth-by-choice)

Where this sits: B4 attaches to L19 (Gradients and optimisation landscapes). It is optional and not required to continue. It closes Phase 2's optimisation thread the way B3 closed the representation thread: by making the lesson physical.

Before you start: B4 needs Python with numpy and matplotlib. If you do not have them, see Installing packages. New to running a script? Python setup and Running Python cover it, and Reading errors helps when something throws. The numpy and plotting one-liners you need are in Python basics and the Python cheatsheet. There is no data file and no model to download: the surface is math, so the whole build is self-contained.

Summary

You build a 2D loss surface, derive its gradient, and write the gradient-descent update yourself. Then you run the optimiser from a fixed start and plot the path it takes over a contour map of the surface. You sweep the learning rate to watch a small rate crawl, a mid rate converge, and a large rate diverge, and you turn momentum on and off to see it cut a straighter path through a narrow valley. The surface is a few lines of numpy, the optimiser is your code, and the plot is where the behaviour becomes obvious.

Learning goals

Feel that a gradient is the slope vector at a point, and descent is "step the opposite way, repeat".
See the learning rate decide everything: too small crawls, too large overshoots or diverges, just right converges.
Watch momentum smooth the zigzag through a narrow valley.
See the gap between what the optimiser knows (the local slope at one point) and what you see (the whole surface in the plot).
Recognise divergence as a numerical event (loss climbing, values going to inf or NaN), not a mystery.

Prerequisites

L19 (gradients, descent, learning rate, minima and saddle points).
B3 for numpy and matplotlib comfort. Depth-by-choice: skipping this does not block conceptual progress.

Estimated time

About 2 to 2.5 hours: roughly half an hour on the surface and its gradient, an hour on the update rule and the trajectory plot, and the rest on the learning-rate and momentum sweeps.

Deliverables

descent.py: the surface, the hand-written gradient, the update rule, the trajectory loop, and the contour plot.
One or two plot artefacts: a trajectory over the contours, and a learning-rate comparison.
README.md: the update rule in your own words, the learning rate that converged, and what momentum changed.

Suggested file structure

builds/B4/
  descent.py        # surface + gradient + step + loop + contour/trajectory plot
  trajectory.png    # the optimiser path over the contours
  lr_sweep.png      # trajectories or loss curves for several learning rates
  README.md         # update rule, the lr that worked, what momentum did

One script is plenty. There is no data file: unlike B3, B4 is fully self-contained because the surface is a function.

Step-by-step instructions

1. Define a surface. Start with an elongated bowl, f(x, y) = 0.5 * (A*x**2 + B*y**2), with A and B far apart (say 1 and 20). Evaluate it on a grid with np.meshgrid and draw a contour plot so you can see the valley.
2. Write the gradient. Derive the slope by hand. For this bowl it is [A*x, B*y]. Sanity-check that it points uphill, away from the minimum at the origin.
3. Write the update rule. step(p, v, lr, momentum): compute the gradient, update the velocity, step downhill, and return the new point and velocity.
4. Run the loop. From a fixed start, record every point into a trajectory. Plot the trajectory over the contours.
5. Sweep the learning rate. Try 0.001, 0.01, 0.1, 0.5. Plot the trajectories or the loss-versus-step curves side by side. Find the one that crawls, the one that converges, and the one that diverges. With B = 20, plain descent on this bowl is stable only for rates below 2/B = 0.1, so 0.1 sits right on the edge; add a value such as 0.05 if you want a run that settles quickly.
6. Toggle momentum. Run the same start with momentum 0 and 0.9 on the elongated bowl. Compare the paths.
7. Guard against divergence. Stop the loop if the loss goes non-finite or climbs without bound, and report it rather than plotting garbage.
8. Record a table. Learning rate, momentum, converged yes or no, steps to a loss threshold, final loss.
9. Optional saddle. Swap in the saddle surface below, start just off-centre, and watch the optimiser slide away from the flat point. Keep this short.
10. Write the README.

Starter skeleton

Two functions carry the optimiser and are left for you to write. Everything else is scaffolding. Writing gradient and step yourself is the milestone.

import numpy as np
import matplotlib.pyplot as plt

A, B = 1.0, 20.0                       # an elongated bowl: a narrow valley along x

def f(p):                              # scaffolding: the surface
    x, y = p
    return 0.5 * (A * x**2 + B * y**2)

def gradient(p):
    # TODO (you write this): the slope vector at p.
    # for this bowl the gradient is [A*x, B*y]
    ...

def step(p, v, lr, momentum):
    # TODO (you write this): one optimiser update.
    #   velocity = momentum * v - lr * gradient(p)
    #   new_p    = p + velocity
    #   return new_p, velocity        (momentum = 0 gives plain gradient descent)
    ...

def run(start, lr, steps=200, momentum=0.0):   # scaffolding: the loop + recording
    p = np.array(start, dtype=float)
    v = np.zeros(2)
    traj = [p.copy()]
    for _ in range(steps):
        p, v = step(p, v, lr, momentum)
        traj.append(p.copy())
        if not np.isfinite(f(p)):              # divergence guard
            print("diverged at step", len(traj) - 1)   # traj also holds the start point
            break
    return np.array(traj)

def plot_contours(traj):                       # scaffolding: meshgrid + contour + path
    xs = np.linspace(-3, 3, 200)
    ys = np.linspace(-3, 3, 200)
    X, Y = np.meshgrid(xs, ys)
    Z = 0.5 * (A * X**2 + B * Y**2)
    plt.contour(X, Y, Z, levels=30)
    plt.plot(traj[:, 0], traj[:, 1], marker=".")
    plt.xlabel("x"); plt.ylabel("y"); plt.title("descent trajectory")
    plt.savefig("trajectory.png")
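
One way to sanity-check a hand-derived gradient before running the loop is to compare it against a central-difference estimate. The sketch below is a minimal version under assumptions: numerical_gradient and check_gradient are hypothetical helper names that are not part of the skeleton, and the demo uses a stand-in function g with a known slope so that the bowl's gradient stays yours to write.

```python
import numpy as np

def numerical_gradient(func, p, eps=1e-6):
    """Central-difference estimate of the slope of func at p (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        bump = np.zeros_like(p)
        bump[i] = eps                      # nudge one coordinate at a time
        grad[i] = (func(p + bump) - func(p - bump)) / (2 * eps)
    return grad

def check_gradient(func, grad_func, p, tol=1e-4):
    """Return True if grad_func agrees with the finite-difference slope at p."""
    return bool(np.allclose(grad_func(p), numerical_gradient(func, p), atol=tol))

# Stand-in surface with a known gradient, used only to exercise the check.
def g(p):
    x, y = p
    return x**2 + 3.0 * x * y

def grad_g(p):
    x, y = p
    return np.array([2 * x + 3.0 * y, 3.0 * x])

def wrong_grad_g(p):                       # sign flipped on purpose
    return -grad_g(p)

point = np.array([1.5, -0.5])
print(check_gradient(g, grad_g, point))        # a correct gradient should pass
print(check_gradient(g, wrong_grad_g, point))  # a sign error should fail
```

Once gradient is filled in, the equivalent call on the skeleton would be check_gradient(f, gradient, np.array([1.0, 1.0])). The check only confirms the slope at the points you try, so it is worth testing a few.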

For the optional saddle demonstration, swap in a second surface and its gradient, and point run and step at them in place of f and gradient. Keep it brief. It demonstrates a saddle point and is not the main convergence exercise:

# a saddle: down in y, up in x. start just off-centre and watch it slide away.
def f_saddle(p):
    x, y = p
    return x**2 - y**2

def grad_saddle(p):
    x, y = p
    return np.array([2*x, -2*y])     # flat at the origin, but y still goes downhill
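
The skeleton's plot_contours writes the bowl formula out by hand for Z, so the saddle needs its own grid values. Because f_saddle only unpacks x, y = p, it also accepts a pair of grid arrays. The sketch below assumes the same grid range as the skeleton and stops short of plotting:

```python
import numpy as np

def f_saddle(p):
    x, y = p
    return x**2 - y**2

# Same grid as plot_contours in the skeleton.
xs = np.linspace(-3, 3, 200)
ys = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(xs, ys)

# Unpacking works on a tuple of arrays, so the surface is evaluated
# on the whole grid in one call.
Z = f_saddle((X, Y))

print(Z.shape)   # (200, 200)
```

Passing this Z to the existing plt.contour(X, Y, Z, levels=30) call should then draw the saddle's contours instead of the bowl's.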

Expected output

Your exact numbers depend on the surface constants and the start point; these are illustrative, not targets:

Trajectory. The path curves downhill into the minimum on the contour plot.
Learning-rate sweep. A small rate barely moves, a mid rate settles cleanly, a large rate overshoots or blows up.
Momentum. On the elongated bowl, vanilla descent zigzags across the valley while momentum cuts a straighter, faster path.
Divergence. Above some learning rate the loss climbs to inf or NaN and the guard stops the loop.

A results table makes the sweep legible:

lr      momentum   converged       steps   final loss
0.001   0.0        no (crawl)      200     4.83
0.02    0.0        yes             138     0.001
0.02    0.9        yes             41      0.000
0.20    0.0        no (diverged)   7       inf
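
A row of that table can be computed from a recorded trajectory. The sketch below is one possible way to do it, not the required one: summarise is a hypothetical helper, the 1e-3 threshold is an arbitrary choice, and the demo path is synthetic (the start point halved each step) rather than the output of an optimiser.

```python
import numpy as np

A, B = 1.0, 20.0

def f(p):                                   # same bowl as the skeleton
    x, y = p
    return 0.5 * (A * x**2 + B * y**2)

def summarise(traj, threshold=1e-3):
    """Reduce a trajectory to (converged, steps_to_threshold, final_loss).

    traj is assumed to be the (steps + 1, 2) array that run() returns,
    with the start point in row 0.
    """
    losses = np.array([f(p) for p in traj])
    final = float(losses[-1])
    below = np.nonzero(losses < threshold)[0]   # steps whose loss is under the threshold
    if not np.isfinite(final) or below.size == 0:
        return False, None, final
    return True, int(below[0]), final

# Synthetic stand-in path, NOT produced by gradient descent.
fake_traj = np.array([[-2.5, 1.0]]) * (0.5 ** np.arange(20))[:, None]

converged, steps, final = summarise(fake_traj)
print(converged, steps)   # True 7
```

In the build itself you would call summarise(run(start, lr, momentum=m)) for each setting and print the results next to lr and m.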

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.

COMPLETE The gradient and the update rule are your own code, the trajectory plot is sensible, the learning-rate sweep shows crawl, convergence, and divergence, momentum visibly changes the path, and you can explain why the step subtracts the gradient and why too large a rate diverges.

RUNS-NOT-UNDERSTOOD It runs, but you cannot yet say why the step negates the gradient, or you treat the contour plot as something the optimiser can see. Re-read L19 and trace one step by hand. Do not mark COMPLETE.

TOOL-LOCKED The gradient or the descent came from an optimisation library (PyTorch autograd, scipy.optimize) rather than your numpy. The milestone is to build the optimiser, so rewrite it as your own code before marking complete. Such libraries belong only in the optional extensions.

INCOMPLETE Unfinished, or the optimiser diverges and the cause is not found yet. A valid resting state for a depth-by-choice track. Come back to it.

Common pitfalls

These are conceptual traps, distinct from code symptoms.

Stepping the wrong way. The gradient points uphill. The step must subtract it. Add it and the optimiser climbs.
Reading the plot as the optimiser's knowledge. You see the whole surface. The optimiser only ever feels the local slope at its current point.
Expecting one learning rate to work everywhere. The right rate depends on the surface. The bowl's two axes even prefer different rates, which is the point of the valley.
Thinking momentum is free. On a simple bowl too much momentum overshoots and oscillates.
Confusing convergence with the global minimum. On a multi-minimum surface, the start point decides which basin you land in.
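
The claim that the two axes prefer different rates can be checked with arithmetic. Assuming plain descent (momentum 0) and the bowl's gradient [A*x, B*y], each step multiplies x by (1 - lr*A) and y by (1 - lr*B). The sketch below only tabulates those multipliers; axis_factors and verdict are hypothetical names, and 0.05 is an extra rate added for contrast.

```python
import math

A, B = 1.0, 20.0          # bowl constants from the skeleton

def axis_factors(lr):
    """Per-step multipliers on x and y for plain descent (momentum = 0) on the bowl."""
    return 1.0 - lr * A, 1.0 - lr * B

def verdict(lr):
    """Classify a rate by the larger of the two multipliers in absolute value."""
    fx, fy = axis_factors(lr)
    worst = max(abs(fx), abs(fy))
    if math.isclose(worst, 1.0):
        return "oscillates without shrinking"
    if worst > 1.0:
        return "diverges"
    return "shrinks"

for lr in (0.001, 0.01, 0.05, 0.1, 0.5):
    fx, fy = axis_factors(lr)
    print(f"lr={lr:<6} x factor={fx:+.3f}  y factor={fy:+.3f}  {verdict(lr)}")
```

A multiplier close to 1 shrinks so slowly that it reads as a crawl, and a negative one flips sign every step, which is the zigzag across the valley. The steep y axis caps the rate at 2/B = 0.1, while the shallow x axis would be happy with a much larger one.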

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

Symptom	Likely cause
trajectory shoots to inf or NaN	Learning rate too large, or the gradient sign is wrong. Lower the rate and check the step subtracts the gradient.
the path climbs uphill	The step adds the gradient instead of subtracting it.
the path barely moves	Rate far too small, or the gradient is miscomputed as near-zero.
contour and trajectory do not line up	x and y swapped between the surface grid and the trajectory plot.
momentum path oscillates forever	Momentum too high for the learning rate. Lower one of them.
the contour plot is empty	Forgot np.meshgrid, or evaluated the surface on 1D arrays instead of the grid.
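
The skeleton's guard fires only once the loss is already inf or NaN, which can take many steps after a run has clearly blown up. A stricter check is sketched below under assumptions: has_diverged is a hypothetical helper, the blow-up factor of 1e6 is an arbitrary choice, and the losses in the demo are synthetic.

```python
import numpy as np

def has_diverged(loss, start_loss, blowup=1e6):
    """True if the loss is non-finite or far above where the run started.

    The blowup factor is an arbitrary ceiling, not a standard value.
    """
    if not np.isfinite(loss):
        return True
    return bool(loss > blowup * max(start_loss, 1e-12))

# Synthetic losses (not from a real run): each step multiplies the loss by 81,
# the growth you get when one coordinate is multiplied by -9 per step.
losses = 13.125 * 81.0 ** np.arange(10)

first = next(k for k, loss in enumerate(losses) if has_diverged(loss, losses[0]))
print("flagged at step", first)   # flagged at step 4
```

A loose ceiling is used instead of "stop as soon as the loss rises" because a momentum run can climb briefly after an overshoot and still converge.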

Optional extensions

A harder surface. Add Rosenbrock (a curved banana valley) or Himmelblau (four equal minima, where the start point decides the outcome). Their gradients are fiddly, so write them out carefully or use the numerical gradient below.
Numerical gradient. Approximate the slope by finite differences (central difference) so any surface plugs in without hand-derivation. Pure numpy, no autograd.
Other update rules. Add Nesterov momentum, RMSProp, or Adam and compare their paths against plain descent. Implement them by hand, not from a library.
Animation. Animate the descent step by step instead of drawing the whole path at once.
Basin map. Vary the start point across a grid and colour each by which minimum it reaches.
Loss curves. Plot loss versus step for several learning rates on one axis.
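
For the harder-surface extension, the two functions can be written in the same f(p) shape as the skeleton. The sketch below gives the surfaces only, using the usual textbook constants for Rosenbrock (a = 1, b = 100); the names f_rosenbrock and f_himmelblau are illustrative, and the gradients are left for you to derive or approximate.

```python
def f_rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock's curved valley; with a=1, b=100 the minimum is at (1, 1)."""
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def f_himmelblau(p):
    """Himmelblau's function; four minima of equal depth, one of them at (3, 2)."""
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

print(f_rosenbrock((1.0, 1.0)))   # 0.0
print(f_himmelblau((3.0, 2.0)))   # 0.0
```

Both surfaces are much steeper than the bowl away from their minima, so expect to need smaller learning rates and a different contour range.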

Why this build exists: L19 says training is walking downhill on a loss surface, and that the optimiser only feels the local slope while the 1D cartoon makes minima look like traps. B4 makes that physical: you derive the slope, write the downhill step, and watch the path. The learning-rate sweep turns "step size matters" from a sentence into a plot where one setting crawls and another explodes. And the contour view shows the central honesty of L19: you see the whole surface, the optimiser never does. It also sets the pattern the later training builds reuse: sweep a hyperparameter, record a results table, plot the comparison, and watch for divergence.
