Build Track · B4

Gradient descent visualiser

Build a 2D loss surface, derive its gradient by hand, and write the gradient-descent update rule yourself. Then watch the optimiser walk: drop a starting point on the surface, run the loop, and plot the trajectory over a contour map. Sweep the learning rate to see crawling, converging, and diverging, and toggle momentum to see it smooth the path through a valley.

Phase: 2, after L19
Time: ~2 to 2.5 hours
Tier: 1 (any laptop, CPU)
Tooling: numpy and matplotlib
Status: optional (depth-by-choice)

Where this sits: B4 attaches to L19 (Gradients and optimisation landscapes). It is optional and not required to continue. It closes Phase 2's optimisation thread the way B3 closed the representation thread: by making the lesson physical.

Before you start: B4 needs Python with numpy and matplotlib. If you do not have them, see Installing packages. New to running a script? Python setup and Running Python cover it, and Reading errors helps when something throws. The numpy and plotting one-liners you need are in Python basics and the Python cheatsheet. There is no data file and no model to download: the surface is math, so the whole build is self-contained.

Summary

You build a 2D loss surface, derive its gradient, and write the gradient-descent update yourself. Then you run the optimiser from a fixed start and plot the path it takes over a contour map of the surface. You sweep the learning rate to watch a small rate crawl, a mid rate converge, and a large rate diverge, and you turn momentum on and off to see it cut a straighter path through a narrow valley. The surface is a few lines of numpy, the optimiser is your code, and the plot is where the behaviour becomes obvious.

Learning goals

Prerequisites

Estimated time

About 2 to 2.5 hours: roughly half an hour on the surface and its gradient, an hour on the update rule and the trajectory plot, and the rest on the learning-rate and momentum sweeps.

Deliverables

Suggested file structure

builds/B4/
  descent.py        # surface + gradient + step + loop + contour/trajectory plot
  trajectory.png    # the optimiser path over the contours
  lr_sweep.png      # trajectories or loss curves for several learning rates
  README.md         # update rule, the lr that worked, what momentum did

One script is plenty. There is no data file: unlike B3, B4 is fully self-contained because the surface is a function.

Step-by-step instructions

  1. Define a surface. Start with an elongated bowl, f(x, y) = 0.5 * (A*x**2 + B*y**2) with A and B far apart (say 1 and 20). Evaluate it on a grid with np.meshgrid and draw a contour plot so you can see the valley.
  2. Write the gradient. Derive the slope by hand. For this bowl it is [A*x, B*y]. Sanity-check it points away from the minimum.
  3. Write the update rule. step(p, v, lr, momentum): compute the gradient, update the velocity, step downhill, return the new point and velocity.
  4. Run the loop. From a fixed start, record every point into a trajectory. Plot the trajectory over the contours.
  5. Sweep the learning rate. Try 0.001, 0.01, 0.1, 0.5. Plot the trajectories or the loss-versus-step curves side by side. Find the one that crawls, the one that converges, and the one that diverges.
  6. Toggle momentum. Run the same start with momentum 0 and 0.9 on the elongated bowl. Compare the paths.
  7. Guard against divergence. Stop the loop if the loss climbs or goes non-finite, and report it rather than plotting garbage.
  8. Record a table. Learning rate, momentum, converged yes or no, steps to a loss threshold, final loss.
  9. Optional saddle. Swap in the saddle surface below, start just off-centre, and watch the optimiser slide away from the flat point. Keep this short.
  10. Write the README.
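
The sanity check in step 2 can also be done numerically. The sketch below compares a hand-derived gradient against central finite differences. `numerical_gradient`, `check_gradient`, and the stand-in surface `g` are hypothetical names chosen for illustration, not part of the starter skeleton, and it assumes your gradient function takes a length-2 array and returns one.

```python
import numpy as np

def numerical_gradient(f, p, h=1e-5):
    """Central-difference estimate of the gradient of f at point p."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h                                   # nudge one coordinate at a time
        grad[i] = (f(p + e) - f(p - e)) / (2 * h)
    return grad

def check_gradient(f, gradient, p, tol=1e-4):
    """Return True if the analytic gradient agrees with the numerical one at p."""
    return bool(np.allclose(gradient(p), numerical_gradient(f, p), atol=tol))

# Stand-in surface for the demo (an assumption, deliberately not the B4 bowl).
def g(p):
    x, y = p
    return np.sin(x) + x * y

def grad_g(p):
    x, y = p
    return np.array([np.cos(x) + y, x])

def wrong_grad_g(p):
    return -grad_g(p)                              # deliberately sign-flipped

point = np.array([0.7, -1.2])
print(check_gradient(g, grad_g, point))            # hand-derived gradient agrees
print(check_gradient(g, wrong_grad_g, point))      # sign error is caught
```

To use it on the build, pass your own `f` and `gradient` and a few points away from the origin. The check only confirms the derivative; it does not replace deriving it by hand.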

Starter skeleton

Two functions carry the optimiser and are left for you to write. Everything else is scaffolding. Writing gradient and step yourself is the milestone.
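
For reference while you fill in the two TODOs, the derivation and one hand-traced update can be written out as below. The symbols η (learning rate) and μ (momentum) are notation introduced here, and the start point (2, 1) with η = 0.02 is an illustrative choice, not a value the build fixes.

```latex
\begin{aligned}
f(x, y) &= \tfrac{1}{2}\,(A x^2 + B y^2), \qquad
\nabla f(x, y) = \begin{bmatrix} \partial f/\partial x \\ \partial f/\partial y \end{bmatrix}
               = \begin{bmatrix} A x \\ B y \end{bmatrix} \\[4pt]
v_{t+1} &= \mu\, v_t - \eta\, \nabla f(p_t), \qquad p_{t+1} = p_t + v_{t+1} \\[4pt]
\text{with } A &= 1,\ B = 20,\ \mu = 0,\ \eta = 0.02,\ p_0 = (2,\ 1),\ v_0 = (0,\ 0): \\
\nabla f(p_0) &= (2,\ 20), \qquad
p_1 = (2 - 0.02 \cdot 2,\ \ 1 - 0.02 \cdot 20) = (1.96,\ 0.6) \\
f(p_0) &= 12, \qquad f(p_1) = \tfrac{1}{2}\,(1.96^2 + 20 \cdot 0.6^2) \approx 5.52
\end{aligned}
```

The y coordinate moves much further than x in one step because the bowl is twenty times steeper in that direction, which is what makes the valley hard for a single learning rate.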

import numpy as np
import matplotlib.pyplot as plt

A, B = 1.0, 20.0                       # an elongated bowl: a narrow valley along x

def f(p):                              # scaffolding: the surface
    x, y = p
    return 0.5 * (A * x**2 + B * y**2)

def gradient(p):
    # TODO (you write this): the slope vector at p.
    # for this bowl the gradient is [A*x, B*y]
    ...

def step(p, v, lr, momentum):
    # TODO (you write this): one optimiser update.
    #   velocity = momentum * v - lr * gradient(p)
    #   new_p    = p + velocity
    #   return new_p, velocity        (momentum = 0 gives plain gradient descent)
    ...

def run(start, lr, steps=200, momentum=0.0):   # scaffolding: the loop + recording
    p = np.array(start, dtype=float)
    v = np.zeros(2)
    traj = [p.copy()]
    for _ in range(steps):
        p, v = step(p, v, lr, momentum)
        if not np.isfinite(f(p)):              # divergence guard: stop before recording garbage
            print("diverged at step", len(traj))   # len(traj) is the index of the step that blew up
            break
        traj.append(p.copy())
    return np.array(traj)

def plot_contours(traj):                       # scaffolding: meshgrid + contour + path
    xs = np.linspace(-3, 3, 200)
    ys = np.linspace(-3, 3, 200)
    X, Y = np.meshgrid(xs, ys)
    Z = 0.5 * (A * X**2 + B * Y**2)
    plt.contour(X, Y, Z, levels=30)
    plt.plot(traj[:, 0], traj[:, 1], marker=".")
    plt.xlabel("x"); plt.ylabel("y"); plt.title("descent trajectory")
    plt.savefig("trajectory.png")
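
For the learning-rate sweep, loss-versus-step curves are often easier to compare than overlaid paths. A minimal sketch, assuming you already have one trajectory array per setting: `plot_loss_curves` and the synthetic trajectories in the demo are hypothetical, and the demo paths are made-up decays rather than real optimiser output.

```python
from pathlib import Path

import numpy as np
import matplotlib
matplotlib.use("Agg")                 # render to a file without needing a display
import matplotlib.pyplot as plt

def plot_loss_curves(runs, f, path="lr_sweep.png"):
    """Plot loss versus step for several runs on one log-scaled axis.

    runs: dict mapping a label (e.g. "lr=0.01") to a trajectory of shape (n, 2).
    f:    the surface function, called on one point at a time.
    """
    fig, ax = plt.subplots()
    for label, traj in runs.items():
        losses = np.array([f(p) for p in traj])
        ax.semilogy(losses, label=label)
    ax.set_xlabel("step")
    ax.set_ylabel("loss (log scale)")
    ax.set_title("learning-rate sweep")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)                    # avoid stacking figures across calls
    return Path(path)

# Demo on synthetic trajectories (assumed data, only to exercise the plot).
def bowl(p):
    x, y = p
    return 0.5 * (1.0 * x**2 + 20.0 * y**2)

k = np.arange(50)
fake_runs = {
    "slow (synthetic)": np.stack([2 * 0.99**k, 0.98**k], axis=1),
    "fast (synthetic)": np.stack([2 * 0.80**k, 0.50**k], axis=1),
}
out = plot_loss_curves(fake_runs, bowl, path="lr_sweep_demo.png")
```

In the build you would pass something like `{"lr=0.01": run(start, 0.01), ...}` once `gradient` and `step` are written. The log axis is a design choice: it keeps a crawling run and a fast one readable on the same plot.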

For the optional saddle demonstration, swap in a second surface and its gradient. Keep it brief; it shows a saddle point, it is not the main convergence exercise:

# a saddle: down in y, up in x. start just off-centre and watch it slide away.
def f_saddle(p):
    x, y = p
    return x**2 - y**2

def grad_saddle(p):
    x, y = p
    return np.array([2*x, -2*y])     # flat at the origin, but y still goes downhill
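
A quick way to see the slide numerically, assuming plain gradient descent (momentum 0) and an illustrative start point and learning rate that the build does not prescribe. `grad_saddle` is repeated so the block runs on its own; `descend_saddle` is a hypothetical helper.

```python
import numpy as np

def grad_saddle(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def descend_saddle(start, lr=0.1, steps=50):
    """Plain gradient descent on f(x, y) = x**2 - y**2, returning every point."""
    p = np.array(start, dtype=float)
    traj = [p.copy()]
    for _ in range(steps):
        # Each step multiplies x by (1 - 2*lr) and y by (1 + 2*lr).
        p = p - lr * grad_saddle(p)
        traj.append(p.copy())
    return np.array(traj)

traj = descend_saddle(start=(1.0, 0.01))   # assumed start: just off the x-axis
print("first point:", traj[0])
print("last point: ", traj[-1])
```

The x coordinate collapses toward zero while the small initial y offset grows every step, so the path first heads for the flat point and then slides away along y. A start with y exactly 0 would never leave the x-axis, which is why the instructions say to start just off-centre.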

Expected output

Your exact numbers depend on the surface constants and the start point; these are illustrative, not targets:

A results table makes the sweep legible:

lr      momentum   converged       steps   final loss
0.001   0.0        no (crawl)      200     4.83
0.02    0.0        yes             138     0.001
0.02    0.9        yes             41      0.000
0.20    0.0        no (diverged)   7       inf
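
One way to fill a table like this is to reduce each recorded trajectory to its row values. The sketch below is an assumption about how you might do that: `summarise`, the default threshold of 1e-3, and the synthetic demo trajectory are all hypothetical.

```python
import numpy as np

def summarise(traj, f, threshold=1e-3):
    """Reduce one trajectory to (converged, steps_to_threshold, final_loss).

    steps_to_threshold is None when the loss never drops to the threshold.
    """
    losses = np.array([f(p) for p in traj])
    below = np.nonzero(losses <= threshold)[0]       # indices where the loss is small enough
    steps_to_threshold = int(below[0]) if below.size else None
    final_loss = float(losses[-1])
    converged = bool(np.isfinite(final_loss) and final_loss <= threshold)
    return converged, steps_to_threshold, final_loss

# Demo on a synthetic decaying path (assumed data, not real optimiser output).
def bowl(p):
    x, y = p
    return 0.5 * (1.0 * x**2 + 20.0 * y**2)

k = np.arange(100)
fake_traj = np.stack([2 * 0.9**k, 0.5**k], axis=1)
converged, steps, final_loss = summarise(fake_traj, bowl)
print(f"converged={converged}  steps={steps}  final loss={final_loss:.3g}")
```

Calling it on the output of `run` for each (lr, momentum) pair gives one row per setting. A run that was cut short by the divergence guard will report `converged=False` because its last recorded loss is still above the threshold.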

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.

COMPLETE: The gradient and the update rule are your own code, the trajectory plot is sensible, the learning-rate sweep shows crawl, convergence, and divergence, momentum visibly changes the path, and you can explain why the step subtracts the gradient and why too large a rate diverges.
RUNS-NOT-UNDERSTOOD: It runs, but you cannot yet say why the step negates the gradient, or you treat the contour plot as something the optimiser can see. Re-read L19 and trace one step by hand. Do not mark COMPLETE.
TOOL-LOCKED: The gradient or the descent came from an optimisation library (PyTorch autograd, scipy.optimize) rather than your numpy. The milestone is to build the optimiser, so reframe it to your own code before marking complete. Such libraries belong only in the optional extensions.
INCOMPLETE: Unfinished, or the optimiser diverges and the cause is not found yet. A valid resting state for a depth-by-choice track. Come back to it.
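
The "why too large a rate diverges" part of the COMPLETE criterion can be checked with arithmetic. For plain gradient descent (momentum 0) on the quadratic bowl, each step multiplies x by (1 - lr*A) and y by (1 - lr*B), so a coordinate shrinks only while the absolute value of its factor stays below 1. The names `contraction_factors` and `verdict` are hypothetical; A, B, and the learning rates are the ones the build suggests.

```python
import math

A, B = 1.0, 20.0

def contraction_factors(lr, a=A, b=B):
    """Per-step multipliers on |x| and |y| for plain gradient descent on the bowl."""
    return abs(1 - lr * a), abs(1 - lr * b)

def verdict(lr):
    """Classify a learning rate by its worst per-axis factor (momentum 0 only)."""
    worst = max(contraction_factors(lr))
    if math.isclose(worst, 1.0):
        return "boundary"          # that axis flips sign each step without shrinking
    return "diverges" if worst > 1.0 else "converges"

for lr in (0.001, 0.01, 0.1, 0.5):
    fx, fy = contraction_factors(lr)
    print(f"lr={lr:<6} x factor={fx:.3f}  y factor={fy:.3f}  {verdict(lr)}")
```

With B = 20 the threshold is lr = 2/B = 0.1, so in the suggested sweep 0.5 diverges and 0.1 sits exactly on the boundary, where y bounces between plus and minus its starting value. A factor just under 1, such as 0.999 for x at lr = 0.001, still counts as converging here but is the crawl you see in the plot. Momentum changes these thresholds, so the arithmetic applies only to momentum 0.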

Common pitfalls

These are conceptual traps, distinct from code symptoms.

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

Symptom                                 Likely cause
trajectory shoots to inf or NaN         Learning rate too large, or the gradient sign is wrong. Lower the rate and check the step subtracts the gradient.
the path climbs uphill                  The step adds the gradient instead of subtracting it.
the path barely moves                   Rate far too small, or the gradient is miscomputed as near-zero.
contour and trajectory do not line up   x and y swapped between the surface grid and the trajectory plot.
momentum path oscillates forever        Momentum too high for the learning rate. Lower one of them.
the contour plot is empty               Forgot meshgrid, or evaluated the surface on 1D arrays instead of the grid.
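
The last two kinds of symptom, swapped axes and an empty contour plot, usually come down to array shapes. A small check, using deliberately unequal grid sizes (an illustrative choice) so that a swapped axis shows up as a shape mismatch:

```python
import numpy as np

A, B = 1.0, 20.0
xs = np.linspace(-3, 3, 5)        # 5 points along x
ys = np.linspace(-2, 2, 3)        # 3 points along y

X, Y = np.meshgrid(xs, ys)        # default indexing="xy": rows follow y, columns follow x
Z = 0.5 * (A * X**2 + B * Y**2)

print(X.shape, Y.shape, Z.shape)  # each is (3, 5), i.e. (len(ys), len(xs))

# Z[i, j] is the surface at (xs[j], ys[i]), so the row index is y, not x.
i, j = 2, 0
print(Z[i, j] == 0.5 * (A * xs[j]**2 + B * ys[i]**2))

# Evaluating on a 1D array gives a 1D result, which is not a surface to contour.
Z_1d = 0.5 * (A * xs**2 + B * xs**2)
print(Z_1d.shape)                 # (5,)
```

If the trajectory is stored as columns (x, y), plotting `traj[:, 0]` against `traj[:, 1]` matches this grid convention.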

Optional extensions

Why this build exists: L19 says that training is walking downhill on a loss surface, that the optimiser only feels the local slope, and that the 1D cartoon makes minima look like traps. B4 makes that physical: you derive the slope, write the downhill step, and watch the path. The learning-rate sweep turns "step size matters" from a sentence into a plot where one setting crawls and another explodes. The contour view shows the central honesty of L19: you see the whole surface, the optimiser never does. The build also sets the pattern the later training builds reuse: sweep a hyperparameter, record a results table, plot the comparison, and watch for divergence.