Lesson 19. Ninth station on the whiteboard wall. ~24 min read + cards + retrieval. Durability tier 1 (bedrock; training is optimisation, and optimisation is following the slope downhill).
⛰️
Memory palace · Whiteboard wall · station 19
The slope and valley: a mountainous landscape drawn in contour lines, past the fog machine from L18. A glowing ball sits high on a slope; at every step an arrow points down the steepest descent, and the ball rolls until it settles in a valley. Keep the mapping: gradient = slope; optimisation = following the slope downhill; training = repeated downhill movement.
Core idea. Training is optimisation: a repeated process of measuring how wrong a model is and nudging it in the direction that reduces the error. The slope of the error, the gradient, says which way is uphill; the model steps the opposite way, downhill, and repeats. There is no magic step in the loop.
Why this lesson exists
L18 ended on a clean statement of a model's job: lower the surprise it faces on real data, the cross-entropy between its distribution and the world. That tells you what to aim for. It doesn't tell you how a model gets there.
A fresh model predicts badly. Its distribution is wrong, so the error is high. To improve, it has to change its parameters in a way that lowers that error, then do it again, and again. The process that does this has a name, and it's the same process whether the model has a thousand parameters or a trillion: optimisation. This lesson is the mechanism behind the word "learning."
The loss function
Start with the thing being minimised. A loss function is a single number that measures how wrong the model currently is on the data it's shown. High loss means badly wrong; low loss means close. Training is the search for parameters that make the loss small.
The number is concrete in every setting. For a classifier, the loss is high when it puts weight on the wrong class and low when it's confident and correct. For next-token prediction, the loss is exactly the cross-entropy from L18: high when the model is surprised by the real next token, low when it expected it. For a weather model, the loss measures how far its predicted probabilities sat from what actually happened. Different tasks, same idea: one number for "how wrong," and the goal is to push it down.
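To make "one number for how wrong" concrete, here is a minimal sketch of the cross-entropy loss for a single classification. The three prediction lists are hypothetical values chosen for illustration, and the loss is reported in bits.

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Surprise, in bits, at the outcome that actually happened."""
    return -math.log2(predicted_probs[true_index])

# Hypothetical three-class predictions; the true class is index 0 in each case.
confident_right = [0.90, 0.05, 0.05]
unsure = [0.40, 0.30, 0.30]
confident_wrong = [0.05, 0.90, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    # The loss only looks at the probability given to the true class.
    print(f"{name:>20}: loss = {cross_entropy(probs, 0):.3f} bits")
```

The loss climbs as the model moves weight away from the true class, which is the pattern described above: low when confident and correct, high when confident and wrong.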
Hold one distinction from the start, because it's the one people blur. The loss is the height (a single value, how wrong you are). The gradient, coming next, is the slope (a direction, which way the height rises fastest). They are different objects: one is a number, the other is a direction.
The mountain analogy
Picture every possible setting of the model's parameters as a position on a landscape, and the loss at that setting as the height of the ground there. A model that's badly wrong sits high on a mountainside; a model that predicts well sits low in a valley. Changing the parameters moves you across the landscape; lowering the loss means moving downhill.
This is the optimisation landscape, and training is a walk across it toward lower ground. The catch is that the walker can't see the landscape. It can't look down from above and spot the lowest valley. All it can feel is the slope right under its feet, here, now. Everything optimisation does has to be built from that purely local information.
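A one-parameter landscape can be written out directly. The sketch below assumes a toy model `y = w * x`, a made-up dataset, and mean squared error as the loss; none of these come from the lesson, they are only there to show "position" and "height" as actual numbers.

```python
# Hypothetical data: each output is exactly 2 * input, so the best w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def loss(w):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Each value of w is a position; the loss there is the height of the ground.
for w in [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w = {w:+.1f}  height (loss) = {loss(w):7.3f}")
```

Printing the whole table is the view from above. With billions of parameters that table can't be built, which is why the walker has to rely on the local slope instead.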
flowchart LR
M["model current parameters"]:::vec --> P["prediction"]:::kept --> E["error (loss) how wrong"]:::warn --> U[["update step downhill"]]:::model --> M2["better model"]:::good
U -.->|repeat| M
classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef kept fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef warn fill:#1d2230,stroke:#f87171,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 19.1. The training loop. The model predicts, the loss measures how wrong it was, the update nudges the parameters downhill, and the loop repeats. Each pass produces a slightly better model. This is the same input-representation-output-feedback loop from L1, with the feedback now pointed at the parameters.
What a gradient actually is
Standing somewhere on the landscape, the gradient answers one question: if I take a tiny step, which direction increases the loss fastest? It points in the steepest uphill direction, and its size says how steep the climb is. On flat ground the gradient is near zero; on a sharp slope it's large.
That's all you need, because the direction that increases loss fastest, reversed, is the direction that decreases it fastest. The gradient points uphill; step the opposite way and you go downhill as steeply as the landscape allows. No view from above is required. The local slope, negated, is the best available guess at "which way is down."
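The claim that the negated gradient is the downhill direction can be checked numerically. This is a sketch on a hypothetical two-parameter loss, with the slope estimated by finite differences (probing a tiny step either side); real training computes gradients differently, but the meaning of the result is the same.

```python
def loss(w1, w2):
    """A hypothetical bowl-shaped loss with its lowest point at (3, -1)."""
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def numerical_gradient(f, w1, w2, eps=1e-6):
    """Estimate the slope along each parameter from tiny steps either side."""
    d1 = (f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps)
    d2 = (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps)
    return d1, d2

w1, w2 = 0.0, 0.0
g1, g2 = numerical_gradient(loss, w1, w2)

step = 0.01  # small step, so the local slope is still a good guide
here = loss(w1, w2)
uphill = loss(w1 + step * g1, w2 + step * g2)    # step along the gradient
downhill = loss(w1 - step * g1, w2 - step * g2)  # step against the gradient

print(f"gradient = ({g1:.3f}, {g2:.3f})")
print(f"loss here {here:.4f}, along gradient {uphill:.4f}, against gradient {downhill:.4f}")
```

Stepping along the gradient raises the loss and stepping against it lowers the loss, using nothing but measurements taken at the current position.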
flowchart TB
H["high loss badly wrong"]:::warn --> Mday["medium loss"]:::model --> L["low loss predicts well"]:::good
classDef warn fill:#1d2230,stroke:#f87171,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 19.2. The optimisation landscape, top to bottom. Training moves from high loss toward low loss. The gradient at each height is what tells the model which way that "down" actually is.
Gradient descent
Put those together and you get the algorithm that trains almost every modern model. It's called gradient descent, and the loop is short: at your current position, compute the gradient, take a step in the opposite (downhill) direction, and repeat from the new position.
flowchart LR
P["current position parameters"]:::vec --> G[["compute gradient (uphill slope)"]]:::model --> S[["step downhill (opposite the gradient)"]]:::model --> N["new position lower loss"]:::good
N -.->|repeat| P
classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 19.3. Gradient descent. Measure the slope, step the opposite way, move to a lower point, and do it again. Thousands or millions of these small steps are what "training" actually consists of.
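The loop in FIG 19.3 fits in a few lines. The sketch below runs it on a hypothetical one-parameter loss whose valley sits at `w = 3`; the names `gradient_descent` and `step_size`, and the fixed step of 0.1, are illustrative choices rather than anything the lesson prescribes.

```python
def loss(w):
    """Hypothetical loss: a single valley with its floor at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w; positive means uphill is to the right."""
    return 2.0 * (w - 3.0)

def gradient_descent(start, step_size, steps):
    """Repeat: measure the slope, step the opposite way, record the position."""
    w = start
    history = [w]
    for _ in range(steps):
        w = w - step_size * gradient(w)  # opposite the gradient = downhill
        history.append(w)
    return history

path = gradient_descent(start=0.0, step_size=0.1, steps=50)
for i in (0, 1, 5, 20, 50):
    print(f"step {i:2d}: w = {path[i]:.5f}, loss = {loss(path[i]):.7f}")
```

Each pass moves only a fraction of the remaining distance, which is why the loss falls quickly at first and then flattens as the slope shrinks near the valley floor.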
How big a step? That's the learning rate, and it matters more than it looks. Take one giant leap and you'll overshoot the valley, maybe land higher than you started, maybe bounce around forever. Take steps that are too small and training crawls, burning compute to move almost nowhere. Many small, well-sized steps beat one huge one, because the gradient is only trustworthy as a local guess: it tells you which way is down right here, not how far down stays down. Real training schedules usually start with larger steps and shrink them as the model nears a valley, for exactly this reason.
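The three regimes can be seen side by side on a toy loss. This is a sketch with hypothetical learning rates picked to exaggerate each case on the loss `w ** 2`; the thresholds depend entirely on the shape of the loss, so the specific numbers carry no meaning beyond this example.

```python
def loss(w):
    """Hypothetical loss: one valley with its floor at w = 0."""
    return w ** 2

def gradient(w):
    return 2.0 * w

def final_loss(learning_rate, start=5.0, steps=20):
    """Run a fixed number of downhill steps and report where the loss ends up."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return loss(w)

start_loss = loss(5.0)
too_large = final_loss(1.1)    # each step overshoots and lands higher than before
too_small = final_loss(0.001)  # each step barely moves
well_sized = final_loss(0.3)   # each step closes most of the remaining gap

print(f"start      : {start_loss:.6g}")
print(f"too large  : {too_large:.6g}")
print(f"too small  : {too_small:.6g}")
print(f"well sized : {well_sized:.6g}")
```

With the same number of steps, the oversized rate ends above where it started, the undersized rate has barely moved, and the middle one is essentially at the valley floor.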
FIG 19.4. The loss curve and the downhill walk. The ball starts high (random initial parameters) and steps downhill, settling in the nearest valley, here a local minimum rather than the global one. The gradient (green) points uphill; the step (amber) goes the opposite way. The one-dimensional picture makes local minima look like a trap, which is the historical worry the next section corrects.
Why training takes so long
The loop is simple. The cost comes from its size. A frontier model has hundreds of billions of parameters, so the landscape it walks has hundreds of billions of dimensions, and the gradient is a direction in every one of them at once. Computing that gradient means running data through the model and back; doing it well means averaging over enormous datasets, often trillions of tokens. And one step barely moves you, so training takes hundreds of thousands to millions of steps.
Multiply those together and the bill is enormous: a single frontier training run can occupy thousands of accelerators for weeks (mid-2020s scale). Training is expensive because optimisation is expensive, and optimisation is expensive because the landscape is vast, the data is large, and progress comes one small step at a time. This is the cost that the rest of the phase, parallelism and compute scaling, exists to manage.
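A back-of-envelope version of "multiply those together" is sketched below. Every input is an illustrative assumption rather than a measurement of any real run: the parameter and token counts are picked from the ranges the text gives, the factor of 6 operations per parameter per token is a widely used rule of thumb for dense models (not something this lesson derives), and the sustained throughput per accelerator is hypothetical.

```python
# Every number below is an illustrative assumption, not a measurement.
parameters = 300e9              # "hundreds of billions" of parameters
tokens = 5e12                   # "trillions" of training tokens
ops_per_param_per_token = 6     # rule-of-thumb cost of a forward and backward pass
accelerators = 4000             # "thousands" of accelerators
sustained_ops_per_second = 4e14 # hypothetical sustained rate per accelerator

total_ops = ops_per_param_per_token * parameters * tokens
seconds = total_ops / (accelerators * sustained_ops_per_second)
weeks = seconds / (7 * 24 * 3600)

print(f"total arithmetic : {total_ops:.2e} operations")
print(f"wall-clock time  : {weeks:.1f} weeks on {accelerators} accelerators")
```

Under these assumptions the estimate lands in the range of weeks on thousands of accelerators, the same order of magnitude the text describes. Changing any one input scales the answer in proportion.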
Local minima and saddle points
The one-dimensional curve above raises an old fear. If the ball rolls into a valley that isn't the lowest one, it gets stuck: every direction out is uphill, so the gradient says stop, even though a better solution exists elsewhere. A point like that, lower than everything nearby but not the lowest overall, is a local minimum. The lowest point on the whole landscape is the global minimum. In one dimension, local minima look like a serious trap.
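The one-dimensional trap is easy to reproduce. The sketch below uses a hypothetical curve with two valleys, `x**4 - 3*x**2 + x`, and runs the same downhill walk from two different starting points; the function, the step size, and the step count are all illustrative choices.

```python
def loss(x):
    """Hypothetical 1D loss with two valleys of different depth."""
    return x ** 4 - 3.0 * x ** 2 + x

def gradient(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def descend(start, step_size=0.01, steps=2000):
    """Plain gradient descent: follow the local slope downhill from start."""
    x = start
    for _ in range(steps):
        x = x - step_size * gradient(x)
    return x

from_right = descend(2.0)   # rolls into the shallower valley
from_left = descend(-2.0)   # rolls into the deeper valley

print(f"start at +2: settles at x = {from_right:.4f}, loss = {loss(from_right):.4f}")
print(f"start at -2: settles at x = {from_left:.4f}, loss = {loss(from_left):.4f}")
```

Both walks stop where the slope vanishes, but only the one that started on the left reaches the deeper valley. The walker on the right has no local signal that a lower point exists.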
For years this was thought to be the central obstacle to training big models. The modern picture is more reassuring, and the reason is dimensionality. In a landscape with billions of dimensions, for a point to be a local minimum, every single one of those billions of directions has to go uphill at once. That's rare. Far more common is a saddle point: a place where the slope is flat but the ground goes down in some directions and up in others. A saddle isn't a trap, because a downhill direction still exists; optimisation just has to find it, and techniques like momentum help it slide off.
Two more things make high-dimensional optimisation work better than the 1D picture suggests. The local minima that do exist tend to sit at losses close to the global minimum, so landing in one of them is usually fine. And very large networks have many different parameter settings that all predict well, so there are many good valleys to find rather than one needle to thread. None of this means optimisation reaches the global minimum, or that the result is ever perfect. It means the realistic goal, a good solution, is usually reachable, and that bad local minima are much less of a problem than the 1D cartoon implies.
flowchart TB
S["a flat-slope point"]:::model --> Q{"every nearby direction uphill?"}:::node
Q -->|yes, all of them| LM["local minimum a real valley (rare in high-D)"]:::warn
Q -->|no, some go down| SA["saddle point down in some directions (common)"]:::good
SA -.->|find a downhill direction| OUT["keep descending"]:::good
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef node fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef warn fill:#1d2230,stroke:#f87171,color:#e6e8ee;
classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 19.5. Flat ground, two cases. Where the slope vanishes, the model has either reached a local minimum (every direction uphill, a genuine valley) or a saddle point (some directions still go down). In billions of dimensions, requiring every direction to go up at once is rare, so most flat-slope points are saddles, which optimisation can escape.
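A two-parameter toy shows the difference between flat slope and a real valley. The sketch assumes a hypothetical loss `x**2 + y**4 - y**2`, which has a saddle at the origin. The tiny offset used to leave the saddle is a stand-in for any small perturbation; it is not the momentum technique mentioned earlier.

```python
def loss(x, y):
    """Hypothetical loss: curves up along x, but dips down along y near the origin."""
    return x ** 2 + y ** 4 - y ** 2

def gradient(x, y):
    return 2.0 * x, 4.0 * y ** 3 - 2.0 * y

def descend(x, y, step_size=0.1, steps=200):
    """Plain gradient descent from (x, y)."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - step_size * gx, y - step_size * gy
    return x, y

# At the origin the slope is exactly zero, so the gradient alone says "stop".
print("gradient at origin:", gradient(0.0, 0.0))

# Probing nearby shows it is a saddle: up along x, down along y.
probe = 0.01
print("uphill along x:  ", loss(probe, 0.0) > loss(0.0, 0.0))
print("downhill along y:", loss(0.0, probe) < loss(0.0, 0.0))

stuck = descend(0.0, 0.0)     # starts exactly on the saddle and never moves
escaped = descend(0.0, 1e-3)  # starts a hair off it and slides into a valley

print("exactly on the saddle:", stuck, "loss", loss(*stuck))
print("nudged off the saddle:", escaped, "loss", loss(*escaped))
```

The point with zero slope is not a trap here, because one direction still leads down. A very small displacement is enough for ordinary descent to find it.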
From optimisation to capability
Step back and the whole chain of the phase lines up. A representation (L11 to L15) gives the model a space to compute in. The model turns an input into a prediction, a distribution over outcomes (L16, L17). The loss measures how surprised that distribution is by reality (L18). Optimisation changes the parameters to lower that loss. Repeat across enormous data, and the parameters settle into a configuration that predicts well across the whole task.
flowchart LR
R["random start untrained model"]:::warn --> U[["many updates gradient descent"]]:::model --> B["better solution low loss, predicts well"]:::good
classDef warn fill:#1d2230,stroke:#f87171,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 19.6. The training process, compressed. A randomly initialised model becomes a capable one through nothing more exotic than many downhill steps. The capability is a side effect of relentlessly lowering the loss; this is what the L1 objective function meant by "adjust the parameters to make the score better."
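The whole chain (prediction, loss, gradient, update, repeat) can be run end to end on a tiny problem. This is a sketch with a hypothetical one-parameter model that predicts the probability of heads for a coin, trained on ten made-up flips. The loss is the average cross-entropy, measured here in nats (natural log) rather than bits so that the gradient has a simple closed form.

```python
import math

# Hypothetical data: 1 = heads, 0 = tails; 7 heads out of 10 flips.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
observed_rate = sum(data) / len(data)

def predict(theta):
    """The model's probability of heads: theta squashed into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))

def loss(theta):
    """Average cross-entropy (in nats) of the model's prediction on the data."""
    p = predict(theta)
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in data) / len(data)

def gradient(theta):
    """Slope of the loss in theta; for this model it is predicted minus observed rate."""
    return predict(theta) - observed_rate

theta = -2.0  # poor starting point: the model thinks heads is unlikely
start_loss = loss(theta)
for step in range(2001):
    if step % 500 == 0:
        print(f"step {step:4d}: P(heads) = {predict(theta):.4f}, loss = {loss(theta):.4f}")
    theta -= 0.5 * gradient(theta)  # step downhill; 0.5 is an illustrative step size
```

Nothing in the loop knows about coins. The predicted probability drifts toward the observed rate only because that is where the loss is lowest.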
What looks like learning, then, is repeated optimisation. The model isn't trying, wanting, or understanding; an optimiser is computing slopes and nudging parameters, millions of times, and capable behaviour falls out of the parameters that survive the process. Reading it that way is what keeps "the model learned to do X" from turning into a mystery: the mechanism is always the same. Measure the error, find the downhill direction, take a step.
Optimisation across engineering
The same move runs far beyond neural networks. A weather model is tuned by minimising the gap between its forecasts and recorded outcomes. A recommender is fitted by minimising a loss over which items people actually engaged with. Image recognisers, language models, and control systems are all trained by writing down a loss and walking it downhill. Outside AI, engineers minimise cost, weight, drag, or energy against constraints using the same gradient-following idea, often under the name "optimisation" with no machine learning in sight. Wherever there's a measurable objective and knobs to turn, following the slope toward a better value is the workhorse method.
Compute spectrum: the same loop, different scale
Gradient descent is the same algorithm at every tier; what changes is how much of it you can afford.
microcontroller
Usually no on-device training at all: the model is optimised elsewhere and shipped frozen. Any on-device adaptation is a few cheap steps on a tiny model.
mobile / edge
Light fine-tuning and on-device personalisation: a small number of gradient steps on a modest model, often on a subset of the parameters to keep the cost down.
workstation
Full training of small and mid-size models, and fine-tuning of larger ones. The optimisation loop runs comfortably; dataset size and step count set the wall-clock time.
hyperscale
Frontier pretraining: the same loop over hundreds of billions of parameters and trillions of tokens, sharded across thousands of accelerators. The algorithm is unchanged; the engineering is all about feeding it.
What this lesson does and doesn't do
It doesn't show how the gradient is computed through a deep network; that's backpropagation, and it gets its own lesson in the architectures phase. It doesn't cover the specific optimisers (momentum, Adam, and their relatives) beyond noting they exist to make the downhill walk faster and steadier. It doesn't promise that optimisation finds the best possible model; it finds a good one, usually, and that is a different and more honest claim.
What it does do is install the engine. From here, "the model was trained" reads as "an optimiser walked the loss downhill for a long time," and three questions share one root: why training costs what it does, why the learning rate matters, and why bigger models still train at all.
compression · what to carry forward
Training is optimisation: measure how wrong the model is (the loss), then change parameters to make it less wrong, and repeat.
The loss is the height (one number, how wrong); the gradient is the slope (a direction, which way the loss rises fastest). They are different objects.
The gradient points uphill; gradient descent steps the opposite way, downhill, using only the local slope, with no view of the whole landscape.
The learning rate is the step size: too large overshoots, too small crawls; many small steps beat one big one.
An optimisation landscape places every parameter setting at a height equal to its loss; training is a downhill walk across it.
A local minimum is low but not lowest; the global minimum is the lowest overall; in high dimensions saddle points dominate and bad local minima are rare.
Optimisation finds a good solution, not a guaranteed-best or perfect one.
Training is expensive because the landscape is vast (billions of dimensions), the data is large, and progress comes one small step at a time.
What you should be able to do now
Explain why training is a search for low loss across a landscape, using only the local slope.
Distinguish the loss (a height, one number) from the gradient (a direction, the slope).
Say why the gradient points uphill and why descent steps the opposite way.
Explain what the learning rate controls and why small steps usually beat one big step.
Distinguish a local minimum from the global minimum, and a saddle point from both.
Say why bad local minima worry people less now, without claiming optimisation finds the global minimum.
Explain why training large models is so expensive, in terms of dimensions, data, and step count.
Retrieval practice
Write before you reveal. Trace mechanism; don't summarise.
L19 Explain optimisation using the mountain-and-valley analogy. What is the landscape, what is height, and what can the walker actually sense?
Picture every possible setting of the model's parameters as a position on a landscape, and the loss at that setting as the height of the ground there. A badly wrong model sits high on a mountainside (high loss); a model that predicts well sits low in a valley (low loss). Training is a walk across this landscape trying to reach lower ground. The crucial limit is what the walker can sense: not the whole landscape from above, and not where the lowest valley is, but only the slope directly under its feet, right here and now. So optimisation has to be built entirely from local information. At each position it measures the slope (the gradient), steps in the downhill direction (opposite the gradient), arrives at a slightly lower point, and repeats. Many small steps carry it down into a valley. Because it only ever feels the local slope, it can settle in a valley that isn't the lowest one; in the very high-dimensional landscapes real models use, that's usually still a good place to be. The analogy captures the whole method: height is error, moving is changing parameters, and progress is feeling the slope and stepping downhill over and over.
L19 Explain why gradients help a model improve. Be precise about what the gradient is and why descent moves the opposite way.
The gradient at the model's current parameters is the direction in which the loss increases fastest, together with how steep that increase is. It's a purely local fact: it describes the slope right here, not the shape of the whole landscape. That single piece of information is enough to improve the model, because the direction that increases the loss fastest, reversed, is the direction that decreases it fastest. So gradient descent computes the gradient (steepest uphill) and then steps the opposite way (steepest downhill), which lowers the loss as much as any small step can. After the step the parameters are slightly better, the model predicts slightly more accurately, and the process repeats from the new position. The reason this is powerful is that it needs no global view: you can't see where the best model lives, but you can always measure the slope under your feet and move down it. The reason it's limited is the same: because it only follows the local slope, it descends into whatever valley is nearby, which need not be the lowest one. Keep the distinction sharp: the loss is the height the model is trying to reduce; the gradient is the direction it reads off to know which way "down" is.
↩ L18 Interleaved. How does entropy connect to optimisation and prediction? Trace the link from the distribution to the gradient step.
The chain runs distribution, then surprise, then loss, then gradient. A model's prediction is a distribution over outcomes (L17). Entropy and its relative cross-entropy (L18) measure surprise: cross-entropy is the average surprise the model's distribution assigns to what actually happened, high when the model put little weight on the real outcome, low when it expected it. That cross-entropy is exactly the loss that optimisation minimises for classifiers and language models. So "predicting better" and "lower loss" and "less surprise on real data" are three names for the same thing. Optimisation is what drives it down: the gradient of the cross-entropy loss with respect to the parameters points in the direction that would increase surprise fastest, and the model steps the opposite way, shifting probability toward the outcomes that actually occur. Each gradient step reshapes the model's distribution to be a little less surprised by reality, which means a slightly lower cross-entropy on the real data. There's also a floor, the irreducible (aleatoric) uncertainty in the data itself (L16): cross-entropy can't drop below the data's own entropy, so optimisation closes the gap to that floor rather than reaching zero. Entropy, in short, supplies the ruler (how wrong, in bits), and optimisation supplies the motion (step downhill on that ruler).
L19 A teammate worries that training "will just get stuck in a local minimum." Give the modern, realistic response, and distinguish local minima, the global minimum, and saddle points.
First the definitions. A local minimum is a point lower than everything in its immediate neighbourhood, so every nearby direction goes uphill and the gradient says stop, even though a lower point may exist elsewhere. The global minimum is the lowest point on the entire landscape. A saddle point is a place where the slope is flat but the ground goes down in some directions and up in others, so it isn't a trap: a downhill direction still exists. The worry comes from the one-dimensional picture, where a local minimum really does trap the ball. The realistic response is that real models optimise in landscapes with billions of dimensions, and for a point to be a local minimum every one of those billions of directions has to go uphill at once, which is rare. Flat-slope points are far more often saddles, which optimisation can slide off, especially with momentum. On top of that, the local minima that do occur tend to sit at losses close to the global minimum, and very large networks have many good solutions rather than one, so landing in a nearby valley is usually fine. The honest caveat: none of this guarantees reaching the global minimum or a perfect model. It means the practical goal, a good-enough solution, is usually reachable, and "stuck in a bad local minimum" is much less of a problem than the simple picture suggests.
↳ ahead Training is one simple loop run an enormous number of times. Predict why the next stretch of the course turns to parallelism and hardware rather than to a cleverer algorithm.
Because the bottleneck isn't the algorithm, it's the sheer volume of the same cheap operation. Gradient descent is simple and hasn't changed in spirit for decades; what changed is scale. A frontier run computes gradients over hundreds of billions of parameters, averaged across trillions of tokens, for hundreds of thousands of steps. Each gradient is built mostly from matrix multiplications (L14), the same operation repeated astronomically often. When a workload is one cheap operation done a colossal number of times, the way to go faster is to do many copies of it at once rather than to invent a smarter step. That is what parallelism is: spread the identical work across many processing units running simultaneously. So the next lesson turns to why GPUs, with thousands of small units doing the same operation in lockstep, fit this workload far better than a general-purpose CPU, and the lessons after that to how that compute scales. Optimisation explains how a model improves; hardware explains why it can improve at the scale that makes modern AI work. The story shifts from the maths of the downhill step to the machines that take billions of those steps in a reasonable amount of time.
Next station
You now have the engine of learning: measure the loss, follow the slope downhill, repeat. The loop is simple, and that turns out to be the point. The thing that makes it work for a model with hundreds of billions of parameters is being able to run the same arithmetic an enormous number of times, fast. That demand is what pulled hardware to the centre of modern AI. The next stretch of the wall is about meeting it: parallelism first, then the accelerators and the compute scaling that follow. L20 puts the conveyor belt on the wall.