Phase 2 put the maths under everything Phase 1 reasoned about. Eleven stations, one synthesis. The danger after a phase like that is the familiar one: the words feel solid because you've seen them, while the mechanisms underneath are still loose. Recognition masquerades as understanding.
Phase 3 will lean on this directly. Hardware only makes sense as the answer to a demand the wall created: enormous parallel arithmetic, driven by optimisation, on representations that can hold more. If the wall is loose, the machine room reads as a pile of disconnected specs. This calibration is where you find the loose links before they cost you.
This is retrieval, not re-reading. Read the question, close everything else, write your answer, then reveal the model answer and compare. The comparison is the work; the gap between your answer and the model answer is the diagnostic signal.
The exercises run from single-theme mechanism checks, through the five themes in turn, to integrated questions that span the whole wall. The integrated questions at the end matter most: connecting the themes is the thing Phase 2 was really for.
Vectors, distance, dimensions. These ask whether geometry is still a mechanism in your head, not just a picture.
A vector puts meaning at a position in a space, and training pulls inputs that should be treated alike toward nearby positions. Once that geometry exists, an input the model never saw still lands somewhere, and its neighbours are the seen inputs most like it. The model's behaviour at the new point is interpolated from those neighbours, so it transfers without having been told the answer.
That is what generalisation is, mechanically: behaviour spread smoothly across a space rather than stored per example. Without a geometry there is no notion of "near", so there is nothing for behaviour to transfer along; you are left with lookup, which only works on inputs already seen. The shape of the space is what decides how well an unseen input is served.
Distance and cosine similarity take two vectors and return one number for how alike they are. That single scalar is what lets a system ask "what is this like?" instead of "what exactly is this?". Raw storage can only answer exact-match queries; distance answers approximate ones.
Almost every useful operation reduces to it: nearest-neighbour retrieval and semantic search rank by distance, clustering groups by it, a classifier head scores each class by a dot product against the representation, and an attention weight is a normalised similarity between a query and a key. It is one of the most-used operation families in modern AI because comparison-by-proximity is exactly what generalisation needs, and the scalar is cheap to compute at scale.
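A minimal sketch of "what is this like?" as a computation, assuming cosine similarity as the measure. The three-dimensional vectors and their labels are made up for illustration; real embeddings are learned and far wider.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items):
    """Return the label of the stored vector most similar to the query."""
    return max(items, key=lambda label: cosine_similarity(query, items[label]))

# Hypothetical "embeddings", invented for this example only.
items = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}

# An input that was never stored still lands somewhere in the space,
# and its nearest stored neighbour is the seen input most like it.
unseen = [0.95, 0.05, 0.0]
print(nearest(unseen, items))  # cat
```

An exact-match lookup on `unseen` would return nothing; the similarity score is what allows an approximate answer.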
Capacity tracks the number of independent directions a representation has (L13). If the task needs more distinctions than the space has directions, different inputs get crammed into overlapping regions and become hard to separate. Classes that should sit apart collapse together.
The damage is irreversible because the merge already happened: once two inputs map to the same (or nearly the same) point, the information that separated them is gone, and no downstream layer can pull them back apart. The symptom is systematic confusion between things that differ along an axis the representation couldn't afford to keep. The fix has to be upstream (a wider representation), not downstream.
Matrices and projections. These check that you can separate reshaping a space from selecting within it, and see why discarding is both useful and dangerous.
A projection keeps a chosen subspace and drops the rest. That is valuable precisely because real inputs carry variation the task doesn't need: sensor noise, nuisance directions, redundancy. Dropping that denoises the input, forces the model toward the structure shared across examples (less to memorise), and saves compute by working in fewer dimensions.
It is the right move when what's discarded is irrelevant to the task, so "kept" and "task-relevant" line up. It is harmful when the projection drops a direction the task actually needed, because then the loss is silent and permanent. The skill is choosing the subspace so the surviving directions are the ones that matter; a learned layer is trained to do exactly that.
A matrix transforms the entire space at once: it can rotate, stretch, shear, and re-base a representation into a different form, and when it is square and full-rank that move can be undone. Its job is transformation, reshaping a representation so the next stage can work with it.
A projection is the special case that collapses the space onto a subspace, keeping some directions and zeroing the rest. It is many-to-one, so it cannot be inverted, and its job is selection, deciding what information survives. A network uses both together: transforms to re-base the geometry, projections to discard what the next stage doesn't need. Conflating them hides the fact that one is reversible reshaping and the other is deliberate, irreversible loss.
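The contrast can be checked on a toy case. The sketch below assumes a 2×2 full-rank matrix `A` (with its inverse written out by hand) and a projection `P` onto the first axis; both are illustrative choices, not anything taken from a real network.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

A = [[2.0, 1.0], [0.0, 1.0]]        # full rank: a stretch plus a shear
A_inv = [[0.5, -0.5], [0.0, 1.0]]   # its inverse, so the move can be undone
P = [[1.0, 0.0], [0.0, 0.0]]        # projection: keep axis 0, zero axis 1

u = [3.0, 1.0]
v = [3.0, 5.0]                      # differs from u only along axis 1

# Transformation: reshaped, but fully recoverable.
print(matvec(A_inv, matvec(A, u)))  # [3.0, 1.0]

# Selection: u and v become identical, and nothing downstream can tell them apart.
print(matvec(P, u), matvec(P, v))   # [3.0, 0.0] [3.0, 0.0]
```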
If a projection drops a direction the task actually needed, every pair of inputs that differed only along that direction becomes identical to everything downstream. The system then treats them the same: it mis-classifies, conflates two cases, or ignores a feature entirely.
Because the map is many-to-one, no later stage and no prompt can recover the distinction; the information is simply absent. So the failure shape is a systematic, repeatable confusion between things that vary along the discarded axis, plus a blind spot that looks like the model "not caring" about a feature when really it can no longer see it. This is why the choice of what to keep is a design decision with consequences, not a free optimisation.
Probability, distributions, entropy. These check that you read a model's output as a distribution with measurable uncertainty, not as a single answer.
A single answer is one point read off the distribution, usually the most likely one. The distribution holds far more: the weight on every outcome, how close the runner-up is, and how spread out the belief is. Collapsing to one answer discards all of that, which is the same information-discarding move from L15 applied to a belief.
It matters because two predictions can share a top answer while having very different spreads, and they call for different actions: a 0.55 top answer and a 0.95 top answer are not the same situation. The distribution is what tells them apart, lets a system abstain or escalate when unsure, and exposes the second-best option. Keep the distribution until an action forces a single choice.
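As a sketch of why the spread changes the action: the two hypothetical predictions below share a top answer, but only one clears a confidence threshold. The threshold of 0.8 and the function name `decide` are assumptions made for the example, not a recommended setting.

```python
def decide(probs, threshold=0.8):
    """Return the top label if its probability clears the threshold, else abstain."""
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        return label
    return "abstain"

# Hypothetical output distributions over three outcomes.
confident = {"A": 0.95, "B": 0.03, "C": 0.02}
unsure = {"A": 0.55, "B": 0.40, "C": 0.05}

# Same top answer if you collapse early...
print(max(confident, key=confident.get), max(unsure, key=unsure.get))  # A A

# ...different actions if you keep the distribution until the decision.
print(decide(confident))  # A
print(decide(unsure))     # abstain
```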
Confidence is the probability the model puts on its answer; being right is a fact about the world. They coincide only if the model is calibrated, meaning that across the cases it calls 0.9, about 90% are actually correct. Large models are frequently overconfident and uncalibrated, so a confident wrong answer is precisely what you should expect on inputs outside the data they cover or on facts they never grounded.
That is the hallucination mechanism from L10 in this language: the model places high probability mass on a fluent continuation that happens to be false, and reports the mass as confidence. The lesson is that confidence is only as trustworthy as the calibration behind it, so the response is to measure calibration or to ground the prediction, never to trust the number on its own.
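One common way to measure calibration is to bin predictions by confidence and compare each bin's average confidence with its actual accuracy. The sketch below is a simplified expected-calibration-error computation under that assumption; the records are invented, and the bin count of 10 is an arbitrary choice.

```python
def expected_calibration_error(records, n_bins=10):
    """records: list of (confidence in [0, 1], was_correct) pairs.

    Returns the size-weighted average gap between stated confidence
    and observed accuracy across confidence bins.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that says 0.9 and is right 9 times in 10: calibrated.
calibrated = [(0.9, True)] * 9 + [(0.9, False)] * 1
# Hypothetical model that says 0.9 and is right 6 times in 10: overconfident.
overconfident = [(0.9, True)] * 6 + [(0.9, False)] * 4

print(round(expected_calibration_error(calibrated), 3))     # 0.0
print(round(expected_calibration_error(overconfident), 3))  # 0.3
```

Both models report the same confidence; only the comparison against outcomes shows which number can be trusted.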
High entropy means the model's probability is spread across many outcomes rather than concentrated: it isn't committing, the next outcome is hard to predict, and from the model's view the situation is genuinely uncertain. It is a readout of how much the model doesn't know about the outcome before it lands.
What it doesn't tell you: whether that uncertainty is irreducible (genuine randomness or an underdetermined situation, the aleatoric case from L16) or just the model lacking data or capacity (epistemic), since entropy alone doesn't distinguish the two. And it doesn't mean the prediction is wrong; a high-entropy forecast can be the honest one when the outcome really is open. Entropy measures the spread of the belief, not the correctness of any single draw from it.
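A short sketch of entropy as a measure of spread, using the standard Shannon formula in bits. The example distributions are invented for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]   # the model is committing
uniform = [0.25, 0.25, 0.25, 0.25]  # the model is not committing
fair_coin = [0.5, 0.5]              # maximally uncertain, and correctly so

print(entropy_bits(uniform))    # 2.0
print(entropy_bits(fair_coin))  # 1.0
print(entropy_bits(peaked) < entropy_bits(uniform))  # True
```

The fair coin is the case to hold on to: its forecast has the highest entropy two outcomes allow, yet it is the honest one. The number reports spread and says nothing about whether the uncertainty is reducible or whether any single draw will be right.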
Optimisation. These check that you can say why learning needs a loss and what the gradient and the loss curve actually tell you.
Gradient descent moves parameters in the direction that lowers a single number, and that number is the loss. Without a loss there is nothing to take the gradient of, so there is no direction to step and no definition of what "improving" even means. The loss is what turns a vague goal ("be good at the task") into a quantity the optimiser can drive down.
Choosing the loss is choosing what the model is rewarded for getting right; it is the objective from L1 and L19, and for prediction it is the cross-entropy from L18, the surprise the model's distribution assigns to reality. Change the loss and you change what the model becomes. With no loss there is no learning at all, only a fixed function that never updates.
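A minimal sketch of the loss giving the optimiser something to push against, assuming a one-parameter model that predicts a single probability through a sigmoid. The data (eight 1s, two 0s), the learning rate, and the step count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(theta, outcomes):
    """Average surprise (in nats) of the model's probability at the observed outcomes."""
    p = sigmoid(theta)
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes) / len(outcomes)

def gradient(theta, outcomes):
    """d(loss)/d(theta) for sigmoid + cross-entropy: predicted probability minus observed frequency."""
    return sigmoid(theta) - sum(outcomes) / len(outcomes)

outcomes = [1] * 8 + [0] * 2   # reality: the event happens 80% of the time
theta = 0.0                    # start by predicting 0.5
learning_rate = 1.0

start_loss = cross_entropy(theta, outcomes)
for _ in range(200):
    theta -= learning_rate * gradient(theta, outcomes)
end_loss = cross_entropy(theta, outcomes)

print(round(sigmoid(theta), 3))  # 0.8
print(end_loss < start_loss)     # True
```

Remove `cross_entropy` and there is no `gradient` to compute, so `theta` never moves. Note also that the final loss is not zero: the outcomes themselves are uncertain, and the model can only match that uncertainty, not remove it.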
Two among several: a mis-set learning rate, and a reached floor. If the learning rate is too large, steps overshoot and the loss bounces or diverges; too small, and it crawls, barely moving despite many steps. If the model has reached the irreducible (aleatoric) entropy of the data, or run out of capacity or signal, the loss flattens at a sensible non-zero value because there's little left to remove.
You tell them apart from the curve and a nudge. Oscillating or diverging loss points to the learning rate. A smooth flattening near a plausible floor points to saturation or a capacity/data limit. A long flat stretch that drops after a change (momentum, a learning-rate bump) points to a slow region such as a saddle. The gradient's magnitude is a clue: near-zero everywhere means a flat region; very large means instability.
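The learning-rate signatures can be reproduced on the simplest possible loss. The sketch assumes the loss is `w * w` (gradient `2 * w`); the three learning rates are chosen only to make each regime visible.

```python
def loss_curve(learning_rate, steps=50, w=1.0):
    """Run plain gradient descent on loss(w) = w**2 and record the loss at each step."""
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        w -= learning_rate * 2.0 * w   # gradient of w**2 is 2w
    return losses

too_large = loss_curve(1.1)     # each step overshoots the minimum and lands further away
too_small = loss_curve(0.001)   # each step moves in the right direction, but barely
reasonable = loss_curve(0.3)

print(too_large[-1] > too_large[0])   # True: diverging
print(too_small[-1] > 0.8)            # True: still near the starting loss of 1.0
print(reasonable[-1] < 1e-6)          # True: converged
```

This toy loss has a floor of zero and no saddles, so it only illustrates the learning-rate half of the diagnosis; a plateau at a sensible non-zero value needs the floor argument instead.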
Parallelism and scaling. These check that you hold the limits honestly: gains diminish, hardware has ceilings, and efficiency matters as much as raw scale.
First, the learning itself diminishes. A model captures the most-predictable structure first, so each added increment of compute removes less of the remaining error than the last. And there is a floor: the irreducible (aleatoric) entropy of the data, which no amount of compute drops below. So the curve bends and flattens for reasons internal to the problem.
Second, you can't always convert compute into useful work. Parallelism has a serial fraction and communication overhead (L20), so more hardware yields less-than-proportional speedup; and compute, data, and model size have to grow together, or extra compute saturates a model that's too small or starves on data that's too thin. Unlimited input runs into both a learning floor and a hardware ceiling, so it cannot buy unlimited capability.
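The serial-fraction limit is usually stated as Amdahl's law. The sketch below applies that formula with an assumed serial fraction of 5%; it ignores communication overhead, which in practice pushes the real speedup lower still.

```python
def amdahl_speedup(serial_fraction, workers):
    """Ideal speedup when only the non-serial part of the work can be split across workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

serial_fraction = 0.05  # assumed for illustration: 5% of the work cannot be parallelised

for workers in (1, 10, 100, 1000):
    print(workers, round(amdahl_speedup(serial_fraction, workers), 2))

# However many workers are added, the speedup stays below 1 / serial_fraction = 20.
```

Going from 100 to 1000 workers multiplies the hardware by ten and the speedup by far less, which is the "less-than-proportional" claim in numbers.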
When each extra unit of capability from raw scale costs more than the last, anything that gets more capability from the same compute is worth as much as buying more hardware. That makes efficiency a first-class lever: better hardware does more arithmetic per watt, better algorithms reach a good solution in fewer steps, and better architectures get more capability per parameter. Each effectively shifts the scaling curve so a given capability costs less.
Parallelism still matters because the demand for compute is enormous regardless of the returns: optimisation specifies a colossal amount of independent matrix arithmetic, and parallelism is the only way to perform it in finite time. So parallelism makes the scale reachable at all, while efficiency makes the diminishing returns affordable. Modern progress is the combination, which is why "AI is just brute force" misses half the story.
These are the most important questions in the calibration. Each one requires connecting across themes. If the short questions felt fine but these break, your themes are solid in isolation and loose where they meet, which is exactly the thing to fix before Phase 3.
Representation: tokens become embedding vectors in a learned space whose geometry encodes similarity (L11–L13). Transformation: each layer applies matrices (attention's query/key/value projections and the feed-forward maps) that reshape and select within that space (L14–L15).

Uncertainty: the model's output at each step is a probability distribution over the whole vocabulary, and its entropy says how sure it is (L16–L18). Learning: training minimises cross-entropy, the surprise the distribution assigns to the real next token, by gradient descent walking the loss downhill (L18–L19). Scale: the whole thing runs across thousands of parallel accelerators (L20) and its capability follows a diminishing-returns scaling curve (L21). Five themes, one system, all present at once.
Representation: users and items become vectors in a shared space where closeness predicts affinity (L11–L13). Transformation: matrix factorisation, a low-rank projection, reshapes the sparse interaction data into those compact user and item vectors (L14–L15).
Uncertainty: the system outputs a distribution over candidate items, how likely you engage with each; a peaked distribution means a confident single guess, a spread one means it's unsure (L16–L18). Learning: it is fitted by minimising a loss over observed engagements via gradient descent (L19). Scale: served across millions of users and items, where parallelism makes ranking tractable and scaling governs how much more data and parameters help (L20–L21). The same five themes, different slots; if the pattern transfers, the wall has stuck.
The connection is that the quantity optimisation minimises is a measure of uncertainty. A model's prediction is a distribution (theme 3); cross-entropy measures how surprised that distribution is by what actually happened, high when it put little weight on the truth and low when it expected it. That cross-entropy is the loss (theme 4), so each successful gradient step reshapes the distribution to put more weight on what actually happened, which is exactly what being less surprised by reality means.
It is structural, not incidental, because without a measure of uncertainty there would be no loss to descend in the first place. And the floor from L16 matters: cross-entropy cannot drop below the data's own irreducible entropy, so optimisation closes the gap to that floor rather than reaching zero. Uncertainty supplies the ruler; optimisation supplies the motion along it.
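The floor can be written down directly. In the standard decomposition below, $p$ stands for the true data distribution and $q$ for the model's distribution (both symbols are introduced here for illustration):

```latex
H(p, q)
  \;=\; -\sum_{x} p(x)\,\log q(x)
  \;=\; \underbrace{-\sum_{x} p(x)\,\log p(x)}_{H(p)}
  \;+\; \underbrace{\sum_{x} p(x)\,\log\frac{p(x)}{q(x)}}_{D_{\mathrm{KL}}(p\,\|\,q)\;\ge\;0}
```

Training can only shrink the second term, the gap between model and data. The first term, $H(p)$, is the data's own entropy and does not depend on the model, so the loss bottoms out there when $q = p$.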
Scaling adds compute, but compute only helps if there is something able to absorb it. Capacity comes first: extra compute lets you build a larger model only if larger representations (more dimensions, L13) can encode more structure; pour compute into a too-small representation and it saturates. So the payoff of scale is bounded by the representation's ability to hold more.
The operations matter too. The transformations are matrix multiplications (L14), independent arithmetic that parallel hardware accelerates (L20); that is why these architectures scale at all. A system whose core operation didn't parallelise, or whose representation couldn't grow, would gain little from more compute. So scaling is the reward for having representations that can hold more and transformations the hardware can run in parallel; "add compute to anything" fails because most things can't turn compute into capability the way matmul over a learned geometry can.
Take an LLM predicting the next token after "The capital of France is". Representation: the prior tokens are embedding vectors in a learned space (L11–L13). Transformation: attention and feed-forward matrices reshape and combine them, projecting onto the directions that matter for the continuation (L14–L15). Prediction: the output is a distribution over the vocabulary, sharply peaked on "Paris", so low entropy and high confidence (L16–L18).
Why that distribution: training minimised cross-entropy against real text by gradient descent, which pulled the weight onto "Paris" because it followed that phrase overwhelmingly in the data (L18–L19). Why the model can do this at all: it was trained at a scale only parallel hardware makes possible, on a diminishing-returns curve where this much compute bought broad competence (L20–L21). One token, all five themes, no magic, just the chain.
This is where the calibration earns its name. You are not scoring yourself; you are sorting your answers into where the reasoning held and where it broke, and reading off what to do next.
Place yourself on the spectrum honestly. Most readers sit across two zones, strong on some themes and shaky on a connection or two. The map below turns "shaky" into "revisit this station".
| If this broke | Revisit | Why it matters for Phase 3 and beyond |
|---|---|---|
| Geometry → generalisation, distance, or capacity (C2.1, C2.2, C2.3) | L11 · L12 · L13 | Every later representation is a vector geometry; if "near means similar" isn't mechanism for you, embeddings and attention stay symbols. |
| Matrices vs projections (C2.4, C2.5, C2.6) | L14 · L15 | Architectures are stacks of transforms and projections; confusing reshape with discard hides where information is lost. |
| Distribution vs single answer, calibration (C2.7, C2.8) | L16 · L17 | Model confidence and uncertainty estimation in deployment depend on reading outputs as distributions, not labels. |
| What entropy tells you (C2.9) | L18 | Cross-entropy is the training loss for almost everything ahead; if entropy is fuzzy, the loss is fuzzy. |
| Why learning needs a loss / why it stalls (C2.10, C2.11) | L19 | All of training and scaling is optimisation; if the loss-and-gradient picture is loose, Phase 5 floats. |
| Why scaling has limits (C2.12, C2.13) | L20 · L21 | Phase 3 is the hardware that delivers the compute; the diminishing-returns story is what makes that hardware matter. |
| The integrated traces broke (C2.14–C2.18) | S2 synthesis | S2 is the worked example of connecting the themes. If the joins broke, re-walk the wall as one picture there. |
Phase 2 explained the ideas. It ends on a demand it can't satisfy alone: those ideas need an enormous amount of arithmetic, run in parallel, and the cost lives in the hardware. Phase 3 walks through the heavy door into the machine room and takes up that hardware (CPUs, GPUs, memory, bandwidth, accelerators, servers, clusters, and datacentres) as the answer to the demand the wall created.
You don't need any of that yet to pass through this checkpoint. You need the wall to hold: representation, transformation, uncertainty, learning, scale, connected into one machine. If it does, the machine room will read as the silicon underneath ideas you already understand, not as a fresh pile of specifications.
The wall is behind you and the bench is clear. Through the heavy door is the server bay, where Phase 3 shows the machines that turn these foundations into working systems. Cross when the five themes hold together on their own; revisit the fuzzy stations first if they don't.