PHASE 2 → 3 · CALIBRATION
C2 · calibration stop

Calibration. Representation, uncertainty, and scale

Calibration assessment C2. Between Phase 2 and Phase 3. ~40 min with honest retrieval. Durability tier 1 (the checkpoint that decides whether the whiteboard wall actually attached, before the machines arrive).

Memory palace · Central workbench · calibration stop
You pause at the bench before the heavy door to the server bay. The whiteboard wall is behind you, eleven stations of it. Before crossing into the machine room, you check one thing: that you can read the maths off the wall without the lessons open, and that the five themes still connect into one system in your head.
Core idea. Calibration is a diagnostic, not an exam. It asks one question: when you reason through the five themes of the wall on your own, do the mechanisms hold and the connections between them survive, or do they break? You are checking your own model before the hardware phase builds on it. There are no marks here, only a map of what to revisit.

Why this lesson exists

Phase 2 put the maths under everything Phase 1 reasoned about. Eleven stations, one synthesis. The danger after a phase like that is the familiar one: the words feel solid because you've seen them, while the mechanisms underneath are still loose. Recognition masquerades as understanding.

Phase 3 will lean on this directly. Hardware only makes sense as the answer to a demand the wall created: enormous parallel arithmetic, driven by optimisation, on representations that can hold more. If the wall is loose, the machine room reads as a pile of disconnected specs. This calibration is where you find the loose links before they cost you.

How to approach this

This is retrieval, not re-reading. Read the question, close everything else, write your answer, then reveal the model one and compare. The comparison is the work; the gap between your answer and the model is the diagnostic signal.

The exercises run from single-theme mechanism checks, through the five themes in turn, to integrated questions that span the whole wall. The integrated questions at the end matter most: connecting the themes is the thing Phase 2 was really for.

[Figure: the five themes of the wall. Theme 1, Representations (L11–L13): how is it represented? Theme 2, Transformations (L14–L15): how is it transformed? Theme 3, Uncertainty (L16–L18): how is it measured? Theme 4, Learning (L19): how does it improve? Theme 5, Scale (L20–L21): why is it practical?]
FIG C2.1. What this calibration checks. The five themes of the wall and the question each answers. Use it as a map: for each theme, ask whether you can explain its mechanism from memory, and where the connections between themes hold or break.

Section 1. Representations

Vectors, distance, dimensions. These ask whether geometry is still a mechanism in your head, not just a picture.

C2.1 mechanism · geometry → generalisation
Why does representing things as points in a vector space enable generalisation? Trace the mechanism; don't just assert that it does.
Reveal model answer
Strong answer demonstrates: generalisation lives in the geometry: training places similar inputs near each other, so a model's response to an unseen input is governed by the seen inputs nearby.

A vector puts meaning at a position in a space, and training pulls inputs that should be treated alike toward nearby positions. Once that geometry exists, an input the model never saw still lands somewhere, and its neighbours are the seen inputs most like it. The model's behaviour at the new point is interpolated from those neighbours, so it transfers without having been told the answer.

That is what generalisation is, mechanically: behaviour spread smoothly across a space rather than stored per example. Without a geometry there is no notion of "near", so there is nothing for behaviour to transfer along; you are left with lookup, which only works on inputs already seen. The shape of the space is what decides how well an unseen input is served.
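
The contrast between lookup and geometry can be sketched in a few lines. This is a toy illustration, not a model: the 2-D coordinates, the labels, and the helper names `lookup` and `nearest_label` are all invented for the example, and a nearest-neighbour rule stands in for the smoother interpolation a trained model performs.

```python
import math

# Hypothetical 2-D "embeddings" for inputs already seen, each with a known label.
seen = {
    (0.9, 0.1): "cat",
    (1.0, 0.2): "cat",
    (0.1, 0.9): "car",
    (0.2, 1.0): "car",
}

def lookup(point):
    """Exact-match storage: only answers for inputs already seen."""
    return seen.get(point)

def nearest_label(point):
    """Geometry: answer with the label of the closest seen input."""
    closest = min(seen, key=lambda s: math.dist(s, point))
    return seen[closest]

unseen = (0.8, 0.3)                 # never stored, but it lands near the "cat" points
lookup_answer = lookup(unseen)      # None: lookup has nothing to transfer along
nn_answer = nearest_label(unseen)   # behaviour borrowed from the neighbours
print(lookup_answer, nn_answer)
```

The unseen point gets an answer only because "near" is defined; remove the distance and the second function has nothing to work with.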

If this felt weak: L11 (vectors), L12 (distance & similarity), L13 (dimensions). Walk the arrow on the board, the pinned map, and the layered grids.
C2.2 mechanism · why distance is useful
Why is distance (or similarity) such a useful operation? What does turning a comparison into one number let a system do that raw storage doesn't?
Reveal model answer
Strong answer demonstrates: distance collapses a comparison of two high-dimensional vectors to a single scalar, which is the workhorse behind retrieval, clustering, classification, and attention, none of which need exact matching.

Distance and cosine similarity take two vectors and return one number for how alike they are. That single scalar is what lets a system ask "what is this like?" instead of "what exactly is this?". Raw storage can only answer exact-match queries; distance answers approximate ones.

Almost every useful operation reduces to it: nearest-neighbour retrieval and semantic search rank by distance, clustering groups by it, a classifier head scores classes by it, and an attention weight is a similarity between a query and a key. It is one of the most-used operation families in modern AI because comparison-by-proximity is exactly what generalisation needs, and the scalar is cheap to compute at scale.
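
As a minimal sketch of "one number per comparison", the snippet below ranks a few stored vectors by cosine similarity to a query. The three-dimensional vectors and the names in `docs` are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stored vectors; none of them equals the query exactly.
docs = {
    "kitten": (0.9, 0.2, 0.1),
    "puppy": (0.7, 0.6, 0.1),
    "engine": (0.1, 0.1, 0.9),
}
query = (1.0, 0.3, 0.0)

# Retrieval is a sort on the scalar; no exact match is required.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)
```

The same scalar, thresholded or grouped instead of sorted, gives classification or clustering.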

If this felt weak: L12 (distance & similarity). The pinned map: dot product, cosine, and where meaning becomes geometry.
C2.3 transfer · too few dimensions
What happens to a representation that has too few dimensions for its task? Predict the failure and explain why it can't be fixed downstream.
Reveal model answer
Strong answer demonstrates: too few dimensions means too little capacity, so distinct things are forced to share coordinates; the collapse is irreversible and no later layer can recover what the representation merged.

Capacity tracks the number of independent directions a representation has (L13). If the task needs more distinctions than the space has directions, different inputs get crammed into overlapping regions and become hard to separate. Classes that should sit apart collapse together.

The damage is irreversible because the merge already happened: once two inputs map to the same (or nearly the same) point, the information that separated them is gone, and no downstream layer can pull them back apart. The symptom is systematic confusion between things that differ along an axis the representation couldn't afford to keep. The fix has to be upstream (a wider representation), not downstream.
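
A toy version of the collapse, assuming a made-up two-feature description (size, colour) squeezed into a one-dimensional representation that can only keep size. The item names and `squeeze_to_1d` are hypothetical.

```python
# Hypothetical items described by (size, colour) coordinates.
items = {
    "small_red": (1.0, 0.0),
    "small_blue": (1.0, 1.0),
    "large_red": (5.0, 0.0),
}

def squeeze_to_1d(vec):
    """A 1-D representation that can only afford to keep 'size'."""
    return (vec[0],)

squeezed = {name: squeeze_to_1d(v) for name, v in items.items()}

# Items that differed only in colour now share a coordinate.
collided = squeezed["small_red"] == squeezed["small_blue"]
# Items that differ in size are still separable.
still_distinct = squeezed["small_red"] != squeezed["large_red"]
print(collided, still_distinct)
```

Anything downstream receives only `squeezed`, so no later function can tell the red item from the blue one; the only repair is a representation with room for the colour axis.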

If this felt weak: L13 (dimensions). The layered grids: effective vs nominal dimensionality and capacity.

Section 2. Transformations

Matrices and projections. These check that you can separate reshaping a space from selecting within it, and see why discarding is both useful and dangerous.

C2.4 mechanism · projection despite discarding
Why are projections useful despite the fact that they throw information away? When is discarding the right move, and when is it harmful?
Reveal model answer
Strong answer demonstrates: discarding is the point: dropping noise, redundancy, and irrelevant directions aids generalisation and cuts cost; it is right when the discarded directions carry no task-relevant signal, and harmful when they do.

A projection keeps a chosen subspace and drops the rest. That is valuable precisely because real inputs carry variation the task doesn't need: sensor noise, nuisance directions, redundancy. Dropping that denoises the input, forces the model toward the structure shared across examples (less to memorise), and saves compute by working in fewer dimensions.

It is the right move when what's discarded is irrelevant to the task, so "kept" and "task-relevant" line up. It is harmful when the projection drops a direction the task actually needed, because then the loss is silent and permanent. The skill is choosing the subspace so the surviving directions are the ones that matter; a learned layer is trained to do exactly that.

If this felt weak: L15 (projections). The stack of transparent sheets: selective preservation.
C2.5 comparison · matrices vs projections
Matrices and projections are both linear maps. What different roles do they play in a model? Be precise about how they differ.
Reveal model answer
Strong answer demonstrates: a matrix is the general transform that reshapes or moves a whole space (often invertibly); a projection is the specific lossy map that reduces onto a subspace (many-to-one, non-invertible). One transforms; the other selects.

A matrix bends and moves the entire space at once: it can rotate, stretch, shear, and re-base a representation into a different form, and when it is square and full-rank that move can be undone. Its job is transformation, reshaping a representation so the next stage can work with it.

A projection is the special case that collapses the space onto a subspace, keeping some directions and zeroing the rest. It is many-to-one, so it cannot be inverted, and its job is selection, deciding what information survives. A network uses both together: transforms to re-base the geometry, projections to discard what the next stage doesn't need. Conflating them hides the fact that one is reversible reshaping and the other is deliberate, irreversible loss.
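
The reversible/irreversible split can be checked directly on 2×2 matrices. The sketch below uses a 90-degree rotation as the invertible transform and a projection onto the first axis as the lossy one; both matrices and the helper `matvec` are chosen for illustration only.

```python
def matvec(m, v):
    """Multiply a matrix (tuple of rows) by a vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m)))

# A 90-degree rotation: square and full-rank, so it has an inverse.
rotate = ((0, -1), (1, 0))
rotate_back = ((0, 1), (-1, 0))

# Projection onto the first axis: keeps x, zeroes y.
project_x = ((1, 0), (0, 0))

a, b = (2, 3), (2, 7)   # two inputs that differ only along the dropped axis

restored = matvec(rotate_back, matvec(rotate, a))      # the transform can be undone
proj_a, proj_b = matvec(project_x, a), matvec(project_x, b)
projected_twice = matvec(project_x, proj_a)            # projecting again changes nothing

print(restored, proj_a, proj_b, projected_twice)
```

After the rotation, `a` can be recovered exactly. After the projection, `a` and `b` are the same vector, so no matrix applied afterwards can separate them again.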

If this felt weak: L14 (matrices) and L15 (projections). The grid (bend the space) versus the sheets (drop directions).
C2.6 diagnosis · mistakes from discarding
What kinds of mistakes happen when a system discards information that turns out to matter? Describe the shape of the failure, not just that it fails.
Reveal model answer
Strong answer demonstrates: dropping a task-relevant direction makes distinct inputs indistinguishable downstream, producing systematic confusion and a blind spot the model cannot be prompted out of.

If a projection drops a direction the task actually needed, every pair of inputs that differed only along that direction becomes identical to everything downstream. The system then treats them the same: it mis-classifies, conflates two cases, or ignores a feature entirely.

Because the map is many-to-one, no later stage and no prompt can recover the distinction; the information is simply absent. So the failure shape is a systematic, repeatable confusion between things that vary along the discarded axis, plus a blind spot that looks like the model "not caring" about a feature when really it can no longer see it. This is why the choice of what to keep is a design decision with consequences, not a free optimisation.

If this felt weak: L15 (projections). The sheets again: what survives the filter is all the model has.

Section 3. Uncertainty

Probability, distributions, entropy. These check that you read a model's output as a distribution with measurable uncertainty, not as a single answer.

C2.7 comparison · distribution vs single answer
Why is a distribution more informative than a single predicted answer? What exactly do you lose by collapsing it to one answer?
Reveal model answer
Strong answer demonstrates: the distribution carries the spread, the alternatives, and the confidence; the single answer is a lossy summary, and downstream decisions need the part you dropped.

A single answer is one point read off the distribution, usually the most likely one. The distribution holds far more: the weight on every outcome, how close the runner-up is, and how spread out the belief is. Collapsing to one answer discards all of that, which is the same information-discarding move from L15 applied to a belief.

It matters because two predictions can share a top answer while having very different spreads, and they call for different actions: a 0.55 top answer and a 0.95 top answer are not the same situation. The distribution is what tells them apart, lets a system abstain or escalate when unsure, and exposes the second-best option. Keep the distribution until an action forces a single choice.
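
A small sketch of why the two situations call for different actions. The distributions are invented, and the 0.8 abstention threshold in `decide` is an assumed policy for the example, not a recommended value.

```python
def top_answer(dist):
    """Collapse a distribution to its most likely outcome."""
    return max(dist, key=dist.get)

def decide(dist, threshold=0.8):
    """Act on the top answer only if its probability clears the (assumed) threshold."""
    label = top_answer(dist)
    return label if dist[label] >= threshold else "abstain"

confident = {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}
unsure = {"Paris": 0.55, "Lyon": 0.40, "Nice": 0.05}

print(top_answer(confident), top_answer(unsure))   # identical once collapsed
print(decide(confident), decide(unsure))           # different once the spread is kept
```

Once both are collapsed to "Paris" the difference is unrecoverable, which is why the collapse belongs at the point of action and not before it.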

If this felt weak: L16 (probability) and L17 (distributions). The dice rack and the funnel: the landscape is the answer.
C2.8 diagnosis · confident but wrong
Can a confident prediction be wrong in a way you should have expected? Explain using calibration, and connect it to a failure you've already met.
Reveal model answer
Strong answer demonstrates: confidence is the probability the model assigns, which equals correctness only when the model is calibrated; an overconfident model on ungrounded or out-of-distribution input is exactly the hallucination shape.

Confidence is the probability the model puts on its answer; being right is a fact about the world. They coincide only if the model is calibrated, meaning that across the cases it calls 0.9, about 90% are actually correct. Large models are often overconfident and poorly calibrated, so a confident wrong answer is precisely what you should expect on inputs outside the data they cover or on facts they never grounded.

That is the hallucination mechanism from L10 in this language: the model places high probability mass on a fluent continuation that happens to be false, and reports the mass as confidence. The lesson is that confidence is only as trustworthy as the calibration behind it, so the response is to measure calibration or to ground the prediction, never to trust the number on its own.
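
Measuring calibration can be as simple as comparing stated confidence with observed accuracy. The log below is fabricated purely to show the arithmetic, and `accuracy_at` is a hypothetical helper; real checks bin confidences into ranges and use many more predictions.

```python
# Hypothetical log of (confidence the model reported, whether it was right).
predictions = (
    [(0.9, True)] * 6 + [(0.9, False)] * 4 +
    [(0.6, True)] * 6 + [(0.6, False)] * 4
)

def accuracy_at(confidence, log):
    """Fraction of predictions made at this confidence that were correct."""
    hits = [correct for conf, correct in log if conf == confidence]
    return sum(hits) / len(hits)

# Gap between what the model claimed and what actually happened.
gap_at_90 = 0.9 - accuracy_at(0.9, predictions)   # overconfident at this level
gap_at_60 = 0.6 - accuracy_at(0.6, predictions)   # calibrated at this level
print(round(gap_at_90, 2), round(gap_at_60, 2))
```

In this invented log the model is right 60% of the time at both confidence levels, so its 0.6 is trustworthy and its 0.9 is not; the number alone does not tell you which.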

If this felt weak: L16 (probability: confidence and calibration) and L18 (entropy). Recall the hallucination trace from L10.
C2.9 mechanism · high entropy
What does high entropy in a model's output distribution tell you, and just as importantly, what does it not tell you?
Reveal model answer
Strong answer demonstrates: high entropy means a spread-out distribution, and therefore a lot of remaining uncertainty and low predictability; it does not tell you the answer is wrong, nor by itself separate irreducible from reducible uncertainty.

High entropy means the model's probability is spread across many outcomes rather than concentrated: it isn't committing, the next outcome is hard to predict, and from the model's view the situation is genuinely uncertain. It is a readout of how much the model doesn't know about the outcome before it lands.

What it doesn't tell you: whether that uncertainty is irreducible (genuine randomness or an underdetermined situation, the aleatoric case from L16) or just the model lacking data or capacity (epistemic), since entropy alone doesn't distinguish the two. And it doesn't mean the prediction is wrong; a high-entropy forecast can be the honest one when the outcome really is open. Entropy measures the spread of the belief, not the correctness of any single draw from it.
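
For concreteness, here is Shannon entropy computed in bits for three invented distributions. The function name `entropy_bits` and the example numbers are illustrative only.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]     # the model is committing
uniform = [0.25, 0.25, 0.25, 0.25]    # the model is not committing at all
fair_coin = [0.5, 0.5]                # maximal entropy for two outcomes

print(entropy_bits(peaked), entropy_bits(uniform), entropy_bits(fair_coin))
```

The fair-coin forecast has the highest entropy two outcomes allow, yet it is exactly right about a fair coin. The number reports spread, and nothing in it says whether the spread comes from the world or from a gap in the model.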

If this felt weak: L18 (entropy) and L16 (probability). The fog machine: how much fog remains, not which road is right.

Section 4. Learning

Optimisation. These check that you can say why learning needs a loss and what the gradient and the loss curve actually tell you.

C2.10 mechanism · why a loss
Why does optimisation require a loss function? What would happen without one?
Reveal model answer
Strong answer demonstrates: the loss is the only signal the optimiser has and the thing that defines "better"; without it there is no gradient, no direction to move, and no notion of improvement.

Gradient descent moves parameters in the direction that lowers a single number, and that number is the loss. Without a loss there is nothing to take the gradient of, so there is no direction to step and no definition of what "improving" even means. The loss is what turns a vague goal ("be good at the task") into a quantity the optimiser can drive down.

Choosing the loss is choosing what the model is rewarded for getting right; it is the objective from L1 and L19, and for prediction it is the cross-entropy from L18, the surprise the model's distribution assigns to reality. Change the loss and you change what the model becomes. With no loss there is no learning at all, only a fixed function that never updates.
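
A minimal sketch of the dependency: a one-parameter model, a squared-error loss, and plain gradient descent. The target value 3.0, the learning rate, and the step count are arbitrary choices for the example.

```python
def loss(w):
    """Squared error of a one-parameter model against a target of 3.0 (toy choice)."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
history = [loss(w)]
for _ in range(50):
    w -= learning_rate * grad(w)   # the only information used is the loss's slope
    history.append(loss(w))

print(w, history[0], history[-1])
```

Delete `loss` and there is no `grad` to write; the update line has nothing to compute, and `w` stays where it started. Swap the target inside `loss` and the same loop drives `w` somewhere else, which is the sense in which changing the loss changes what the model becomes.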

If this felt weak: L19 (optimisation: the slope and valley) and L18 (entropy: cross-entropy as the loss).
C2.11 diagnosis · why a model fails to improve
Give two distinct mechanistic reasons a model might stop improving during training, and say how you'd tell them apart from the loss curve.
Reveal model answer
Strong answer demonstrates: for example a mis-set learning rate, a reached floor (saturation / the irreducible entropy of the data), a capacity or data limit, or a slow region of the landscape; each leaves a different signature on the loss curve.

Two among several: a mis-set learning rate, and a reached floor. If the learning rate is too large, steps overshoot and the loss bounces or diverges; too small, and it crawls, barely moving despite many steps. If the model has reached the irreducible (aleatoric) entropy of the data, or run out of capacity or signal, the loss flattens at a sensible non-zero value because there's little left to remove.

You tell them apart from the curve and a nudge. Oscillating or diverging loss points to the learning rate. A smooth flattening near a plausible floor points to saturation or a capacity/data limit. A long flat stretch that drops after a change (momentum, a learning-rate bump) points to a slow region such as a saddle. The gradient's magnitude is a clue: near-zero everywhere means a flat region; very large means instability.
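
The learning-rate signatures are easy to reproduce on a toy loss. The sketch below runs gradient descent on loss(w) = w² with three step sizes chosen for illustration. Because this toy has no noise, its floor is zero, so it shows the learning-rate cases only, not a plateau at a non-zero floor.

```python
def run(learning_rate, steps=20, w=1.0):
    """Gradient descent on loss(w) = w**2; returns the loss after each step."""
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2.0 * w   # the gradient of w**2 is 2w
        losses.append(w * w)
    return losses

too_large = run(1.1)      # every step overshoots further: the loss grows
too_small = run(0.001)    # the loss barely moves
reasonable = run(0.3)     # the loss falls quickly toward zero

print(too_large[-1], too_small[-1], reasonable[-1])
```

Plotting the three lists gives the shapes described above: a curve that climbs away, a curve that crawls, and a curve that drops and settles.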

If this felt weak: L19 (optimisation). Learning rate, minima and saddle points, and why local minima are rarely the trap.

Section 5. Scale

Parallelism and scaling. These check that you hold the limits honestly: gains diminish, hardware has ceilings, and efficiency matters as much as raw scale.

C2.12 mechanism · why scaling isn't unlimited
Why doesn't adding more compute produce unlimited gains? Name the mechanisms, not just the phrase "diminishing returns".
Reveal model answer
Strong answer demonstrates: two forces: diminishing returns in the learning (the easy structure is captured first, and there is an irreducible entropy floor), and limits on turning compute into work (the serial fraction and communication cost, plus the need for data and model size to grow together).

First, the learning itself diminishes. A model captures the most-predictable structure first, so each added increment of compute removes less of the remaining error than the last. And there is a floor: the irreducible (aleatoric) entropy of the data, which no amount of compute drops below. So the curve bends and flattens for reasons internal to the problem.

Second, you can't always convert compute into useful work. Parallelism has a serial fraction and communication overhead (L20), so more hardware yields less-than-proportional speedup; and compute, data, and model size have to grow together, or extra compute saturates a model that's too small or starves on data that's too thin. Unlimited input runs into both a learning floor and a hardware ceiling, so it cannot buy unlimited capability.
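
The serial-fraction limit has a standard closed form, Amdahl's law, which this lesson does not name but which matches the description above. The sketch assumes a 5% serial fraction purely for illustration and ignores communication overhead, which would lower the numbers further.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speedup when only the non-serial part is spread over workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

serial_fraction = 0.05   # assumed: 5% of the work cannot be parallelised
speedups = {n: amdahl_speedup(serial_fraction, n) for n in (1, 10, 100, 1000)}
ceiling = 1.0 / serial_fraction   # the limit as workers grow without bound

for n, s in speedups.items():
    print(n, round(s, 2))
print("ceiling", ceiling)
```

Going from 100 to 1000 workers adds far less than going from 1 to 10, and no worker count passes the ceiling set by the serial part.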

If this felt weak: L21 (scaling: the lever and machine), with L20 (parallelism) for the hardware ceiling and L18 (entropy)/L16 (probability) for the floor.
C2.13 transfer · the consequence of diminishing returns
Given diminishing returns, what becomes valuable beyond raw scale, and why does parallelism still matter so much?
Reveal model answer
Strong answer demonstrates: efficiency (better hardware, algorithms, architectures) becomes as valuable as more hardware because raw scale is expensive; parallelism still matters because it is the only way to deliver the enormous compute optimisation demands at all.

When each extra unit of capability from raw scale costs more than the last, anything that gets more capability from the same compute is worth as much as buying more hardware. That makes efficiency a first-class lever: better hardware does more arithmetic per watt, better algorithms reach a good solution in fewer steps, and better architectures get more capability per parameter. Each effectively shifts the scaling curve so a given capability costs less.

Parallelism still matters because the demand for compute is enormous regardless of the returns: optimisation specifies a colossal amount of independent matrix arithmetic, and parallelism is the only way to perform it in finite time. So parallelism makes the scale reachable at all, while efficiency makes the diminishing returns affordable. Modern progress is the combination, which is why "AI is just brute force" misses half the story.

If this felt weak: L20 (parallelism) and L21 (scaling). The conveyor belt delivers the compute; the lever's returns fade.

Section 6. Integrated reasoning

These are the most important questions in the calibration. Each one requires connecting across themes. If the short questions felt fine but these break, your themes are solid in isolation and loose where they meet, which is exactly the thing to fix before Phase 3.

[Figure: the integration test. Run this chain on one real prediction: represent (a point in a space) → predict (a distribution) → measure the error (entropy) → reduce it (optimisation) → at scale (parallel hardware). If any link in the chain breaks when you try it, that is the theme to revisit.]
FIG C2.2. The integrated reasoning map. The five themes assemble into one process: represent, predict, measure the error, reduce it, do it at scale. The hardest calibration questions ask you to run this whole chain on a single concrete system.
C2.14 integrated · a language model through all five themes
Explain a large language model using all five themes of the wall. Every theme should appear with its mechanism, not just its name.
Reveal model answer
Strong answer demonstrates: representation (token embeddings and their geometry), transformation (attention and feed-forward matrices, projections), uncertainty (the next-token distribution and its entropy), learning (cross-entropy minimised by gradient descent), and scale (parallel hardware on a diminishing-returns curve).

Representation: tokens become embedding vectors in a learned space whose geometry encodes similarity (L11–L13). Transformation: each layer applies matrices (attention's query/key/value projections and the feed-forward maps) that reshape and select within that space (L14–L15).

Uncertainty: the model's output at each step is a probability distribution over the whole vocabulary, and its entropy says how sure it is (L16–L18). Learning: training minimises cross-entropy, the surprise the distribution assigns to the real next token, by gradient descent walking the loss downhill (L18–L19). Scale: the whole thing runs across thousands of parallel accelerators (L20) and its capability follows a diminishing-returns scaling curve (L21). Five themes, one system, all present at once.

If this felt weak: S2 (synthesis), then whichever theme's station felt thin when you tried to name its mechanism.
C2.15 integrated · a recommender through the whole wall
Trace a recommendation system through the entire wall, theme by theme. Use a different system from the language model so the pattern, not the example, is what transfers.
Reveal model answer
Strong answer demonstrates: users and items as vectors (representation), matrix factorisation as a low-rank projection (transformation), a distribution over what you'll engage with and its uncertainty, a loss minimised by optimisation, and the scale of millions of users.

Representation: users and items become vectors in a shared space where closeness predicts affinity (L11–L13). Transformation: matrix factorisation, a low-rank projection, reshapes the sparse interaction data into those compact user and item vectors (L14–L15).

Uncertainty: the system outputs a distribution over candidate items, giving how likely you are to engage with each; a peaked distribution means a confident single guess, a spread one means it's unsure (L16–L18). Learning: it is fitted by minimising a loss over observed engagements via gradient descent (L19). Scale: served across millions of users and items, where parallelism makes ranking tractable and scaling governs how much more data and parameters help (L20–L21). The same five themes, different slots; if the pattern transfers, the wall has stuck.
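
The representation and uncertainty steps of the recommender can be sketched together: dot products between a user vector and item vectors give scores, and a softmax turns the scores into a distribution. The two-dimensional vectors and item names are invented, and using a softmax over dot products is one common choice assumed here, not the only way such systems are built.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical learned vectors in a shared user/item space.
user = (0.9, 0.1)
items = {"thriller": (1.0, 0.0), "romcom": (0.0, 1.0), "drama": (0.5, 0.5)}

names = list(items)
scores = [sum(u * v for u, v in zip(user, items[n])) for n in names]  # closeness
probs = dict(zip(names, softmax(scores)))                             # belief

best = max(probs, key=probs.get)
print(best, {n: round(p, 3) for n, p in probs.items()})
```

Here the distribution is fairly spread, so the top item is a weak preference, not a confident guess; training would adjust the vectors so that observed engagements receive more probability.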

If this felt weak: S2 (synthesis). If a theme didn't map onto the recommender, that station is the one to revisit.
C2.16 integrated · how uncertainty drives optimisation
Describe how uncertainty (theme 3) influences optimisation (theme 4). Why is the connection structural rather than incidental?
Reveal model answer
Strong answer demonstrates: the loss optimisation minimises is itself a measure of uncertainty: cross-entropy is the surprise of the model's distribution against reality, and gradient descent drives that surprise down toward the data's irreducible floor.

The connection is that the quantity optimisation minimises is a measure of uncertainty. A model's prediction is a distribution (theme 3); cross-entropy measures how surprised that distribution is by what actually happened, high when it put little weight on the truth and low when it expected it. That cross-entropy is the loss (theme 4), so each gradient step reshapes the distribution to be less surprised by reality, which means putting more probability on the outcomes that actually occur.

It is structural, not incidental, because without a measure of uncertainty there would be no loss to descend in the first place. And the floor from L16 matters: cross-entropy cannot drop below the data's own irreducible entropy, so optimisation closes the gap to that floor rather than reaching zero. Uncertainty supplies the ruler; optimisation supplies the motion along it.
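
The floor can be seen numerically. In the sketch below, `data` is a made-up true distribution; the cross-entropy of any model distribution against it is never below the entropy of `data` itself, which is the cross-entropy of the data against a perfect model.

```python
import math

def cross_entropy(p, q):
    """Expected surprise (in nats) of model q when outcomes follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

data = [0.7, 0.2, 0.1]                            # hypothetical true distribution

floor = cross_entropy(data, data)                 # the data's own entropy
poor = cross_entropy(data, [1 / 3, 1 / 3, 1 / 3]) # an uninformed model
better = cross_entropy(data, [0.6, 0.3, 0.1])     # a model closer to the data

print(round(poor, 4), round(better, 4), round(floor, 4))
```

Optimisation moves a model from something like `poor` toward `floor`; the remaining gap is what training can still remove, and the floor itself is what it cannot.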

If this felt weak: L18 (entropy), L19 (optimisation), and L16 (probability). The fog, the slope, and the irreducible floor.
C2.17 integrated · why scaling depends on representations and transformations
Explain why scaling depends on representations and transformations. Why can't you simply "add compute" to any system and expect gains?
Reveal model answer
Strong answer demonstrates: compute only buys gains if the architecture can use it: representations must have the capacity to encode more, and transformations must be the kind of operation parallel hardware can run; scaling rests on the lower themes.

Scaling adds compute, but compute only helps if there is something able to absorb it. Capacity comes first: extra compute lets you build a larger model only if larger representations (more dimensions, L13) can encode more structure; pour compute into a too-small representation and it saturates. So the payoff of scale is bounded by the representation's ability to hold more.

The operations matter too. The transformations are matrix multiplications (L14), independent arithmetic that parallel hardware accelerates (L20); that is why these architectures scale at all. A system whose core operation didn't parallelise, or whose representation couldn't grow, would gain little from more compute. So scaling is the reward for having representations that can hold more and transformations the hardware can run in parallel; "add compute to anything" fails because most things can't turn compute into capability the way matmul over a learned geometry can.
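
The independence that makes matrix multiplication parallel-friendly can be shown without any hardware: each output row depends on one input row and the shared right-hand matrix, so the rows can be computed in any order (or at the same time) and reassembled. The matrices and the helper `row_times_matrix` are illustrative.

```python
def row_times_matrix(row, matrix):
    """One output row of a matrix product; needs no other output row."""
    cols = range(len(matrix[0]))
    return [sum(row[k] * matrix[k][j] for k in range(len(row))) for j in cols]

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

# Computed top to bottom.
in_order = [row_times_matrix(r, B) for r in A]

# Computed bottom to top, as if by separate workers, then put back in place.
out_of_order = {}
for i in reversed(range(len(A))):
    out_of_order[i] = row_times_matrix(A[i], B)
reassembled = [out_of_order[i] for i in range(len(A))]

print(in_order == reassembled, in_order)
```

An operation where each step needed the previous step's result could not be split this way, which is the sense in which not every system can absorb more parallel compute.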

If this felt weak: L13 (dimensions), L14 (matrices), L20 (parallelism), L21 (scaling). Scale sits on top of capacity and matmul.
C2.18 integrated · the whole wall on one prediction
The integration test. Pick one concrete prediction a model makes and trace it through all five themes, from representation to the scale that made it possible.
Reveal model answer
Strong answer demonstrates: the full chain on one instance: represent the input, transform and select, produce a distribution, explain its error and the optimisation that shaped it, and the scale behind the model's competence.

Take an LLM predicting the next token after "The capital of France is". Representation: the prior tokens are embedding vectors in a learned space (L11–L13). Transformation: attention and feed-forward matrices reshape and combine them, projecting onto the directions that matter for the continuation (L14–L15). Prediction: the output is a distribution over the vocabulary, sharply peaked on "Paris", so low entropy and high confidence (L16–L18).

Why that distribution: training minimised cross-entropy against real text by gradient descent, which pulled the weight onto "Paris" because it followed that phrase overwhelmingly in the data (L18–L19). Why the model can do this at all: it was trained at a scale only parallel hardware makes possible, on a diminishing-returns curve where this much compute bought broad competence (L20–L21). One token, all five themes, no magic, just the chain.
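
The first four links of the chain fit in one toy loop. Everything here is a stand-in: a hand-written two-number `context_vector` replaces the embedded prompt, a single output matrix `W` replaces the whole network, and the three-word vocabulary, learning rate of 0.5, and 100 steps are arbitrary. Scale is the one theme a sketch this size cannot show.

```python
import math

vocab = ["Paris", "Lyon", "banana"]
context_vector = [1.0, 0.5]                  # representation (hypothetical)
W = [[0.0, 0.0] for _ in vocab]              # transformation: one row per token
target = vocab.index("Paris")                # what the training text actually says

def predict(W, h):
    """Matrix times vector, then softmax: a distribution over the vocabulary."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Surprise the distribution assigns to the real next token."""
    return -math.log(probs[target])

losses = []
for _ in range(100):
    probs = predict(W, context_vector)               # uncertainty
    losses.append(cross_entropy(probs, target))      # measure the error
    for i, row in enumerate(W):                      # learning: one gradient step
        error = probs[i] - (1.0 if i == target else 0.0)
        for j in range(len(row)):
            row[j] -= 0.5 * error * context_vector[j]

final = predict(W, context_vector)
print(round(losses[0], 3), round(losses[-1], 3), round(final[target], 3))
```

The loss starts at the surprise of a uniform guess over three tokens and falls as probability moves onto "Paris", which is the same story as the paragraph above at a size you can run by hand.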

If this felt weak: S2 (synthesis), and the weakest link in your own trace: that is the station to revisit before Phase 3.

Section 7. Reflection

This is where the calibration earns its name. You are not scoring yourself; you are sorting your answers into where the reasoning held and where it broke, and reading off what to do next.

ready for the machine room
  • You traced the integrated questions cleanly, with every theme appearing as a mechanism.
  • You read a model's output as a distribution with entropy, not a single answer.
  • You separated transforming a space (matrices) from selecting within it (projections).
  • You could say why scaling has limits without falling back on the phrase.
  • The five themes connected into one system without you re-opening the lessons.
worth revisiting before Phase 3
  • The integrated traces broke at the same join each time (often uncertainty → optimisation).
  • You confused confidence with correctness, or a distribution with its top answer.
  • You couldn't say why geometry enables generalisation, only that it does.
  • "Scaling" still meant "bigger is better" rather than a curve with a floor and a ceiling.
  • A specific station stayed fuzzy when you tried to explain it from memory.
[Figure: the readiness spectrum. This is a diagnosis, not a grade: no marks, no percentage, no pass or fail. Strong understanding: traced the integrative questions cleanly from memory. Mostly secure: short questions solid; a connection or two needed a moment. Needs review: themes clear alone, but the connections between them broke. Return to selected stations: specific stations stayed fuzzy; revisit before Phase 3. Most readers sit across two zones at once; the point is to know which stations to revisit, not to score yourself.]
FIG C2.3. The readiness spectrum. Sort your answers along it rather than totalling them. Where you sit tells you what to do next: nothing, a quick reconnection, a review of the connections, or a return to specific stations. It is diagnosis, not judgment.

Place yourself on the spectrum honestly. Most readers sit across two zones, strong on some themes and shaky on a connection or two. The map below turns "shaky" into "revisit this station".

If this broke | Revisit | Why it matters for Phase 3 and beyond
Geometry → generalisation, or distance (C2.1, C2.2) | L11 · L12 · L13 | Every later representation is a vector geometry; if "near means similar" isn't mechanism for you, embeddings and attention stay symbols.
Matrices vs projections (C2.4, C2.5, C2.6) | L14 · L15 | Architectures are stacks of transforms and projections; confusing reshape with discard hides where information is lost.
Distribution vs single answer, calibration (C2.7, C2.8) | L16 · L17 | Model confidence and uncertainty estimation in deployment depend on reading outputs as distributions, not labels.
What entropy tells you (C2.9) | L18 | Cross-entropy is the training loss for almost everything ahead; if entropy is fuzzy, the loss is fuzzy.
Why learning needs a loss / why it stalls (C2.10, C2.11) | L19 | All of training and scaling is optimisation; if the loss-and-gradient picture is loose, Phase 5 floats.
Why scaling has limits (C2.12, C2.13) | L20 · L21 | Phase 3 is the hardware that delivers the compute; the diminishing-returns story is what makes that hardware matter.
The integrated traces broke (C2.14–C2.18) | S2 synthesis | S2 is the worked example of connecting the themes. If the joins broke, re-walk the wall as one picture there.
Calibration · what counts as ready
Ready doesn't mean perfect. It means the five themes connect into one system in your head, and you can run that chain on a system you haven't seen. If you can trace any one of the integrated questions cleanly from memory, the rest will firm up as Phase 3 grounds it in hardware. Walk the wall one more time: name each of the eleven stations and its mechanism. The first one you can't picture is the first one to revisit.

What changes from here

Phase 2 explained the ideas. It ends on a demand it can't satisfy alone: this needs an enormous amount of arithmetic, run in parallel, and the cost lives in the hardware. Phase 3 walks through the heavy door into the machine room and takes up that hardware (CPUs, GPUs, memory, bandwidth, accelerators, servers, clusters, and datacentres) as the answer to the demand the wall created.

You don't need any of that yet to pass through this checkpoint. You need the wall to hold: representation, transformation, uncertainty, learning, scale, connected into one machine. If it does, the machine room will read as the silicon underneath ideas you already understand, not as a fresh pile of specifications.

Next station

The wall is behind you and the bench is clear. Through the heavy door is the server bay, where Phase 3 shows the machines that turn these foundations into working systems. Cross when the five themes hold together on their own; revisit the fuzzy stations first if they don't.