Phase 1 installed a worldview. Ten lessons, one synthesis, one connected stack. The danger after a phase like that is that the worldview feels solid because the words feel familiar, while the underlying causal chain is still loose.
Phase 2 puts vectors, matrices, probability, gradients, and entropy underneath everything you just built. If the conceptual model is still loose when the maths arrives, the maths floats. You'll know the symbols and lose the meaning. That's the failure mode this calibration is built to catch.
C1 is the point where you test, honestly, whether the worldview attached. Not whether you can recite it.
This is retrieval, not re-reading. Open the lesson, read the question, close everything else, write your answer. Then reveal the model answer and compare them.
The comparison is the work. You're looking for one of three signals: (1) you traced it cleanly, (2) you got the shape but missed a layer of the chain, (3) the chain broke and you had to guess. Each of those tells you something useful about what to do next.
A few practical notes.
The reason retrieval beats re-reading is mechanism. Re-reading produces a feeling of familiarity that masquerades as knowledge. Retrieval forces the actual operation: pulling the structure from memory, finding where it's stored, checking what fell out on the way. The act of struggling to remember strengthens the link more than passive review ever does.
The exercises are ordered by integration depth. Short mechanism checks first (single concept, single answer). Then causal tracing (a chain of layers). Then scenarios (apply the chain to a system you haven't seen). Then a compression task (the whole phase in one paragraph).
If you find yourself stuck on a question, that's data. Note where it broke and move on. The point is to find the weak link, not to power through it.
Single-concept questions. Each one tests whether a specific mechanism from Phase 1 is still operational in your head. Two or three sentences of explanation is enough. The reveal shows what a strong answer demonstrates, then offers a model version.
To predict the next token well, the model has to assign high probability to tokens that actually occur and low probability to ones that don't. Doing that across the training distribution requires capturing the regularities that produced the data: syntax, semantics, world facts, stylistic patterns.
Compression works the same way. Coding the data with the fewest bits requires giving short codes to likely sequences and long codes to unlikely ones. Both operations need an accurate probability distribution over sequences, and the model's parameters are what encode that distribution.
That's why a model trained on next-token cross-entropy is, at convergence, a learned compressor of its training distribution. The compression isn't a side effect. It's the same operation looked at from the other side: predict well = compress well.
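The equivalence can be checked with a few lines of arithmetic. The sketch below is a toy under stated assumptions: the data string and the two single-symbol models are made up for illustration, and the models ignore context entirely. It computes the ideal code length, the sum of -log2 p(symbol), under each model.

```python
import math

def code_length_bits(sequence, model):
    """Ideal code length in bits: sum of -log2 p(symbol) under the model."""
    return sum(-math.log2(model[s]) for s in sequence)

# Hypothetical data: a source that emits 'a' far more often than 'b'.
data = "aaaaaaab" * 4  # 28 'a' and 4 'b'

good_model = {"a": 7 / 8, "b": 1 / 8}  # matches the source frequencies
poor_model = {"a": 1 / 2, "b": 1 / 2}  # ignores the regularity

good_bits = code_length_bits(data, good_model)
poor_bits = code_length_bits(data, poor_model)

# Dividing by length gives cross-entropy per symbol, the training loss in bits.
print(good_bits / len(data), poor_bits / len(data))
```

The model that predicts the data better is the one that needs fewer bits to code it. The per-symbol figure is the cross-entropy, so lowering the loss and shortening the code are the same move.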
Test-set accuracy alone can't separate memorisation from abstraction, because the test set is drawn from the same distribution as the training set. A memoriser and an abstractor can both score 95% there.
The operational test is to probe outside the training distribution. Construct inputs that share the underlying structure of the training data but have different surface form: synonyms, paraphrases, novel combinations, shifted contexts, harder compositional cases. A model that abstracted the underlying mechanism handles these. A model that memorised the surface patterns collapses.
If accuracy stays high under genuine shift, the model has captured the structure that produced the data. If it craters, the model stored the examples rather than the rule. The training metric never tells you which one happened; only behaviour on inputs the training distribution didn't include does.
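A toy version of that probe, as a sketch only. The task (is this decimal string an even number), the lookup-table "memoriser", and the fixed fallback guess are all assumptions chosen to make the contrast visible. The in-distribution test here reuses the small training range, which exaggerates the overlap a real test set would have.

```python
def is_even(s):
    """Ground truth for the toy task."""
    return int(s) % 2 == 0

def build_memoriser(train_inputs, label_fn):
    """Stores every training example; falls back to a fixed guess when unseen."""
    table = {s: label_fn(s) for s in train_inputs}
    return lambda s: table.get(s, False)

def rule_model(s):
    """Stands in for a model that abstracted the rule: only the last digit matters."""
    return s[-1] in "02468"

def accuracy(model, inputs):
    return sum(model(s) == is_even(s) for s in inputs) / len(inputs)

train = [str(n) for n in range(100)]
in_dist = [str(n) for n in range(0, 100, 3)]    # same range as training
shifted = [str(n) for n in range(1000, 1100)]   # same rule, new surface form

memoriser = build_memoriser(train, is_even)

for name, model in (("memoriser", memoriser), ("rule", rule_model)):
    print(name, accuracy(model, in_dist), accuracy(model, shifted))
```

Both models are indistinguishable on the in-distribution set. Only the shifted set separates them: the rule holds, while the memoriser drops to the accuracy of its fallback guess.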
Two mechanisms are commonly behind what looks like emergence. The first is a representation-capacity threshold: below a certain parameter count, the model literally cannot represent the subskill (e.g. the circuit needed for 3-digit addition won't fit), so the optimiser can't find it no matter how long you train. Above the threshold, the optimiser finds a basin that implements the subskill, and capability appears to switch on.
The second is a measurement effect. Many benchmarks score exact-match: get the whole answer right or get zero. The underlying capability may be improving smoothly with scale, but the score stays at zero until the smooth probability of full correctness crosses the threshold that produces a registered "1". Change the metric to one that rewards partial progress (log-likelihood, per-token accuracy, partial credit) and the curve straightens out.
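The measurement effect is easy to reproduce numerically. The sketch below assumes, purely for illustration, that each token of an answer is correct independently with the same probability, and that the answer is 20 tokens long. Real models violate both assumptions, but the shape of the curve survives.

```python
def exact_match_prob(per_token_acc, answer_len):
    """Probability that every token is right, assuming independent tokens."""
    return per_token_acc ** answer_len

ANSWER_LEN = 20  # hypothetical answer length

for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    # Smooth metric on the left, all-or-nothing metric on the right.
    print(f"per-token {p:.2f} -> exact-match {exact_match_prob(p, ANSWER_LEN):.4f}")
```

Per-token accuracy climbs steadily across the whole range, while the exact-match score sits near zero for most of it and then rises steeply at the end. The capability is the same in both columns; only the metric differs.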
Both are mechanistic. Neither requires anything mystical. Calling it "emergence" without naming the mechanism is the failure mode this lesson is trying to prevent.
Gradient descent moves parameters in the direction that lowers the loss. The loss function is the only thing the optimiser sees of "what the model is for". Change the loss, and you change the direction the parameters get pushed.
Different objectives reward different internal structure. Cross-entropy on next-token prediction rewards a model that captures sequential statistics. Contrastive loss rewards a model that pulls similar inputs together in embedding space and pushes dissimilar ones apart. Reward maximisation in RL rewards a model that picks actions which produce high reward downstream. Same architecture, same data, different objectives: you get different representations and different capabilities, because the optimiser settled into different basins.
This is why "what was the training objective" is one of the first questions to ask about any AI system. Capability is downstream of the basin the optimiser found. The basin is downstream of the objective. Change the objective and the system is, internally, a different machine.
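The smallest possible demonstration swaps in a simpler case than the objectives named above: a one-parameter "model" that outputs a single constant, fitted to the same data under two different losses. The data, learning rates, and step count are illustrative assumptions.

```python
def sign(v):
    return (v > 0) - (v < 0)

def mse_grad(c, data):
    """Gradient of mean squared error with respect to the constant c."""
    return sum(2 * (c - x) for x in data) / len(data)

def mae_grad(c, data):
    """Subgradient of mean absolute error with respect to the constant c."""
    return sum(sign(c - x) for x in data) / len(data)

def fit_constant(data, grad_fn, lr, steps=5000, start=0.0):
    """Plain gradient descent on a single parameter."""
    c = start
    for _ in range(steps):
        c -= lr * grad_fn(c, data)
    return c

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical data with one outlier

c_mse = fit_constant(data, mse_grad, lr=0.1)   # settles at the mean
c_mae = fit_constant(data, mae_grad, lr=0.01)  # settles near the median

print(c_mse, c_mae)
```

Same parameter, same data, same optimiser. Squared error pulls the parameter to the mean (22), absolute error pulls it to the median (3). The only thing that changed was the loss, and the optimiser ended up somewhere else.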
An embedding is a vector of numbers, typically a few hundred to a few thousand dimensions, that represents an input as a point in a continuous space. The numbers are learned during training. They get tuned so that the training objective is satisfied.
If the objective rewards similar-meaning inputs being treated similarly (e.g. contrastive loss, masked-language-model loss), then "similar meaning" gets baked into the geometry as "nearby". Once that's true, dot product captures similarity, distance captures dissimilarity, and consistent semantic differences (singular vs plural, capital vs country) often show up as repeatable directions. The geometry isn't decoration. It's the form the model computes on.
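To make "nearby" and "repeatable direction" concrete, here is a sketch with hand-written three-dimensional vectors. These are not learned embeddings; the words and numbers are assumptions picked so the geometry is easy to read.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors after normalising length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical toy vectors, written by hand for illustration.
vectors = {
    "cat":  [1.0, 0.0, 0.0],
    "cats": [1.0, 0.0, 1.0],
    "dog":  [0.9, 0.2, 0.0],
    "dogs": [0.9, 0.2, 1.0],
    "car":  [0.0, 1.0, 0.0],
}

near = cosine(vectors["cat"], vectors["dog"])  # similar meaning, small angle
far = cosine(vectors["cat"], vectors["car"])   # unrelated, orthogonal here

# The singular-to-plural offset points the same way for both nouns.
plural_cat = sub(vectors["cats"], vectors["cat"])
plural_dog = sub(vectors["dogs"], vectors["dog"])
offset_alignment = cosine(plural_cat, plural_dog)

print(near, far, offset_alignment)
```

In a trained model nobody writes these numbers by hand. The objective pushes them into a shape where the same two checks, angle between points and alignment between offsets, carry meaning.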
Multi-layer questions. You're tracing a behaviour or a fact back through the causal chain. A strong answer touches each relevant layer of the stack: hardware, architecture, representation, optimisation, capability, failure surface. Skipping layers is the diagnostic signal.
GPUs were originally designed for graphics, which involves applying the same operation to millions of pixels independently. That means thousands of small arithmetic units running in parallel, fed by high-bandwidth memory. The defining property of the silicon is high throughput on parallel matrix multiplication.
Neural networks are mostly matrix multiplications. Forward and backward passes through a deep network are stacks of matmuls plus a few cheaper element-wise operations. On a CPU, those matmuls run roughly serially. On a GPU, they run roughly in parallel. The amount of training compute you can fit into the same wall-clock time jumps by one to two orders of magnitude.
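Why matmul parallelises so well is visible in a naive implementation. The sketch below is plain Python for readability, not how any accelerator library does it, and the function names are hypothetical.

```python
def matmul(a, b):
    """Naive matrix multiply on lists of lists.

    Each output cell depends only on one row of `a` and one column of `b`,
    so no cell has to wait for any other cell to be computed.
    """
    n, k, m = len(a), len(b), len(b[0])
    return [
        [sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
        for i in range(n)
    ]

def matmul_flops(n, k, m):
    """Operation count: one multiply and one add per term, n * m cells, k terms each."""
    return 2 * n * k * m

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
print(matmul_flops(4096, 4096, 4096))
```

The loop body runs the cells one after another here, but nothing in the arithmetic requires that order. Hardware with thousands of arithmetic units can compute many cells at once, which is the property the paragraph above is pointing at.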
That budget shift made deeper, wider models feasible to train. Architectures that fit the hardware (heavy on matmul, light on branching) won; architectures that didn't (recurrent nets with serial dependencies) lost ground. The transformer is the clearest example: attention is dominated by large matrix multiplications, which is precisely what tensor cores accelerate.
So the causal chain runs: parallel matmul throughput → larger training-time compute budgets → matmul-shaped architectures become dominant → frontier-scale models become trainable in finite time. Without the hardware property at the bottom, none of the layers above could exist at the scale they do.
Hardware: the model is large enough to encode broad linguistic and stylistic patterns because tensor-core accelerated matmuls and HBM-class memory bandwidth let it be trained at this scale.
Architecture: the model produces a next token from context using attention plus MLP layers. Nothing in the architecture consults an external truth source. Everything the model emits comes from its parameters.
Representation: the embedding geometry encodes "well-formed legal citation" as a region populated by plausible (year, court, party-name, page-number) patterns. Real and fictional citations of the right shape occupy roughly the same region. The geometry doesn't separate "exists" from "looks like it could exist".
Optimisation: the loss rewarded statistically likely next tokens, not true tokens. The optimiser settled into basins that produce fluent, well-formatted output. Truth-grounding was never part of the gradient signal.
Capability surface: the model is reliable in the interpolation region (common citations seen many times in training) and brittle on niche ones. At the brittle edge, the most-likely continuation is a plausible synthesis of nearby patterns. The output reads as confident because nothing in the generation process exposes a calibrated uncertainty signal; fluent assertion is the default at every position.
The hallucination is the joint output of all five layers behaving as designed. Fixing it requires changing the system, not the prompt: retrieval grounding, tool use, or human verification.
Without retrieval, the only source of facts the model has is its parameters. To answer correctly, the model has to have stored the fact somewhere in its weights and be able to surface it from context cues alone. This is unreliable on niche facts, because the gradient signal for any single fact during training was tiny and may have been dominated by neighbouring patterns.
Retrieval changes the input. A separate system pulls relevant documents from an external store and concatenates them into the model's context window. Now the answer doesn't have to be in the weights. It has to be readable from the context. Models are far better at copying and synthesising information that's present in front of them than at recalling it from parameters.
Mechanistically, the change happens at the input layer, not the architecture or the optimisation. The model itself is unchanged. What changed is the distribution of what's in context at inference time. The capability surface shifts because the brittle region (rare facts) is now serviced by the retrieval store instead of by parametric memory.
Failure modes shift accordingly. Retrieval introduces new ones: bad retrieval, stale corpus, contradictions between retrieved passages, prompt injection in retrieved documents. The system is more factual on average and brittle in different places.
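The input-layer change can be sketched in a few lines. Everything here is a stand-in: the document store is made-up placeholder text, the word-overlap ranking is a deliberately crude substitute for a real retriever, and the prompt template is an assumption. The model itself never appears, because the model is not what changes.

```python
def tokenize(text):
    return set(text.lower().split())

def retrieve(query, store, k=1):
    """Rank documents by crude word overlap with the query (toy retriever)."""
    q = tokenize(query)
    ranked = sorted(store, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, store, k=1):
    """Concatenate retrieved passages into the context ahead of the question."""
    passages = retrieve(query, store, k)
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical placeholder documents.
store = [
    "The workshop opens at nine on weekdays.",
    "Lesson eleven covers vectors.",
    "The bench is calibrated every spring.",
]

prompt = build_prompt("Which lesson covers vectors?", store)
print(prompt)
```

The answer is now readable from the prompt instead of having to be recalled from weights. The new failure modes are visible too: if `retrieve` ranks the wrong passage first, or the store is stale, the model is handed the wrong context and will read from it just as willingly.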
Take KV-cache as the worked example. During autoregressive generation, the transformer needs the keys and values from every previous token to compute attention for the next one. Recomputing them every step would mean redoing most of the forward pass on every token, which is unaffordable.
The hardware constraint: the bandwidth between accelerator memory (VRAM) and the compute units is the bottleneck. Recomputing means moving the same weights and activations over the bus on every token. The fix is to cache the K and V tensors in VRAM and reuse them. That converts re-computation into memory access, which the hardware does better.
That choice then creates new constraints. The KV cache grows linearly with context length, and for a 70B model at long context it can easily dwarf the model weights themselves. So the next generation of architectural moves (grouped-query attention, multi-query attention, sliding-window, latent attention) all attack the size of the cache, not the cleverness of attention. They are downstream of the same VRAM-and-bandwidth wall.
The shape of the architecture is being bent by the shape of the silicon. Read any new LLM paper and the constraint surface is visible in the design choices, if you know to look for it.
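The cache-size claim can be checked with back-of-envelope arithmetic. The configuration below is an illustrative assumption for a 70B-class model (80 layers, 64 attention heads, head dimension 128, 16-bit values, a 131,072-token context), not the published specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to hold K and V for every layer and every cached token."""
    # The leading 2 is for the two tensors (keys and values) stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 2 ** 30
SEQ_LEN = 131072  # assumed long-context length

# Assumed config: every attention head keeps its own K and V.
full_attention = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=SEQ_LEN)

# Same config with grouped-query attention: 8 shared K/V heads (assumed).
grouped_query = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# Weights at 16 bits per parameter, for comparison.
weights = 70e9 * 2

print(full_attention / GIB, grouped_query / GIB, weights / GIB)
```

Under these assumptions the full cache comes to 320 GiB for a single sequence, against roughly 130 GiB of weights. Sharing K and V across groups of heads cuts the cache to 40 GiB without touching the attention computation itself, which is why those variants target cache size.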
Hypothetical systems. You're applying the causal chain to a setup you haven't seen written up. The goal is to name the dominant constraint, predict the failure mode, and explain mechanistically why.
The dominant layer is the joint of representation and optimisation. The architecture is capable of implementing arithmetic in principle (transformers can encode addition and multiplication circuits given enough depth and the right training). The hardware is fine. The problem is what the optimisation actually shaped.
The training data contains far more text describing arithmetic than it does worked examples of arithmetic carried out correctly digit by digit. Cross-entropy on next-token prediction rewards producing fluent text about the procedure, because that's what's overwhelmingly present. It does not specifically reward producing the right digit in column three.
The model's internal representation of "multiply" is therefore much closer to the linguistic shape of the operation than to the actual digit-by-digit computation. When asked to describe the algorithm, it draws on the linguistic representation and succeeds. When asked to execute it, it has no faithful circuit to fall back on, so the output drifts.
The fix sits upstream of the model: chain-of-thought prompting forces the model to externalise the intermediate steps, tool use offloads the operation to a calculator, fine-tuning on long worked examples builds a denser arithmetic signal. Each works by changing either the input distribution or the gradient signal. They are the same problem attacked from different sides.
RLHF trains the model to maximise a learned reward function that's meant to capture human preference. The reward function is itself a model, trained on pairwise comparisons. Whatever signals the reward model picked up are what the policy gets pushed toward.
If the comparison data systematically rated hedged, caveated, refusing answers as "safer" than direct ones (because annotators were instructed to penalise confident incorrect output), the reward model will reward hedging. The optimiser will then drive the policy into a basin that hedges, even on cases where a direct answer is what the human actually wants.
This is the proxy-objective failure: the optimised quantity (reward-model score) is no longer a faithful stand-in for the underlying goal (human helpfulness). The optimiser doesn't know there's a gap. It optimises what it has.
Diagnostically, the failure is in the optimisation layer, but the root cause is upstream: the reward model was trained on data that conflated safety with hedging. Fixes have to address either the comparison data, the reward model, or the policy update (e.g. KL regularisation back toward the base model, or a different post-training objective entirely).
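The role of KL regularisation can be shown on a two-action toy. For the objective "expected reward minus beta times KL(policy || reference)", the maximising policy has a known closed form: proportional to the reference probability times exp(reward / beta). The actions, reference probabilities, and reward values below are assumptions made up for illustration.

```python
import math

def kl_regularised_policy(ref_probs, rewards, beta):
    """Maximiser of E_pi[reward] - beta * KL(pi || ref): pi is proportional to ref * exp(reward / beta)."""
    weights = {a: ref_probs[a] * math.exp(rewards[a] / beta) for a in ref_probs}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical base model: mostly answers directly.
ref = {"direct": 0.7, "hedged": 0.3}

# Hypothetical reward model that learned to prefer hedging.
reward = {"direct": 0.0, "hedged": 1.0}

strong_kl = kl_regularised_policy(ref, reward, beta=5.0)  # stays close to the base model
weak_kl = kl_regularised_policy(ref, reward, beta=0.1)    # chases the reward model

print(strong_kl["hedged"], weak_kl["hedged"])
```

With a strong penalty the policy stays near the base model and still answers directly most of the time. With a weak one it collapses almost entirely into hedging. The penalty limits how far a flawed reward model can drag the policy; it does not repair the reward model.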
Semantic search works by mapping the query to a vector and finding nearest neighbours in the document space. For that to give useful results, the embedding geometry has to encode "semantically similar" as "spatially close" in the region the query falls into.
The geometry is built by training. Wherever the training distribution was dense (common topics, common phrasings), the embedding has been pulled into a useful shape: clusters, directions, well-separated neighbours. Wherever the training distribution was sparse (rare domains, jargon, niche phrasing), the geometry is essentially unconstrained. Vectors in sparse regions land wherever incidental features of training happened to leave them, so their nearest neighbours carry no reliable semantic meaning.
So the system works in the dense region and degrades to near-random in the sparse region. This is the same mechanism as interpolation vs extrapolation for any learned function. The embedding model is doing exactly what it was trained to do; the region of input space it was trained on doesn't cover the query.
Fixes are domain-adaptation moves: fine-tune the embedding on in-domain data, or use a hybrid retrieval setup that falls back to lexical matching where the embedding can't carry the load.
Parameter count and architecture are necessary but not sufficient. They define the space of functions the model can in principle implement. They don't determine which function gets implemented. That's the optimiser's job, and the optimiser only sees the data and the loss.
If model A was trained with a much higher proportion of source code, with curated multi-step problem traces, or with code-specific objectives (fill-in-the-middle, repository-level context), the optimiser settled into basins that implement code-shaped circuits. If model B was trained with more mathematical text, more multi-step reasoning data, or post-trained with reasoning-focused RL, its basins implement different circuits.
Same skeleton, different musculature. The capability profile is a direct readout of the training mix and the objectives applied during pretraining and post-training.
This is the cleanest illustration of the second core law: optimisation shapes capability. Architecture sets what's reachable; the training signal decides what gets reached. Reading a model's behaviour is, in part, reading the training process backward.
One large question. Compress Phase 1 into something small enough to carry across the workshop. The point is the act of compression, not the polish of the output.
An example. Others are equally valid.
"Every modern AI system runs the same loop: input becomes a representation, the representation is shaped by an optimisation process driven by a loss signal, and the resulting output feeds back through the loss to update the parameters. The shape of that loop is set by a causal chain that runs upward from the hardware. The silicon decides which architectures are feasible at scale; the architecture decides which representations the model can build; the representation decides which optimisation basins are reachable; the basin the optimiser settles into decides what the system is capable of; and capability defines its own inverse, the failure surface. So the form of the input matters because the model only computes on what the representation can carry. The training objective matters because it's the only signal the optimiser has. The geometry of the learned representation matters because generalisation lives in how the unfamiliar inputs land near the familiar ones. And every architectural choice you see is downstream of a constraint somewhere lower in the stack."
An example sketch.
Two main panels. On the left, a vertical stack of six layers in this order, bottom to top: hardware, architecture, representation, optimisation, capability, failure surface. Upward arrows between each pair, labelled with the core law that connects them: hardware-shapes-architecture, representation-shapes-computation, optimisation-shapes-capability. The top boundary is jagged, illustrating the jagged perimeter from L10.
On the right, the L1 systems loop as four boxes in a square: input, representation, optimisation, output. Clockwise arrows around the loop, plus a feedback arrow from optimisation back to representation labelled "loss signal". A small annotation: "every system in the phase fits this shape; the slots are filled differently".
Bottom strip: the five core laws written out as one-line claims, with the law that connects each pair of layers in the left stack already labelled above. The fifth law (constraints shape systems) labels the whole diagram, not any single arrow.
What gets cut: any worked example (the LLM-vs-CNN-vs-RL table), tokenisation specifics, individual lesson titles, anything decorative. The diagram is the mechanism; the lessons are the worked examples that grounded it.
Figure C1.1 lays out the calibration bench as you'd see it from the workbench stop. The five core laws sit on the bench as instruments; the system under test sits in the middle; the causal chain runs through it, with tracing arrows showing where each law applies. The diagram is meant to be the picture you carry across to the whiteboard wall: not a summary of the lessons, but the shape you reason with when the maths arrives.
This is the section where the calibration earns its name. You're not scoring yourself. You're sorting your own answers into two piles: where the chain held, and where it broke. The signal is in which pile each question landed in.
If you landed mostly in the left pile, you're ready. The next phase will feel like the maths is naming things you already understand.
If you landed mostly in the right pile, that's useful too. The map below tells you which lessons to go back to for which weakness. Revisit them in the order of severity, not the order of the syllabus.
| If this broke | Revisit | Why |
|---|---|---|
| Prediction ↔ compression equivalence (C1.1) | L2 pattern → prediction → compression | The whole geometry argument later in the phase rests on this being one operation, not two. |
| Memorisation vs abstraction (C1.2) | L3 generalisation | If these still feel synonymous, the capability surface in L10 won't make mechanistic sense. |
| Emergence felt magical (C1.3) | L4 emergence | The phase-transition and measurement explanations are the antidote to mysticism here. |
| Objective choice felt cosmetic (C1.4, C1.11) | L5 learning paradigms, L6 RL fundamentals | If you can't see why the loss reshapes the model, the post-training questions in Phase 4 will float. |
| Embedding geometry felt abstract (C1.5, C1.12) | L7 representation, L9 embeddings | Phase 2's vector apparatus attaches to this directly. Without it, attention and similarity will be symbols only. |
| Hardware-to-architecture chain broke (C1.6, C1.9) | L1 system loop, L10 perimeter, S1 synthesis | The causal-chain spine of the phase. If this is loose, every other layer reads as floating. |
| Hallucination trace skipped layers (C1.7) | L10 current AI perimeter, S1 synthesis | The hallucination trace is the canonical worked example of the full stack. Re-walk it slowly. |
| Compression paragraph read as a list (C1.14, C1.15) | S1 synthesis | S1 is the model for connected reasoning across the phase. If your compression listed instead of connected, S1 is the worked example to study. |
Phase 2 is the whiteboard wall across the workshop. The bench is done. You step away from the instruments and pick up a marker.
The first thing on the board is vectors. Not because you need a refresher, but because every claim you made above (embedding geometry encodes similarity as distance, attention is a weighted similarity computation, gradient descent moves parameters in a direction) was, underneath, a vector claim. Phase 2 puts the algebra under the intuition.
Then matrices, because architectures are stacks of linear maps and you need to read those as algebraic objects. Then probability, because every "the model assigns higher likelihood to…" was a probability claim waiting for notation. Then entropy, because compression and prediction are quantitatively the same operation. Then gradients, because optimisation is just the geometry of a loss surface. Then computation and scaling, because hardware constraints become quantitative once you can count operations.
None of it will be new. All of it will be precise.
The bench is behind you. The whiteboard wall is across the workshop. Lesson 11 opens with vectors. Bring the geometric intuition you've already built; the apparatus is what you're picking up next.