PHASE 1 → 2 · CALIBRATION
C1 · calibration stop

Calibration. Mechanism, geometry, and constraint

Calibration assessment C1. Between Phase 1 and Phase 2. ~40 min with honest retrieval. Durability tier 1 (the checkpoint that decides whether the maths attaches to anything).

🔧
Memory palace · Central workbench · calibration stop
You pause at the bench. The instruments are laid out: the systems loop, embeddings, capability charts, optimisation diagrams, failure traces. Before crossing to the whiteboard wall, you check that you can still read the machine without referring to the lessons.
Core idea. Calibration is a diagnostic, not an exam. It asks one question: when you trace the five core laws through a real system, does the causal chain hold under your own reasoning, or does it break? You're checking your own model of the machine before the maths gets bolted to it.

Why this lesson exists

Phase 1 installed a worldview. Ten lessons, one synthesis, one connected stack. The danger after a phase like that is that the worldview feels solid because the words feel familiar, while the underlying causal chain is still loose.

Phase 2 puts vectors, matrices, probability, gradients, and entropy underneath everything you just built. If the conceptual model is still loose when the maths arrives, the maths floats. You'll know the symbols and lose the meaning. That's the failure mode this calibration is built to catch.

C1 is the point where you test, honestly, whether the worldview attached. Not whether you can recite it.

How to approach this

This is retrieval, not re-reading. Open the lesson, read the question, close everything else, write your answer. Then reveal the model answer and compare them.

The comparison is the work. You're looking for one of three signals: (1) you traced it cleanly, (2) you got the shape but missed a layer of the chain, (3) the chain broke and you had to guess. Each of those tells you something useful about what to do next.

A few practical notes.

The reason retrieval beats re-reading is mechanism. Re-reading produces a feeling of familiarity that masquerades as knowledge. Retrieval forces the actual operation: pulling the structure from memory, finding where it's stored, checking what fell out on the way. The act of struggling to remember strengthens the link more than passive review ever does.

The exercises are ordered by integration depth. Short mechanism checks first (single concept, single answer). Then causal tracing (a chain of layers). Then scenarios (apply the chain to a system you haven't seen). Then a compression task (the whole phase in one paragraph).

If you find yourself stuck on a question, that's data. Note where it broke and move on. The point is to find the weak link, not to power through it.

Section 1. Short mechanism checks

Single-concept questions. Each one tests whether a specific mechanism from Phase 1 is still operational in your head. Two or three sentences of explanation is enough. The reveal shows what a strong answer demonstrates, then offers a model version.

C1.1 mechanism · prediction ↔ compression
Explain why next-token prediction and compression are two views of one operation. Not the slogan; the mechanism.
Reveal model answer
Strong answer demonstrates: understanding that a good predictor is implicitly a good compressor because both require the same internal structure: a model of the underlying distribution.

To predict the next token well, the model has to assign high probability to tokens that actually occur and low probability to ones that don't. Doing that across the training distribution requires capturing the regularities that produced the data: syntax, semantics, world facts, stylistic patterns.

Compression works the same way. Coding the data with the fewest bits requires giving short codes to likely sequences and long codes to unlikely ones. Both operations need an accurate probability distribution over sequences, and the model's parameters are what encode that distribution.

That's why a model trained on next-token cross-entropy is, at convergence, a learned compressor of its training distribution. The compression isn't a side effect. It's the same operation looked at from the other side: predict well = compress well.
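The equivalence can be checked numerically. The sketch below is a minimal illustration, assuming a made-up four-symbol source (the probabilities and the names true_dist, good_model and uniform_model are hypothetical). An ideal code spends about -log2 q(x) bits on a symbol the model gives probability q(x), so the average code length is the cross-entropy between the data and the model.

```python
import math

# Hypothetical source: the true frequencies of four symbols.
true_dist = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Two hypothetical predictors of the same source.
good_model = dict(true_dist)                    # matches the source exactly
uniform_model = {s: 0.25 for s in true_dist}    # knows nothing about it


def cross_entropy_bits(p, q):
    """Expected bits per symbol when data drawn from p is coded with model q."""
    return -sum(p[s] * math.log2(q[s]) for s in p)


bits_good = cross_entropy_bits(true_dist, good_model)
bits_uniform = cross_entropy_bits(true_dist, uniform_model)

print(f"good model:    {bits_good:.2f} bits/symbol")
print(f"uniform model: {bits_uniform:.2f} bits/symbol")
```

The loss the predictor is trained on and the code length the compressor pays are the same number, so lowering one lowers the other.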

C1.2 mechanism · memorisation vs abstraction
A model gets 95% accuracy on its test set. What would you need to check to decide whether it's memorising or abstracting? Give the operational test, not the definition.
Reveal model answer
Strong answer demonstrates: the distinction lives in behaviour under distribution shift, not in the in-distribution metric.

Test-set accuracy alone can't separate the two, because the test set is drawn from the same distribution as the training set. A memoriser and an abstractor can both score 95% there.

The operational test is to probe outside the training distribution. Construct inputs that share the underlying structure of the training data but have different surface form: synonyms, paraphrases, novel combinations, shifted contexts, harder compositional cases. A model that abstracted the underlying mechanism handles these. A model that memorised the surface patterns collapses.

If accuracy stays high under genuine shift, the model has captured the structure that produced the data. If it craters, the model stored the examples rather than the rule. The in-distribution score never tells you which one happened; only behaviour on inputs the training distribution didn't include does.
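A toy version of the operational test, as a sketch only. The task (labelling integers as even or odd), the two stand-in models, and the input ranges are all assumptions chosen to make the contrast visible; real models sit somewhere between these two extremes.

```python
# Toy task (hypothetical): label an integer as even (0) or odd (1).
train_set = {x: x % 2 for x in range(20)}


def memoriser(x):
    """Stores the examples: looks the answer up, guesses 0 if unseen."""
    return train_set.get(x, 0)


def abstractor(x):
    """Stores the rule that generated the examples."""
    return x % 2


def accuracy(model, inputs):
    return sum(model(x) == x % 2 for x in inputs) / len(inputs)


in_distribution = list(range(20))     # same pool the training data came from
shifted = list(range(100, 120))       # same rule, unseen surface form

for name, model in [("memoriser", memoriser), ("abstractor", abstractor)]:
    print(name, accuracy(model, in_distribution), accuracy(model, shifted))
```

Both models are indistinguishable on the in-distribution pool. Only the shifted inputs separate them, which is the whole point of the probe.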

C1.3 mechanism · emergence
Explain why emergence at scale is mechanistic and not magical. Reference at least one specific mechanism behind apparent emergence.
Reveal model answer
Strong answer demonstrates: the appearance of sudden capability has at least two known mechanistic explanations: representation-capacity thresholds and discontinuous measurement functions.

Two mechanisms are commonly behind what looks like emergence. The first is a representation-capacity threshold: below a certain parameter count, the model literally cannot represent the subskill (e.g. the circuit needed for 3-digit addition won't fit), so the optimiser can't find it no matter how long you train. Above the threshold, the optimiser finds a basin that implements the subskill, and capability appears to switch on.

The second is a measurement effect. Many benchmarks score exact-match: get the whole answer right or get zero. The underlying capability may be improving smoothly with scale, but the score stays at zero until the smooth probability of full correctness crosses the threshold that produces a registered "1". Change the metric to one that rewards partial progress (log-likelihood, per-token accuracy, partial credit) and the curve straightens out.

Both are mechanistic. Neither requires anything mystical. Calling it "emergence" without naming the mechanism is the failure mode this lesson is trying to prevent.
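The measurement effect is easy to reproduce with arithmetic. The sketch below assumes a ten-token answer whose tokens are each correct independently with the same probability, which is a simplification; the accuracy values are invented to stand in for smooth improvement with scale.

```python
# Assumption: an answer is ANSWER_LEN tokens long and each token is correct
# independently with probability p, so exact match succeeds with p ** ANSWER_LEN.
ANSWER_LEN = 10


def exact_match_rate(per_token_acc, answer_len=ANSWER_LEN):
    return per_token_acc ** answer_len


# Hypothetical smooth improvement in per-token accuracy across model scales.
per_token_accs = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accs:
    print(f"per-token {p:.2f} -> exact match {exact_match_rate(p):.3f}")
```

Per-token accuracy climbs steadily, while the exact-match column sits near zero for the first few rows and then rises steeply. Same underlying capability, two very different-looking curves.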

C1.4 mechanism · why the objective matters
Why does the choice of training objective change what a model is good at, even with the same data and the same architecture?
Reveal model answer
Strong answer demonstrates: the objective is the only signal the optimiser has. It shapes which basins are reachable and therefore which representations the model builds.

Gradient descent moves parameters in the direction that lowers the loss. The loss function is the only thing the optimiser sees of "what the model is for". Change the loss, and you change the direction the parameters get pushed.

Different objectives reward different internal structure. Cross-entropy on next-token prediction rewards a model that captures sequential statistics. Contrastive loss rewards a model that pulls similar inputs together in embedding space and pushes dissimilar ones apart. Reward maximisation in RL rewards a model that picks actions which produce high reward downstream. Same architecture, same data, different objectives: you get different representations and different capabilities, because the optimiser settled into different basins.

This is why "what was the training objective" is one of the first questions to ask about any AI system. Capability is downstream of the basin the optimiser found. The basin is downstream of the objective. Change the objective and the system is, internally, a different machine.
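A deliberately tiny analogy, not an LLM: the "model" below is a single constant, the data is fixed, and only the loss changes. The data values and the brute-force search standing in for the optimiser are assumptions made for illustration.

```python
import statistics

# Same data for both objectives (made-up numbers with one outlier).
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def mse(c):
    return sum((x - c) ** 2 for x in data) / len(data)


def mae(c):
    return sum(abs(x - c) for x in data) / len(data)


def minimise(loss, lo=0.0, hi=100.0, steps=10_000):
    """Brute-force search over candidate constants; stands in for the optimiser."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=loss)


best_under_mse = minimise(mse)   # settles at the mean
best_under_mae = minimise(mae)   # settles at the median

print(best_under_mse, statistics.mean(data))
print(best_under_mae, statistics.median(data))
```

Squared error drags the parameter toward the outlier; absolute error ignores it. Same data, same one-parameter architecture, two different machines.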

C1.5 mechanism · embedding geometry
In two or three sentences, explain what a learned embedding actually is, mechanistically, and why operations on it (distance, direction, projection) can carry semantic meaning.
Reveal model answer
Strong answer demonstrates: an embedding is a learned coordinate in a vector space, and the geometry of that space encodes the structure the training objective rewarded.

An embedding is a vector of numbers, typically a few hundred to a few thousand dimensions, that represents an input as a point in a continuous space. The numbers are learned during training. They get tuned so that the training objective is satisfied.

If the objective rewards similar-meaning inputs being treated similarly (e.g. contrastive loss, masked-language-model loss), then "similar meaning" gets baked into the geometry as "nearby". Once that's true, dot product captures similarity, distance captures dissimilarity, and consistent semantic differences (singular vs plural, capital vs country) often show up as repeatable directions. The geometry isn't decoration. It's the form the model computes on.
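The operations named above can be sketched on hand-made vectors. Everything in the table below is invented for illustration: real embeddings are learned, not written by hand, and their directions are rarely this clean.

```python
import math

# Hand-made 3-d vectors for illustration only; real embeddings are learned
# and have hundreds or thousands of dimensions.
emb = {
    "cat":  [0.9, 0.1, 0.0],
    "cats": [0.9, 0.1, 1.0],
    "dog":  [0.8, 0.3, 0.0],
    "dogs": [0.8, 0.3, 1.0],
    "car":  [0.0, 1.0, 0.0],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


sim_cat_dog = cosine(emb["cat"], emb["dog"])   # nearby: similar meaning
sim_cat_car = cosine(emb["cat"], emb["car"])   # far apart: unrelated

# The singular -> plural offset is the same direction for both nouns.
plural_from_cat = sub(emb["cats"], emb["cat"])
plural_from_dog = sub(emb["dogs"], emb["dog"])

print(sim_cat_dog, sim_cat_car, cosine(plural_from_cat, plural_from_dog))
```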

Section 2. Causal tracing

Multi-layer questions. You're tracing a behaviour or a fact back through the causal chain. A strong answer touches each relevant layer of the stack: hardware, architecture, representation, optimisation, capability, failure surface. Skipping layers is the diagnostic signal.

C1.6 tracing · why GPUs mattered
Trace, from the hardware up, why GPUs were the precondition for the modern deep-learning era. The chain should go from a property of the silicon to a class of models that became possible.
Reveal model answer
Strong answer demonstrates: the chain from parallel matmul throughput through training-time compute budgets to architectures that are matmul-shaped to frontier-scale models.

GPUs were originally designed for graphics, which involves applying the same operation to millions of pixels independently. That means thousands of small arithmetic units running in parallel, fed by high-bandwidth memory. The defining property of the silicon is high throughput on parallel matrix multiplication.

Neural networks are mostly matrix multiplications. Forward and backward passes through a deep network are stacks of matmuls plus a few cheaper element-wise operations. On a CPU, those matmuls run roughly serially. On a GPU, they run roughly in parallel. The compute you can spend in the same wall-clock time jumps by one to two orders of magnitude.

That budget shift made deeper, wider models feasible to train. Architectures that fit the hardware (heavy on matmul, light on branching) won; architectures that didn't (recurrent nets with serial dependencies) lost ground. The transformer is the clearest example: attention is dominated by large matrix multiplications, which is precisely what tensor cores accelerate.

So the causal chain runs: parallel matmul throughput → larger training-time compute budgets → matmul-shaped architectures become dominant → frontier-scale models become trainable in finite time. Without the hardware property at the bottom, none of the layers above could exist at the scale they do.
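To put rough numbers on the bottom of that chain, the sketch below counts the arithmetic in one matmul and divides by two throughput figures. The layer shape and both throughputs are hypothetical, chosen only to show the scale of the gap, and are not measurements of any real chip.

```python
def matmul_flops(m, k, n):
    """An (m x k) @ (k x n) product: m*k*n multiply-adds, counted as 2 FLOPs each."""
    return 2 * m * k * n


# Hypothetical layer: 4096 token vectors through a 4096 -> 16384 projection.
flops = matmul_flops(4096, 4096, 16384)

# Hypothetical sustained throughputs (assumed, not benchmarked).
slow_flops_per_s = 1e11    # mostly serial execution
fast_flops_per_s = 1e13    # massively parallel execution

print(f"{flops:.3e} FLOPs for one projection")
print(f"slow: {flops / slow_flops_per_s:.3f} s   fast: {flops / fast_flops_per_s:.5f} s")
```

A training run repeats products like this an enormous number of times, so a two-orders-of-magnitude gap per matmul is the difference between a feasible run and one that never finishes.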

C1.7 tracing · hallucination, end to end
An LLM produces a confident, well-formatted citation that does not exist. Trace this through the full Phase 1 stack: hardware, architecture, representation, optimisation, capability surface. Every layer should appear.
Reveal model answer
Strong answer demonstrates: hallucination is a mechanism, not a bug. Each layer of the stack contributes.

Hardware: the model is large enough to encode broad linguistic and stylistic patterns because tensor-core-accelerated matmuls and HBM-class memory bandwidth let it be trained at this scale.

Architecture: the model produces a next token from context using attention plus MLP layers. Nothing in the architecture consults an external truth source. Everything the model emits comes from its parameters.

Representation: the embedding geometry encodes "well-formed legal citation" as a region populated by plausible (year, court, party-name, page-number) patterns. Real and fictional citations of the right shape occupy roughly the same region. The geometry doesn't separate "exists" from "looks like it could exist".

Optimisation: the loss rewarded statistically likely next tokens, not true tokens. The optimiser settled into basins that produce fluent, well-formatted output. Truth-grounding was never part of the gradient signal.

Capability surface: the model is reliable in the interpolation region (common citations seen many times in training) and brittle on niche ones. At the brittle edge, the most-likely continuation is a plausible synthesis of nearby patterns. The model is confident because no calibrated uncertainty signal is exposed; confidence is the default at every position.

The hallucination is the joint output of all five layers behaving as designed. Fixing it requires changing the system, not the prompt: retrieval grounding, tool use, or human verification.
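The capability-surface point, that the emitted text looks the same whether or not the model had real signal, can be sketched with a toy decoder. The vocabulary and logits below are invented; the only claim is structural: greedy decoding always emits a token, and the token itself carries no trace of how flat the distribution was.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def greedy_token(vocab, logits):
    """Return the most likely token and the entropy of the distribution behind it."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], entropy_bits(probs)


# Hypothetical candidates for the page-number slot of a citation.
vocab = ["112", "347", "598", "721"]

well_known = [9.0, 1.0, 0.5, 0.2]   # one continuation dominates
niche = [1.1, 1.0, 0.9, 1.0]        # nearly flat: no real signal

token_a, h_a = greedy_token(vocab, well_known)
token_b, h_b = greedy_token(vocab, niche)

print(token_a, round(h_a, 3))
print(token_b, round(h_b, 3))
```

Both cases print the same page number. The uncertainty exists inside the distribution, but nothing in the output text exposes it unless the system is built to surface it.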

C1.8 tracing · retrieval improves factuality
Adding a retrieval step (RAG) often improves factuality on knowledge-heavy tasks. Explain mechanistically why, in terms of where in the stack the change is happening.
Reveal model answer
Strong answer demonstrates: retrieval changes the input to the model, not its parameters. The mechanism for factuality moves from "remembered in weights" to "supplied in context".

Without retrieval, the only source of facts the model has is its parameters. To answer correctly, the model has to have stored the fact somewhere in its weights and be able to surface it from context cues alone. This is unreliable on niche facts, because the gradient signal for any single fact during training was tiny and may have been dominated by neighbouring patterns.

Retrieval changes the input. A separate system pulls relevant documents from an external store and concatenates them into the model's context window. Now the answer doesn't have to be in the weights. It has to be readable from the context. Models are far better at copying and synthesising information that's present in front of them than at recalling it from parameters.

Mechanistically, the change happens at the input layer, not the architecture or the optimisation. The model itself is unchanged. What changed is the distribution of what's in context at inference time. The capability surface shifts because the brittle region (rare facts) is now serviced by the retrieval store instead of by parametric memory.

Failure modes shift accordingly. Retrieval introduces new ones: bad retrieval, stale corpus, contradictions between retrieved passages, prompt injection in retrieved documents. The system is more factual on average and brittle in different places.
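A minimal sketch of where the change happens, assuming a word-overlap ranker as a stand-in for a real retriever. The document store, its contents (the bridge is fictional), and the prompt wording are all invented for illustration. No model is called, because the point is that only the input changes.

```python
# Hypothetical document store; the facts in it are invented.
store = [
    "The Hollow Creek bridge opened in 1974 and spans 212 metres.",
    "Grouped-query attention shares key and value heads across query heads.",
    "The Hollow Creek bridge was repainted in 2009.",
]


def tokens(text):
    return set(text.lower().replace(".", "").replace("?", "").split())


def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query (stand-in for a real retriever)."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, docs):
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


query = "When did the Hollow Creek bridge open?"
prompt = build_prompt(query, retrieve(query, store))
print(prompt)
```

The weights never appear in this code. The answer is now sitting in the context, and the new failure modes (a bad ranker, a stale store) are visible in the same few lines.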

C1.9 tracing · hardware constrains architecture
Pick one architectural choice in modern LLMs (examples: grouped-query attention, sliding-window attention, mixture-of-experts, KV-cache, low-precision inference). Trace it back to the specific hardware constraint that motivated it.
Reveal model answer
Strong answer demonstrates: architecture changes upstream of capability are usually downstream of a memory or bandwidth limit, not of pure algorithmic preference.

Take KV-cache as the worked example. During autoregressive generation, the transformer needs the keys and values from every previous token to compute attention for the next one. Recomputing them every step would mean redoing most of the forward pass on every token, which is unaffordable.

The hardware constraint: bandwidth between accelerator memory (VRAM) and the compute units is the bottleneck during generation. Recomputing means streaming the same weights and activations through the chip again for every earlier token, on every step. The fix is to cache the K and V tensors in VRAM and reuse them. That converts repeated computation into a read of stored tensors, which costs far less than redoing the work.

That choice then creates new constraints. The KV cache grows linearly with context length and with batch size, and for a 70B model serving long contexts it can rival or exceed the model weights themselves, especially when every attention head keeps its own keys and values. So the next generation of architectural moves (grouped-query attention, multi-query attention, sliding-window, latent attention) all attack the size of the cache, not the cleverness of attention. They are downstream of the same VRAM-and-bandwidth wall.

The shape of the architecture is being bent by the shape of the silicon. Read any new LLM paper and the constraint surface is visible in the design choices, if you know to look for it.
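The cache-size pressure can be checked with back-of-envelope arithmetic. Every number below is an assumption: an illustrative shape for a 70B-class model (80 layers, 64 query heads, head dimension 128), 2 bytes per value, one 128,000-token sequence. Actual models differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes for keys plus values across all layers (the leading 2 is K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


# Illustrative shape for a 70B-class model; every number here is an assumption.
N_LAYERS, HEAD_DIM, N_QUERY_HEADS = 80, 128, 64
SEQ_LEN = 128_000
WEIGHT_BYTES = 70e9 * 2     # 70B parameters at 2 bytes each

full_mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
grouped = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)               # 8 shared KV heads

GB = 1e9
print(f"weights:           {WEIGHT_BYTES / GB:.1f} GB")
print(f"cache, full MHA:   {full_mha / GB:.1f} GB")
print(f"cache, 8 KV heads: {grouped / GB:.1f} GB")
```

Under these assumed numbers, a single long sequence with one KV head per query head needs more cache than the weights occupy, and sharing KV heads cuts it eightfold. That is the pressure grouped-query attention responds to.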

Section 3. Scenario reasoning

Hypothetical systems. You're applying the causal chain to a setup you haven't seen written up. The goal is to name the dominant constraint, predict the failure mode, and explain mechanistically why.

C1.10 scenario · the arithmetic-failing LLM
An LLM scores well on language tasks but fails at 5-digit multiplication, even though it can describe the algorithm correctly. Which layer of the stack is the dominant failure source, and why?
Reveal model answer
Strong answer demonstrates: the failure lives in the representation and optimisation layers, not in "understanding". The model's geometry doesn't have a faithful arithmetic circuit because the training signal didn't shape one.

The dominant layer is the junction of representation and optimisation. The architecture is capable of implementing arithmetic in principle (transformers can encode addition and multiplication circuits given enough depth and the right training). The hardware is fine. The problem is what the optimisation actually shaped.

The training data contains far more text describing arithmetic than it does worked examples of arithmetic carried out correctly digit by digit. Cross-entropy on next-token prediction rewards producing fluent text about the procedure, because that's what's overwhelmingly present. It does not specifically reward producing the right digit in column three.

The model's internal representation of "multiply" is therefore much closer to the linguistic shape of the operation than to the actual digit-by-digit computation. When asked to describe the algorithm, it draws on the linguistic representation and succeeds. When asked to execute it, it has no faithful circuit to fall through to, so the output drifts.

The fixes all work around the missing circuit: chain-of-thought prompting forces the model to externalise the intermediate steps, tool use offloads the operation to a calculator, and fine-tuning on long worked examples builds a denser arithmetic signal. Each works by changing either the input distribution or the gradient signal, or by moving the computation outside the model. They are the same problem attacked from different sides.

C1.11 scenario · the over-hedging assistant
After RLHF, a previously direct model starts hedging excessively: too many caveats, refusals on benign requests, vague answers. Where is the optimisation mismatch?
Reveal model answer
Strong answer demonstrates: the optimised reward and the desired behaviour have diverged. The model learned what the reward model rewards, which is a proxy for what humans want, and the proxy slipped.

RLHF trains the model to maximise a learned reward function that's meant to capture human preference. The reward function is itself a model, trained on pairwise comparisons. Whatever signals the reward model picked up are what the policy gets pushed toward.

If the comparison data systematically rated hedged, caveated, refusing answers as "safer" than direct ones (because annotators were instructed to penalise confident incorrect output), the reward model will reward hedging. The optimiser will then drive the policy into a basin that hedges, even on cases where a direct answer is what the human actually wants.

This is the proxy-objective failure: the optimised quantity (reward-model score) is no longer a faithful stand-in for the underlying goal (human helpfulness). The optimiser doesn't know there's a gap. It optimises what it has.

Diagnostically, the failure is in the optimisation layer, but the root cause is upstream: the reward model was trained on data that conflated safety with hedging. Fixes have to address either the comparison data, the reward model, or the policy update (e.g. KL regularisation back toward the base model, or a different post-training objective entirely).
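The role of KL regularisation can be sketched on a three-option toy policy. The closed form used here, base policy times exp(reward / beta) and then normalised, is the standard maximiser of expected reward minus beta times KL(policy || base). The response styles, base probabilities, and reward-model scores are hypothetical.

```python
import math

# Hypothetical base policy over three response styles.
base_policy = {"direct": 0.6, "balanced": 0.3, "hedged": 0.1}

# Hypothetical reward-model scores: the proxy learned to prefer hedging.
proxy_reward = {"direct": 0.0, "balanced": 0.5, "hedged": 1.0}


def kl_regularised_policy(base, reward, beta):
    """Maximiser of E[reward] - beta * KL(policy || base)."""
    weights = {a: base[a] * math.exp(reward[a] / beta) for a in base}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


strong_kl = kl_regularised_policy(base_policy, proxy_reward, beta=10.0)
weak_kl = kl_regularised_policy(base_policy, proxy_reward, beta=0.1)

print({a: round(p, 3) for a, p in strong_kl.items()})
print({a: round(p, 3) for a, p in weak_kl.items()})
```

With a strong KL term the policy stays close to the direct base model. With a weak one it collapses onto whatever the proxy scores highest. The optimiser is doing its job in both cases; the gap is in the reward.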

C1.12 scenario · the brittle embedding
A semantic-search system works well on common queries but returns near-random results on rare or domain-specific ones. Which Phase 1 mechanism is failing?
Reveal model answer
Strong answer demonstrates: rare queries land in sparse regions of the embedding space where neighbour structure is unreliable. The geometry is well-shaped where the training data was dense and ill-shaped where it was sparse.

Semantic search works by mapping the query to a vector and finding nearest neighbours in the document space. For that to give useful results, the embedding geometry has to encode "semantically similar" as "spatially close" in the region the query falls into.

The geometry is built by training. Wherever the training distribution was dense (common topics, common phrasings), the embedding has been pulled into a useful shape: clusters, directions, well-separated neighbours. Wherever the training distribution was sparse (rare domains, jargon, niche phrasing), the geometry is essentially unconstrained. Vectors in sparse regions land wherever the pull of loosely related training examples happened to leave them, so their nearest neighbours carry little semantic meaning.

So the system works in the dense region and degrades to near-random in the sparse region. This is the same mechanism as interpolation vs extrapolation for any learned function. The embedding model is doing exactly what it was trained to do; the region of input space it was trained on doesn't cover the query.

Fixes are domain-adaptation moves: fine-tune the embedding on in-domain data, or use a hybrid retrieval setup that falls back to lexical matching where the embedding can't carry the load.
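One way the hybrid fallback could look, as a sketch under stated assumptions. The embedding scorer is a hard-coded stand-in (toy_embed_score) and the 0.5 threshold is arbitrary. In a real system a low similarity score is only a heuristic signal that the query fell in a sparse region, and the function names here are hypothetical.

```python
def lexical_score(query, doc):
    """Fraction of query words that appear verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def hybrid_search(query, docs, embed_score, min_confidence=0.5):
    """Use the embedding ranking unless its best score is weak, then fall back to lexical."""
    best_doc = max(docs, key=lambda d: embed_score(query, d))
    if embed_score(query, best_doc) >= min_confidence:
        return best_doc, "embedding"
    return max(docs, key=lambda d: lexical_score(query, d)), "lexical"


docs = ["how to reset a password", "torque spec for m8 flange bolt"]

# Stand-in for a real embedding model: confident on the common query only.
toy_scores = {("reset my password", docs[0]): 0.9}


def toy_embed_score(query, doc):
    return toy_scores.get((query, doc), 0.2)


common_hit = hybrid_search("reset my password", docs, toy_embed_score)
rare_hit = hybrid_search("m8 flange bolt torque", docs, toy_embed_score)

print(common_hit)
print(rare_hit)
```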

C1.13 scenario · two models, same parameter count
Two LLMs have the same parameter count and the same architecture family. One is better at code, the other is better at reasoning. What's the most likely cause, and which Phase 1 law does it illustrate?
Reveal model answer
Strong answer demonstrates: parameter count and architecture set the ceiling of what's reachable; the training data and objective decide what actually gets built. This is "optimisation shapes capability" in clean form.

Parameter count and architecture are necessary but not sufficient. They define the space of functions the model can in principle implement. They don't determine which function gets implemented. That's the optimiser's job, and the optimiser only sees the data and the loss.

If model A was trained with a much higher proportion of source code, with curated multi-step problem traces, or with code-specific objectives (fill-in-the-middle, repository-level context), the optimiser settled into basins that implement code-shaped circuits. If model B was trained with more mathematical text, more multi-step reasoning data, or post-trained with reasoning-focused RL, its basins implement different circuits.

Same skeleton, different musculature. The capability profile is a direct readout of the training mix and the objectives applied during pretraining and post-training.

This is the cleanest illustration of the second core law: optimisation shapes capability. Architecture sets what's reachable; the training signal decides what gets reached. Reading a model's behaviour is, in part, reading the training process backward.

Section 4. Compression

One large question. Compress Phase 1 into something small enough to carry across the workshop. The point is the act of compression, not the polish of the output.

C1.14 compression · one paragraph
Compress Phase 1 into one paragraph. Constraint: it must contain the systems loop, the causal chain, and at least three of the five core laws as operational claims, not as named items.
Reveal model answer
Strong answer demonstrates: the ability to write the phase as connected mechanism, not as a list. Look for whether your version reads as one machine or as a row of components.

An example. Others are equally valid.

"Every modern AI system runs the same loop: input becomes a representation, the representation is shaped by an optimisation process driven by a loss signal, and the resulting output feeds back through the loss to update the parameters. The shape of that loop is set by a causal chain that runs upward from the hardware. The silicon decides which architectures are feasible at scale; the architecture decides which representations the model can build; the representation decides which optimisation basins are reachable; the basin the optimiser settles into decides what the system is capable of; and capability defines its own inverse, the failure surface. So the form of the input matters because the model only computes on what the representation can carry. The training objective matters because it's the only signal the optimiser has. The geometry of the learned representation matters because generalisation lives in how the unfamiliar inputs land near the familiar ones. And every architectural choice you see is downstream of a constraint somewhere lower in the stack."

C1.15 compression · one diagram
If you had to draw Phase 1 as one diagram on a single whiteboard, what would be on it? Sketch it in words: what boxes, what arrows, what labels, what gets cut. The cut decisions are part of the answer.
Reveal model answer
Strong answer demonstrates: understanding which elements are load-bearing and which are decoration. A good answer cuts the lesson-specific examples and keeps the mechanism.

An example sketch.

Two main panels. On the left, a vertical stack of six layers in this order, bottom to top: hardware, architecture, representation, optimisation, capability, failure surface. Upward arrows between each pair, labelled with the core law that connects them: hardware-shapes-architecture, representation-shapes-computation, optimisation-shapes-capability. The top boundary is jagged, illustrating the jagged perimeter from L10.

On the right, the L1 systems loop as four boxes in a square: input, representation, optimisation, output. Clockwise arrows around the loop, plus a feedback arrow from optimisation back to representation labelled "loss signal". A small annotation: "every system in the phase fits this shape; the slots are filled differently".

Bottom strip: the five core laws written out as one-line claims, with the law that connects each pair of layers in the left stack already labelled above. The fifth law (constraints shape systems) labels the whole diagram, not any single arrow.

What gets cut: any worked example (the LLM-vs-CNN-vs-RL table), tokenisation specifics, individual lesson titles, anything decorative. The diagram is the mechanism; the lessons are the worked examples that grounded it.

The calibration bench

Figure C1.1 lays out the calibration bench as you'd see it from the workbench stop. The five core laws sit on the bench as instruments; the system under test sits in the middle; the causal chain runs through it, with tracing arrows showing where each law applies. The diagram is meant to be the picture you carry across to the whiteboard wall: not a summary of the lessons, but the shape you reason with when the maths arrives.

Figure C1.1 · the calibration bench. System under test in the middle · 5 laws as instruments · tracing arrows show what each instrument reads.
Centre, the six-layer stack (drawn top to bottom; the causal chain runs upward):
  • failure surface: where it breaks · jagged perimeter
  • capability: what it reliably does · the capable region
  • optimisation: gradient descent · basin selection · signal density
  • representation / geometry: learned vector space · embeddings · clusters · directions
  • architecture: matmul-shaped models · attention · MLP · norm · residual
  • hardware: VRAM · bandwidth · matmul · interconnect · quantisation
The bench, instruments arranged for inspection:
  • law 1: representation shapes computation
  • law 2: optimisation shapes capability
  • law 3: hardware shapes architecture
  • law 4: geometry enables generalisation
  • law 5: constraints shape systems
Diagnostic readout (if any light is off: revisit, don't bluff):
  • trace runs cleanly through stack
  • 5 laws map to specific layers
  • failure mode traces to mechanism
  • optimisation ≠ capability ≠ task
  • geometry as operational substrate
  • causal chain reversible from any layer
System under test:
  • target: any modern AI system
  • method: walk the stack from below
  • read each layer; predict the next
  • trace failures backward through it
  • if a layer is opaque, the chain breaks
  • if a layer is fluent, the chain holds
  • calibration = chain still holds
One bench · five instruments · one system · one chain
FIG C1.1. The calibration bench. The system under test (centre) is any modern AI system laid out as the six-layer stack. The five core laws sit on the workbench as instruments, each probing the layer where it bites. The diagnostic readouts on the right are what you're checking for: trace runs cleanly, laws map to layers, failures trace to mechanism, optimisation isn't confused with capability, geometry is read as substrate, and the chain is reversible from any starting point. If any of those readouts is off, that's the lesson to revisit before the whiteboard wall.

Self-evaluation

This is the section where the calibration earns its name. You're not scoring yourself. You're sorting your own answers into two piles: where the chain held, and where it broke. The signal is which pile each question landed in.

ready for the whiteboard wall
  • You traced causality cleanly across most of the multi-layer questions.
  • You named the dominant constraint in each scenario without grasping for it.
  • You reconnected the 5 core laws to specific layers without re-reading.
  • Your compression paragraph reads as one machine, not a list.
  • Failure modes felt like a natural readout of the system, not a separate topic.
worth revisiting before Phase 2
  • The causal chain kept breaking at the same layer.
  • You confused memorisation with abstraction, or optimisation with capability.
  • You couldn't say why geometry mattered (only that it did).
  • "Emergence" still felt like magic when you tried to explain it.
  • The compression task produced a list rather than connected mechanism.

If you landed mostly in the first pile (ready for the whiteboard wall), you're ready. The next phase will feel like the maths is naming things you already understand.

If you landed mostly in the second pile (worth revisiting before Phase 2), that's useful too. The map below tells you which lessons to go back to for which weakness. Revisit them in the order of severity, not the order of the syllabus.

If this broke | Revisit | Why
Prediction ↔ compression equivalence (C1.1) | L2 pattern → prediction → compression | The whole geometry argument later in the phase rests on this being one operation, not two.
Memorisation vs abstraction (C1.2) | L3 generalisation | If these still feel synonymous, the capability surface in L10 won't make mechanistic sense.
Emergence felt magical (C1.3) | L4 emergence | The phase-transition and measurement explanations are the antidote to mysticism here.
Objective choice felt cosmetic (C1.4, C1.11) | L5 learning paradigms, L6 RL fundamentals | If you can't see why the loss reshapes the model, the post-training questions in Phase 4 will float.
Embedding geometry felt abstract (C1.5, C1.12) | L7 representation, L9 embeddings | Phase 2's vector apparatus attaches to this directly. Without it, attention and similarity will be symbols only.
Hardware-to-architecture chain broke (C1.6, C1.9) | L1 system loop, L10 perimeter, S1 synthesis | The causal-chain spine of the phase. If this is loose, every other layer reads as floating.
Hallucination trace skipped layers (C1.7) | L10 current AI perimeter, S1 synthesis | The hallucination trace is the canonical worked example of the full stack. Re-walk it slowly.
Compression paragraph read as a list (C1.14, C1.15) | S1 synthesis | S1 is the model for connected reasoning across the phase. If your compression listed instead of connected, S1 is the worked example to study.

calibration · what counts as ready
Ready doesn't mean perfect. It means the conceptual machine has somewhere for the maths to attach. If you can trace any one of the questions above cleanly from your own head, the rest will fall into place as the apparatus arrives. The maths is not new content. It's the language for what you already understand.

What changes from here

Phase 2 is the whiteboard wall across the workshop. The bench is done. You step away from the instruments and pick up a marker.

The first thing on the board is vectors. Not because you need a refresher, but because every claim you made above (embedding geometry encodes similarity as distance, attention is a weighted similarity computation, gradient descent moves parameters in a direction) was, underneath, a vector claim. Phase 2 puts the algebra under the intuition.

Then matrices, because architectures are stacks of linear maps and you need to read those as algebraic objects. Then probability, because every "the model assigns higher likelihood to…" was a probability claim waiting for notation. Then entropy, because compression and prediction are quantitatively the same operation. Then gradients, because optimisation is just the geometry of a loss surface. Then computation and scaling, because hardware constraints become quantitative once you can count operations.

None of it will be new. All of it will be precise.

Next station

The bench is behind you. The whiteboard wall is across the workshop. Lesson 11 opens with vectors. Bring the geometric intuition you've already built; the apparatus is what you're picking up next.