Systems synthesis. The loop, the geometry, the constraint surface
Synthesis lesson S1. End of Phase 1. ~22 min read + cards + retrieval. Durability tier 1 (bedrock; the compressed shape of Phase 1).
🧩
Memory palace · Bench · synthesis walk
The whole bench, end to end. You walk all 10 stations in one breath (reading lamp, blank page, folded map, dust in sunlight, toolbox, maze on the wall, mirror, spool of solder, compass, dust cover) and see them as one machine.
Core idea. Modern AI is an optimisation system operating over learned representations, shaped by geometry, constrained by hardware, with capabilities and failure modes that emerge from those pressures interacting. Every Phase 1 lesson was one slice of that statement.
Why this lesson exists
You've walked the bench. Ten stations, ten lessons, ten angles on a single picture. This lesson takes the angles and lays them on top of each other. It introduces no major new mechanism. It does the work of repacking what Phase 1 unpacked, so that what survives in your memory is not ten separate facts but one connected mental model.
Synthesis is not recap. The point isn't to summarise the ten lessons. The point is to make the connections between them visible, so the next time you meet an AI system in the wild, you can read it as a single machine rather than as a list of components.
The loop, reappearing
In Lesson 1 you met the system loop: input → representation → optimisation → output → feedback. It looked abstract then because it had to. You see it everywhere now.
A vision classifier: the input is an image, the representation is the conv-layer activations, the optimisation is gradient descent on cross-entropy, the output is a class, the feedback is the labelled error signal. Same loop. A robot policy: the input is the state, the representation is the encoded state vector, the optimisation is policy gradient on reward, the output is an action, the feedback is the reward together with the environment's next state. Same loop. A frontier LLM: the input is a token sequence, the representation is the embedding-and-attention stack, the optimisation is gradient descent on next-token cross-entropy, the output is a token, the feedback is the next-token error. Same loop.
Recommender systems, retrieval pipelines, RLHF post-training, image generation, code completion: all the same loop. The slots are filled differently; the shape doesn't change. That's what Lesson 1 was installing. You can now spot the loop in any AI system on a single read.
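To make the slots concrete, here is a minimal sketch of the loop as a one-feature classifier trained by gradient descent on cross-entropy. The data, the function names and the learning rate are all illustrative assumptions, not anything taken from a real system; the point is only to show where each slot sits.

```python
import math

# Hypothetical toy data: one feature, label 1 when the feature is positive.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def represent(x, w, b):
    """Representation slot: the internal quantity the model computes on."""
    return w * x + b

def predict(z):
    """Output slot: turn the representation into a probability of class 1."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y):
    """Feedback slot: the labelled error signal."""
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def train(steps=200, lr=0.5):
    w, b = 0.0, 0.0
    history = []
    n = len(DATA)
    for _ in range(steps):
        grad_w = grad_b = loss = 0.0
        for x, y in DATA:                       # input
            p = predict(represent(x, w, b))     # representation -> output
            loss += cross_entropy(p, y)         # feedback
            grad_w += (p - y) * x               # gradient of the loss w.r.t. w
            grad_b += (p - y)                   # gradient of the loss w.r.t. b
        w -= lr * grad_w / n                    # optimisation
        b -= lr * grad_b / n
        history.append(loss / n)
    return w, b, history

if __name__ == "__main__":
    w, b, history = train()
    print(f"loss went from {history[0]:.3f} to {history[-1]:.3f}")
```

Swap the data, the representation function and the loss, and the same five slots describe a much larger system; only the contents change.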
The geometry, accumulating
Lesson 2 said prediction and compression are two views of one operation. Lesson 7 said representation is the form of input the system actually computes on. Lesson 9 said geometry is the operational substrate that makes downstream computation cheap.
These are not three separate claims. They are one claim viewed at different scales. A model that predicts well has compressed its inputs into a useful internal form. That form is a vector geometry. Operations on the geometry (distance, projection, similarity) are what produce the predictions. Improvements in the geometry are improvements in the prediction.
Lesson 8 inserts tokenisation as the first lossy compression of language into discrete units. Lesson 9 lifts those units into dense embedding vectors in a learned vector space where operational similarity becomes spatial proximity. Together they form the input side of the modern pipeline; everything downstream computes on the resulting geometry.
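A small sketch of that input side, under heavy simplification: a whitespace tokeniser stands in for a learned subword tokeniser, and a hand-written three-dimensional table stands in for a learned embedding matrix. Every value and name below is hypothetical. What it illustrates is that "similar" becomes "nearby" once text has been lifted into vectors.

```python
import math

# Hypothetical embedding table; real models learn these values in training.
EMBED = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
    "truck": [0.2, 0.1, 0.95],
}

def tokenise(text):
    """Crude whitespace tokeniser standing in for a learned subword one."""
    return text.lower().split()

def embed(text):
    """Look up one vector per token (tokens outside the table raise KeyError)."""
    return [EMBED[token] for token in tokenise(text)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Return the other word in the table whose vector is most similar."""
    query = EMBED[word]
    scored = [(cosine(query, vec), other)
              for other, vec in EMBED.items() if other != word]
    return max(scored)[1]

if __name__ == "__main__":
    for word in tokenise("Cat car"):
        print(word, "->", nearest(word))
```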
Optimisation does the work
Lesson 3 said generalisation is when representations transfer to inputs the system never saw. Lesson 4 said emergence is what happens when a representation crosses the capacity threshold for some subskill at scale. Lesson 5 said the learning signal determines what representations get built in the first place. Lesson 6 said reinforcement learning is the case where the signal is sparse and delayed, and credit has to be assigned across long trajectories.
These are four faces of one operation. Optimisation moves parameters in the direction that lowers the loss. The shape of the loss determines what optimisation can find. The signal density determines how much gradient information flows per step. The architecture determines which basins are reachable. The hardware determines which compute and memory budgets are available at training time.
Change any of those, and you change which basins are reachable. That is the mechanism behind both generalisation and emergence. It is also why some capabilities arrive sharply at scale (a new basin became reachable; the model can suddenly do 3-digit arithmetic) and others rise smoothly (the optimiser was sharpening an existing basin; translation quality keeps climbing).
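The basin picture can be sketched on a one-parameter toy loss. The double-well function below is an assumption chosen purely for illustration (real loss surfaces have vastly more dimensions), and the `tilt` parameter is a hypothetical stand-in for "the shape of the loss". Starting from the same point, a small change in that shape sends plain gradient descent into a different basin.

```python
def loss(x, tilt):
    """Toy double-well loss with minima near x = -1 and x = +1."""
    return (x * x - 1.0) ** 2 + tilt * x

def grad(x, tilt):
    """Derivative of the toy loss with respect to x."""
    return 4.0 * x * (x * x - 1.0) + tilt

def descend(start, tilt, lr=0.05, steps=500):
    """Plain gradient descent: step against the gradient, repeatedly."""
    x = start
    for _ in range(steps):
        x -= lr * grad(x, tilt)
    return x

if __name__ == "__main__":
    # Same starting point, two slightly different loss shapes.
    print("tilt +0.3 ends at", round(descend(0.0, tilt=0.3), 3))
    print("tilt -0.3 ends at", round(descend(0.0, tilt=-0.3), 3))
```

The optimiser is identical in both runs; only the surface changed, and with it the basin that was reachable from the starting point.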
The constraint surface
Lesson 10 drew the perimeter. Modern AI is jagged because capability is a product of: which training objective shaped the representation, which inductive bias the architecture provided, which hardware allowed the training, which deployment budget allows the serving, and which measurements the field uses to check the result.
Tasks aligned with all of those (translation, OCR, retrieval, code completion in popular APIs) sit firmly inside the capable region. Tasks outside the training distribution, or requiring grounding in the physical world, or requiring long-horizon planning, or requiring novel composition of subskills, sit at the brittle edge. The boundary doesn't follow human intuition because the boundary is set by the mechanism, not by what a human finds difficult.
Hallucination sits at this boundary. So does sim-to-real failure. So does the surprising frailty around letter-counting or multi-digit arithmetic. Each failure mode traces back through the stack to the specific mechanism that produced it.
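As one illustration of tracing a failure back through the stack, a commonly offered account of letter-counting frailty points at tokenisation: the model receives token IDs, not characters. The sketch below assumes a hypothetical two-entry vocabulary and a greedy longest-match tokeniser; real tokenisers and their vocabularies differ, and this is not a claim about any particular model.

```python
# Hypothetical vocabulary: subword pieces mapped to arbitrary integer IDs.
VOCAB = {"straw": 101, "berry": 202}

def tokenise(word, vocab):
    """Greedy longest-match tokenisation over a fixed vocabulary."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids

if __name__ == "__main__":
    word = "strawberry"
    print("characters:", list(word))
    print("letter count of 'r':", word.count("r"))
    print("what the model receives:", tokenise(word, VOCAB))
```

At the character level the count is a trivial operation. At the token level the letters are simply not in the input; any letter knowledge has to be learned indirectly from the training data.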
The five core laws, earned
You met each of these in the phase. The synthesis move is recognising them as one architecture.
compression · the 5 core laws, with their phase-1 anchors
Representation shapes computation. L7: whatever the representation can encode is what the model can compute over. Bad representation, no downstream operation recovers it.
Optimisation shapes capability. L4, L5: the optimiser settles into basins the signal can find. Different signal, different basins, different capability.
Hardware shapes architecture. Threaded throughout: tokenisation choices, matmul-shaped models, embedding-table sizes, KV cache budgets. The transformer became dominant in large part because its matmul-heavy computation runs fast on tensor cores.
Geometry enables generalisation. L9: embeddings encode similarity as distance and relationships as direction. Generalisation is what those distances let you do when the inputs are unfamiliar.
Constraints shape systems. The whole phase. Finite training data, finite compute, finite memory, finite wall-clock, finite real-world interaction budget. Every architecture and training choice is downstream of one of those.
You should now be able to recall the lesson where you first saw each law in action, not just recite the words. That's the difference between knowing them as slogans and knowing them as mechanism.
The causal chain
Trace it from below.
mechanism · hardware to failure modes
Hardware (what tensor cores accelerate, what VRAM holds, what bandwidth allows) determines architecture (matmul-heavy transformers, distilled small models, MoE routing). Architecture determines representation (which embedding space the model can build). Representation determines optimisation (which basins are reachable, which subskills can be learned). Optimisation determines capability (what the trained model can actually do reliably). Capability determines failure mode (the inverse of capability is where the system breaks).
Each layer constrains the one above it. Each layer carries the choices of the one below it. When you meet a new AI system, you can now read it from any layer and infer the ones around it. A model that runs in 100 ms on a phone tells you something about its training; a model that struggles with novel composition tells you something about its representation; a model that hallucinates confidently tells you something about its objective.
The compute spectrum, in this view
The same stack applies at every tier. At hyperscale, the constraints are about coordinating thousands of accelerators and managing the data pipeline. At workstation tier, the constraints are about fitting weights in VRAM and serving with usable latency. At edge tier, the constraints are about milliwatts and kilobytes. At microcontroller tier, the constraints become severe enough that whole architectures get replaced with quantised distillations of bigger ones.
The principles stay; the shape of the constraint surface changes. The same causal chain runs at every tier; it just hits different walls.
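A rough arithmetic sketch shows how differently the memory wall sits across tiers. It counts weight storage only (parameters times bits per parameter) and ignores activations, KV cache and runtime overhead, so treat the figures as lower bounds. The model sizes and precisions are illustrative assumptions.

```python
def weight_gigabytes(parameters, bits_per_parameter):
    """Storage for the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * bits_per_parameter / 8 / 1e9

# Illustrative configurations: (label, parameter count, bits per parameter).
CONFIGS = [
    ("70B at 16-bit", 70e9, 16),
    ("70B at 4-bit", 70e9, 4),
    ("1B at 8-bit", 1e9, 8),
    ("1B at 4-bit", 1e9, 4),
]

if __name__ == "__main__":
    for label, params, bits in CONFIGS:
        print(f"{label}: {weight_gigabytes(params, bits):.1f} GB of weights")
```

Under these assumptions the 16-bit 70B model needs about 140 GB for weights alone, while the 4-bit 1B model needs about half a gigabyte. Same stack, very different wall.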
What you can now do without any formal maths
A reader who finishes Phase 1 can:
Look at any AI system and identify which loop slot is being filled by what.
Distinguish capability emergence from measurement-shaped emergence on a worked example.
Predict whether a new architecture is responding to a hardware constraint or a training constraint.
Name the failure modes a given system is likely to exhibit, given its training objective.
Recognise which compute-spectrum tier a system is built for, and what that constrains.
That's the conceptual vocabulary the next six phases will run on. None of those abilities required maths formalism. They are reasoning patterns built on the mechanisms.
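The second ability on that list can be sketched numerically. Suppose per-token accuracy improves smoothly, but the benchmark only scores an answer as correct when every token is right. Assuming, purely for illustration, that tokens are independent and the answer is five tokens long, the all-or-nothing score looks like a sudden jump even though nothing sudden happened underneath.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length

def sweep(answer_length=5):
    """Pair each per-token accuracy (0.5 to 1.0) with its exact-match rate."""
    rows = []
    for step in range(5, 11):
        p = step / 10
        rows.append((p, exact_match_rate(p, answer_length)))
    return rows

if __name__ == "__main__":
    for p, score in sweep():
        print(f"per-token {p:.1f} -> exact match {score:.3f}")
```

The underlying skill climbs in even steps; the exact-match score stays near the floor and then rises steeply. That is measurement-shaped emergence: the sharpness lives in the metric, not in the model.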
The bridge to Phase 2
But every claim in Phase 1 is, underneath, a mathematical claim. Compression is a quantitative statement about how many bits are needed to describe a distribution. Generalisation is a statement about how a function behaves outside its training data. Geometry is the language of vectors, directions, distances, and projections. Optimisation is gradient descent over a loss surface. Each of those has been a conceptual placeholder; the next phase puts the apparatus in.
You've been reasoning around the maths for ten lessons. Phase 2 puts the maths underneath what you already understand. Vectors. Matrices. Probability. Entropy. Gradients. Loss landscapes. Parallelism. Compute scaling. Each is a tool the next five phases will use heavily. By the end of Phase 2, the loop and the geometry and the constraint surface will be expressible in two lines of notation, and you'll be able to read those lines without bluffing.
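As a small preview of that apparatus, here is a sketch of the "how many bits" idea using Shannon entropy. The function name and the example distributions are illustrative assumptions; Phase 2 is where the definition gets built properly.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: the average number of bits per outcome
    needed by an ideal code for this distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

if __name__ == "__main__":
    print("fair coin:", entropy_bits([0.5, 0.5]))
    print("four equally likely outcomes:", entropy_bits([0.25] * 4))
    print("heavily skewed coin:", round(entropy_bits([0.99, 0.01]), 3))
```

The more predictable the distribution, the fewer bits it takes to describe an outcome. That is the quantitative face of "prediction and compression are two views of one operation".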
The bench is complete. You step away. The whiteboard wall is across the workshop.
The integrated machine
Figure S1.1 lays the phase out as a single stack. The L1 system loop is on the right, evolved with the context Phase 1 added. The six layers on the left are the same machine viewed from below: hardware up through failure surface. The arrows are the causal chain; the five core laws label where each one bites.
FIG S1.1. Phase 1 compressed. Left: six layers of the stack, hardware up through failure surface, with the five core laws labelling where each one bites the system. Right: the L1 system loop redrawn with phase-1 context. Every AI system in the phase fits this shape, with the slots filled differently per system. Bottom: the causal chain in plain text. The diagram is not the phase; the phase is in the lessons. The diagram is the shape you carry across the workshop.
Where the bench ends
You stand at the dust cover. Looking back along the bench: ten stations, one machine. Looking across the workshop: the whiteboard wall, already filling with the maths Phase 2 will name.
The bench did its job. You can now think about modern AI mechanistically without bluffing. What you can't yet do is read the equations underneath. That's the next room.
Retrieval practice
These are deeper than the per-lesson questions. Trace mechanisms; don't summarise. Write your answer first, then compare it with the model answer.
S1 A frontier LLM is asked for a legal citation about a niche topic and produces a confident, well-formatted citation that does not exist. Trace this failure mechanistically through the full Phase 1 stack: hardware, architecture, representation, optimisation, capability surface. Each layer should appear in your answer.
Hardware: the model is large because tensor cores accelerate the matmuls a 70B-parameter transformer needs, and HBM bandwidth allowed the training. Without the hardware, the model wouldn't exist at this scale.
Architecture: the transformer's attention-and-MLP stack produces sequences of tokens by predicting one token at a time from context. There is no architectural component that consults a fact database; everything the model produces is internal.
Representation: the model's embedding geometry encodes "legal citation shape" (year, court, party names, page number) as a tight cluster of plausible-looking patterns. The geometry of "real citations" and "syntactically valid invented citations" is essentially identical from the model's internal viewpoint; both occupy similar regions of the latent space.
Optimisation: the training objective rewarded statistically likely next tokens, not true tokens. The optimiser settled into basins that produce fluent, well-formatted text; truth-grounding wasn't part of the signal. The model never had access to a ground-truth oracle during training.
Capability surface: the model is highly capable on common legal citations (interpolation region) and brittle on niche ones (extrapolation region). At the brittle edge, the most-likely tokens form a plausible but fictional answer. The output sounds equally confident across the whole surface because no calibrated uncertainty signal is exposed.
This is hallucination as a mechanism, not a mystery: every layer contributed, and the failure traces cleanly through the stack. The fix has to come from outside the model (retrieval grounding, tool use, or human verification) because the mechanism doesn't naturally produce truth-grounded output.
S1 Compare a 70B-parameter LLM running on a server cluster and a 1B-parameter distilled model running on a phone. Both are doing "language understanding". Using Phase 1 vocabulary only (no maths formalism), explain (a) what shifts in the stack across the two systems, (b) what stays constant, (c) where each system is likely to fail and why.
(a) What shifts. Hardware: server cluster has hundreds of GB of VRAM, fast interconnect, no real-time constraints; phone has a few GB of memory shared with the OS, an NPU with limited throughput, strict latency and thermal budgets. Architecture: the 70B model has the full attention stack at full precision; the 1B distilled model has fewer layers, narrower widths, aggressive quantisation (int8 or int4), and the attention pattern may be simplified (grouped-query, sliding window). Representation: the 70B model's embedding space is much higher-dimensional with finer-grained clusters and richer directions; the 1B model has a coarser geometry. Optimisation: the 70B model was trained at hyperscale with trillions of tokens; the 1B model was distilled from it on a curated subset, so it inherits an approximation of the 70B's geometry rather than building its own. Capability surface: the 70B model has a much broader capable region; the 1B model's perimeter is narrower and more jagged.
(b) What stays constant. The loop: input → representation → optimisation → output → feedback. The mechanism: prediction over learned geometry. The 5 core laws: both systems are downstream of representation, optimisation, hardware, geometry, and constraints. Hallucination is possible in both. Distribution shift hurts both. The reasoning patterns transfer; only the constraint set differs.
(c) Failures. 70B model: confident hallucination on niche topics, expensive to serve, long latency for complex prompts, scaling-related artefacts (lost-in-the-middle on very long context). 1B model: more frequent hallucination, more brittle on out-of-distribution input, weaker compositional reasoning, faster degradation under quantisation noise. The 1B model fails more often but its failures are easier to diagnose because the constraint set is more visible. The 70B model fails less often but its failures are harder to predict because the constraint set is hidden behind scale.
↳ Phase 2 Pick one Phase 1 claim that you now realise was secretly a mathematical claim. Explain why the conceptual version of the claim was enough to reason with in Phase 1, but what specifically Phase 2's maths apparatus (vectors, gradients, matrices, probability, entropy, scaling) will let you do that conceptual reasoning cannot.
An example answer (others are equally valid).
Phase 1 claim: "Embedding geometry encodes similarity as distance and relationships as direction" (L9).
Conceptual version: nearby vectors mean similar things; the king-queen-man-woman analogy works because gender and royalty came out as directions.
Why the conceptual version was enough for Phase 1: you didn't need to compute cosine similarity to understand that retrieval is a geometry problem, that semantic search returns nearby points, or that embedding collapse means everything mapped to one region. The qualitative reasoning let you predict that a poorly-trained contrastive system would produce uniform similarities, or that a multilingual embedding could enable cross-lingual retrieval.
What Phase 2 adds. Vectors give you formal definitions of dot product, projection, norm, and cosine similarity, so you can read attention as a specific operation: query-key dot product, scaled, softmaxed, weighted-summed against values. Matrices give you the apparatus to read embedding tables, projection layers, and attention weights as algebraic objects you can reason about. Probability and entropy let you formalise prediction as a distribution over next tokens, and quantify how compressed a representation is. Gradients let you reason about how the embedding geometry changes during training. Scaling intuition lets you predict where capability lives on the curve.
Without Phase 2, you can reason about modern AI mechanistically (which is most of what reading the field requires), but you can't read the equations a paper uses to make its claims precise. Phase 2 promotes your mental model from "I understand the machine conceptually" to "I can read the apparatus the machine is built out of." Those are different abilities, and the second one needs the maths.
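The attention operation named in that answer (query-key dot product, scaled, softmaxed, weighted-summed against values) can be sketched in a few lines of plain Python. This is a single-query, single-head illustration with made-up vectors and hypothetical function names; production implementations batch the work as matrix multiplications and add learned projections, masking and multiple heads.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # 1. Query-key dot products, scaled by the square root of the dimension.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # 2. Softmax over the scores.
    weights = softmax(scores)
    # 3. Weighted sum of the value vectors.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(width)]
    return output, weights

if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[5.0, 0.0], [0.0, 5.0]]   # first key points the same way as the query
    values = [[1.0], [0.0]]
    output, weights = attention(query, keys, values)
    print("weights:", [round(w, 3) for w in weights])
    print("output:", [round(o, 3) for o in output])
```

The key that points in the same direction as the query receives most of the weight, so the output is pulled towards that key's value. That is "similarity as geometry" doing computational work.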
Next station
The bench is complete. Cross the workshop to the whiteboard wall (Phase 2). Lesson 11 opens at the arrow on the board with vectors; the geometric intuition you've already built becomes the apparatus the rest of the course will run on.