Phase 2 teaches that modern AI systems are fundamentally geometric. The conceptual machine from Phase 1 gains a mathematical substrate: vectors, matrices, gradients, distributions, and the cost side of every operation that follows.
Lessons: L11–L21 + S2 + C2
Time: ~4 weeks
Builds: B2 gradient descent on a 2D surface (numpy)
Core laws established here: geometry enables generalisation (L9 + L11–L13); optimisation shapes capability (L18–L19)
The transformation
From vocabulary to geometric apparatus.
Phase 1 named representation, optimisation, signal, and constraint. Phase 2 turns each of those words into a thing you can sketch on a wall.
Representations become points in a vector space. Similarity becomes a dot product. Learning becomes a downhill walk on a high-dimensional surface. Belief becomes a distribution. Surprise becomes entropy. Parallelism becomes the cost model that gates everything Phase 3 introduces.
The maths is the minimum needed to read the rest of the course honestly. No proofs. No exam questions. Each piece earns its place because it appears later as a load-bearing primitive.
Phase 2 in one line
Modern AI systems are geometric systems. Meaning becomes spatial structure. Optimisation becomes movement through that structure. Capability becomes the shape of what the geometry can express.
Geometry as the substrate
Two pictures the rest of the course assumes.
The two figures below are the conceptual scaffolding Phases 3 through 7B all rest on. Left: the vector space that representations live in. Right: the optimisation landscape that learning moves through.
Fig 2 · A 2D projection of an embedding space. Similar things cluster; meaningful differences become parallel directions. The same operation (a dot product between two vectors) is the load-bearing primitive in retrieval (L59), attention (L40), and recommender ranking (L61).
Fig 3 · A 2D loss landscape with a gradient descent trajectory. Each step follows the negative gradient. In the high-dimensional spaces where AI training actually takes place, saddle points are abundant and bad local minima are rare, but the downhill intuition this picture provides still applies.
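The trajectory in Fig 3 can be reproduced in a few lines of numpy. What follows is a minimal sketch, not the B2 build itself: the bowl-shaped surface, the starting point, the learning rate, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def loss(p):
    # Hypothetical 2D surface: an elongated bowl with its minimum at (0, 0).
    x, y = p
    return x**2 + 3.0 * y**2

def grad(p):
    # Analytic gradient of the surface above.
    x, y = p
    return np.array([2.0 * x, 6.0 * y])

def gradient_descent(start, lr=0.1, steps=50):
    # Repeatedly step against the gradient and record every point visited.
    p = np.asarray(start, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p - lr * grad(p)
        path.append(p.copy())
    return np.array(path)

path = gradient_descent([2.0, -1.5])
losses = np.array([loss(p) for p in path])
print("start:", path[0], "loss:", losses[0])
print("end:  ", path[-1], "loss:", losses[-1])
```

On this convex surface the loss falls at every step. Raising `lr` far enough makes the walk overshoot along the steeper y direction, which is one way to see why step size matters.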
The 11 stations
The whiteboard wall, left to right.
Each station is a single sketch on the wall, anchored to one mathematical primitive. The wall reads as one picture by the time S2 closes the phase.
Embedding size budget (L59); VRAM accounting (L25); the curse and the blessing of high dimensions.
L14
Matrix · linear transform
Every layer is a matmul (L26, L35, L43). Tensor cores exist because of this.
L15
Projection · subspace · information selection
Attention as selection (L40); feature extraction and latent subspaces (L43, L59); dimensionality reduction and PCA (deep dive). Lossy compression that keeps task-relevant structure.
L16
Probability · uncertainty · distribution
Every classifier and language-model output is a distribution; confidence and calibration (L65); uncertainty estimation.
L17
Sampling · temperature · outcome space
Next-token decoding and generative models sample from distributions (L40, L43); temperature tunes output diversity; Monte Carlo and uncertainty estimation.
L18
Entropy · surprise · cross-entropy
Cross-entropy is the default loss in pretraining (L49), classification (L47), and evaluation (L65); entropy ties prediction to compression.
L19
Gradient · loss · optimisation landscape
The training loop everywhere: backprop (L36), SGD/Adam (L49), and the failure modes of optimisation.
L20
Parallelism · throughput · SIMD
Why GPUs took over (L23); the four ways to shard a model (L33); throughput-bound training.
L21
Compute · scaling · diminishing returns
Formal scaling laws (L51); the compute-data-parameter triangle (L50); why efficiency and hardware matter.
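The distribution, sampling, and entropy stations (L16–L18) fit together in one small numpy sketch. The logits, the temperature values, and the helper names below are assumptions made up for illustration; entropies are measured in bits.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Turn raw scores into a probability distribution; temperature rescales the scores.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def cross_entropy(p_true, q_pred):
    # Expected surprise, in bits, of predictions q_pred when outcomes follow p_true.
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    mask = p_true > 0
    return float(-(p_true[mask] * np.log2(q_pred[mask])).sum())

logits = np.array([2.0, 1.0, 0.1])      # hypothetical scores for three tokens
sharp = softmax(logits, temperature=0.5)
base = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=2.0)

rng = np.random.default_rng(0)
token = rng.choice(len(logits), p=base)  # sample one token index from the distribution

one_hot = np.array([1.0, 0.0, 0.0])      # the "correct" token as a distribution
print("entropy:", entropy(sharp), entropy(base), entropy(flat))
print("sampled token index:", token)
print("cross-entropy vs one-hot target:", cross_entropy(one_hot, base))
```

Lower temperature concentrates probability on the top-scoring token and lowers entropy; higher temperature flattens the distribution and raises it. Against a one-hot target, cross-entropy reduces to the negative log-probability the model assigned to the correct token.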
Phase 2 themes
Four ideas the wall reinforces.
Theme · 1
Geometry enables generalisation
Once representations live in a vector space, similar things sit near each other and meaningful differences become directions. That spatial structure is what makes generalisation to unseen inputs possible at all.
Theme · 2
Representations become spaces
The Phase 1 word "representation" becomes a concrete object: a point in a 768- or 4096-dimensional space. The shape of the space is the shape of what the system can express.
Theme · 3
Optimisation as movement
Learning becomes a downhill walk on the loss surface. The negative gradient gives the direction; step size is the learning rate; the trajectory is the training run. The picture carries over to high dimensions, where saddle points rather than bad local minima are the common obstacle.
Theme · 4
Meaning as spatial structure
Distance is similarity. Angles encode meaning. Clusters reveal categories the system found on its own. The reader leaves the wall with a working intuition for how meaning sits inside a model.
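The claims about angles and directions can be checked numerically. The sketch below uses hand-written 3-dimensional vectors, not real embeddings; the words, the coordinates, and the `cosine` helper are all made up so that the geometry is easy to see.

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors: the dot product of their unit-length versions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" invented for illustration; real ones have hundreds or thousands of dimensions.
emb = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "car":   np.array([0.1, 0.2, 0.9]),
    "king":  np.array([0.9, 0.1, 0.8]),
    "queen": np.array([0.9, 0.9, 0.8]),
    "man":   np.array([0.1, 0.1, 0.7]),
    "woman": np.array([0.1, 0.9, 0.7]),
}

# Similar things point in similar directions.
print("cat vs dog:", cosine(emb["cat"], emb["dog"]))
print("cat vs car:", cosine(emb["cat"], emb["car"]))

# A meaningful difference shows up as a shared direction between pairs.
royal_offset = emb["queen"] - emb["king"]
plain_offset = emb["woman"] - emb["man"]
print("offset alignment:", cosine(royal_offset, plain_offset))
```

The toy vectors are constructed so the two offsets are exactly parallel. In a trained embedding space such alignments are approximate, and they are something the training process finds rather than something written in by hand.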
Contrast pair (retention engineering)
Compute-bound vs memory-bound. Hold both.
L20 and L21 set up the contrast that becomes the spine of Phase 3. A workload is either limited by how fast the chip can compute (compute-bound) or by how fast data can move into the chip (memory-bound). The roofline model in L28 is the visual that makes the distinction operational.
Phase 2 sketches the cost model abstractly. Phase 3 names which specific hardware decisions push a workload into one regime or the other. This is the kind of paired mechanism the course surfaces explicitly, so the reader can tell the two regimes apart on sight.
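The compute-bound versus memory-bound distinction reduces to one ratio: arithmetic intensity, the number of floating-point operations performed per byte moved. The sketch below applies the standard roofline formula, attainable throughput = min(peak compute, bandwidth × intensity). The hardware figures are round hypothetical numbers, not a real chip, and the byte count assumes each matrix crosses the memory bus exactly once.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    # FLOPs per byte for an (m x k) @ (k x n) product.
    # Assumption: each of the three matrices is read or written exactly once.
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, bandwidth):
    # Roofline: throughput is capped by compute or by memory traffic, whichever is lower.
    return min(peak_flops, bandwidth * intensity)

def regime(intensity, peak_flops, bandwidth):
    # The ridge point peak_flops / bandwidth separates the two regimes.
    return "compute-bound" if intensity >= peak_flops / bandwidth else "memory-bound"

PEAK_FLOPS = 100e12  # hypothetical accelerator: 100 TFLOP/s
BANDWIDTH = 1e12     # hypothetical memory bandwidth: 1 TB/s

big = matmul_intensity(4096, 4096, 4096)  # large square matmul
gemv = matmul_intensity(1, 4096, 4096)    # a single vector times a matrix

for name, ai in [("4096^3 matmul", big), ("vector-matrix", gemv)]:
    print(name, round(ai, 2), "FLOP/byte ->", regime(ai, PEAK_FLOPS, BANDWIDTH))
```

Under these assumptions the large matmul reuses each loaded value many times and lands in the compute-bound regime, while the vector-matrix product touches every weight once and stays memory-bound. The same weights can therefore sit in either regime depending on how much work is batched against them.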
Core laws established in Phase 2
What lands here · what recurs later
Geometry enables generalisation. Established here with vectors, dot products, and dimensionality (L11–L13). Recurs in attention (L40, L41), embeddings in practice (L59), and the geometry of representation learning across Phase 4.
Optimisation shapes capability. Gradients and landscapes (L18–L19) make the mechanism concrete. Recurs as scaling laws (L51), as RLHF (L53), and as the diagnostic frame for training failure (S5, C5).
Constraints shape systems. Parallelism and compute scaling (L20–L21) name the cost side of every later architecture choice. Recurs through the whole of Phase 3 (the substrate) and Phase 5 (the cost of training).
Representation shapes computation. Reinforced from L7 by making representation a measurable geometric object. The shape of the embedding space becomes the shape of what the model can compute over.
Bridge to Phase 3
The maths runs on silicon.
Phase 2 closes with a cost model: matmul is the dominant operation, gradients are computed at every step, parallelism is how the operations get done in finite time. None of those facts yet name a chip.
Phase 3 names the chip. CPU, GPU, VRAM, tensor cores, memory hierarchies, the roofline. The same matmul that Phase 2 sketches on the wall becomes the operation tensor cores were designed to run, in the precision the silicon can sustain, against the bandwidth the memory hierarchy provides. The reader walks through the heavy door into the server bay knowing what the substrate has to be good at.
S2 reads the wall as one picture; C2 gates the move. If C2 doesn't stick, you walk the wall again before crossing through to the server bay.