Phase 2 · the whiteboard wall · 11 stations · S2 + C2

Mathematical & computational intuition.

Phase 2 teaches that modern AI systems are fundamentally geometric. The conceptual machine from Phase 1 gains a mathematical substrate: vectors, matrices, gradients, distributions, and the cost side of every operation that follows.

Lessons: L11–L21 + S2 + C2
Time: ~4 weeks
Builds: B2 gradient descent on a 2D surface (numpy)
Core laws established here: geometry enables generalisation (L9 + L11–L13); optimisation shapes capability (L18–L19)
The transformation

From vocabulary to geometric apparatus.

Phase 1 named representation, optimisation, signal, and constraint. Phase 2 turns each of those words into a thing you can sketch on a wall.

Representations become points in a vector space. Similarity becomes a dot product. Learning becomes a downhill walk on a high-dimensional surface. Belief becomes a distribution. Surprise becomes entropy. Parallelism becomes the cost model that gates everything Phase 3 introduces.

The maths is the minimum needed to read the rest of the course honestly. No proofs. No exam questions. Each piece earns its place because it appears later as a load-bearing primitive.

Phase 2 in one line

Modern AI systems are geometric systems. Meaning becomes spatial structure. Optimisation becomes movement through that structure. Capability becomes the shape of what the geometry can express.

Geometry as the substrate

Two pictures the rest of the course assumes.

The two figures below are the conceptual scaffolding Phases 3 through 7B all rest on. Left: the vector space that representations live in. Right: the optimisation landscape that learning moves through.

[fig 2 · embedding space (2D projection) · meaning as direction · similarity as angle · L11 · L12 · L13. Axes: dim 1 (of ~768), dim 2 (of ~768). Clusters: royalty / gender (man, woman, king, queen); animals (cat, dog, horse). Annotations: similarity = cos(θ); (man → woman) ≈ (king → queen).]
Fig 2 · A 2D projection of an embedding space. Similar things cluster; meaningful differences become parallel directions. The same operation (a dot product between two vectors) is the load-bearing primitive in retrieval (L59), attention (L40), and recommender ranking (L61).
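The two claims in Fig 2 can be checked by hand on toy vectors. The sketch below is illustrative only: the 3D "embeddings" are hypothetical, hand-made values (real ones are learned and have hundreds of dimensions), and the axis meanings are an assumption chosen so the geometry is easy to read.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two vectors: dot product divided by both lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical hand-made vectors, NOT learned embeddings.
# Axes, by construction: [royalty, gender, animal-ness].
emb = {
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "cat":   np.array([0.0,  0.1, 1.0]),
    "dog":   np.array([0.0, -0.1, 1.0]),
}

# Claim 1: similar things point in similar directions.
sim_cat_dog = cosine_similarity(emb["cat"], emb["dog"])
sim_cat_king = cosine_similarity(emb["cat"], emb["king"])

# Claim 2: a meaningful difference shows up as a parallel direction.
man_to_woman = emb["woman"] - emb["man"]
king_to_queen = emb["queen"] - emb["king"]
direction_match = cosine_similarity(man_to_woman, king_to_queen)

print("cat vs dog :", round(sim_cat_dog, 3))
print("cat vs king:", round(sim_cat_king, 3))
print("(man -> woman) vs (king -> queen):", round(direction_match, 3))
```

In this toy setup the parallel directions are exact because the vectors were built that way; in learned embedding spaces the relationship is approximate.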
[fig 3 · optimisation landscape · gradients point downhill · L18 · L19. Axes: parameter 1, parameter 2. Marked: init, saddle, local minimum, step along −∇L. Note: in high dimensions, many saddles, few bad minima; the picture survives.]
Fig 3 · A 2D loss landscape with a gradient descent trajectory. Each step follows the negative gradient. In high-dimensional spaces (which is where AI training actually lives) saddle points are abundant and bad local minima are rare; the intuition this picture provides still applies.
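A minimal numpy sketch of the trajectory in Fig 3, in the spirit of the B2 build. The surface f(x, y) = (x² − 1)² + y² is an assumption chosen for illustration (it has two minima at (±1, 0) and a saddle at the origin); the learning rate, step count, and start points are likewise hypothetical.

```python
import numpy as np

def loss(p):
    """Illustrative surface: minima at (+1, 0) and (-1, 0), saddle at (0, 0)."""
    x, y = p
    return (x**2 - 1.0)**2 + y**2

def grad(p):
    """Analytic gradient of the surface above."""
    x, y = p
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

def gradient_descent(init, learning_rate=0.05, steps=200):
    """Repeatedly step along the negative gradient; return every point visited."""
    p = np.array(init, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p - learning_rate * grad(p)
        path.append(p.copy())
    return np.array(path)

# Start slightly off the saddle's ridge line: the walk rolls into a minimum.
path = gradient_descent(init=(0.1, 1.0))
final = path[-1]

# Start exactly on the line x = 0: the x-gradient is zero there,
# so plain gradient descent slides into the saddle and stays.
stuck = gradient_descent(init=(0.0, 1.0))[-1]

print("off-ridge start ends near:", final, "loss:", loss(final))
print("on-ridge start ends near :", stuck, "loss:", loss(stuck))
```

The second run is the 2D version of the saddle problem: the gradient vanishes without the point being a minimum, and only a nudge off the ridge lets the walk continue.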
The 11 stations

The whiteboard wall, left to right.

Each station is a single sketch on the wall, anchored to one mathematical primitive. The wall reads as one picture by the time S2 closes the phase.

L11 · The arrow on the board · vectors. Direction plus magnitude. The geometric picture before the algebra.
L12 · The pinned map · distance, similarity, and semantic geometry. Euclidean distance, dot product, cosine similarity. Where meaning becomes geometry.
L13 · The layered grids · dimensions, feature spaces, and representation capacity. What it means for vectors to live in 768 or 4096 dimensions. Learned features as the modern move.
L14 · The grid · matrices and linear transforms. A matrix as a thing that bends space.
L15 · The stack of transparent sheets · projections, subspaces, and information selection. A projection keeps some information and discards the rest. Selection as the engine under embeddings, attention, and feature extraction.
L16 · The dice rack · probability, uncertainty, and belief. Probability as measured uncertainty. Distributions, confidence vs certainty, calibration.
L17 · The funnel and bins · distributions, sampling, and possible futures. Distribution as the landscape of outcomes; sampling as the draw; temperature as the knob. Why an LLM outputs a distribution over tokens.
L18 · The fog machine · entropy, uncertainty, and surprise. Entropy as average surprise; one number for how much uncertainty a distribution carries. Prediction, compression, cross-entropy.
L19 · The slope & valley · gradients and optimisation landscapes. Training is optimisation: loss is height, gradient is slope, descent steps downhill. Minima, saddle points, learning rate.
L20 · The conveyor belt · parallelism. Serial vs parallel; why AI parallelises so well. CPU vs GPU, SIMD, data and model parallelism, throughput vs latency.
L21 · The lever & machine · compute scaling intuition. More compute helps, but with diminishing returns. Compute, data, and model size grow together. Scaling as empirical, not law.
S2 · Synthesis · the whole wall as one picture. The five themes as a single toolkit: represent, transform, measure uncertainty, learn, scale. Bridge to Phase 3.
C2 · Calibration · reasoning across the five themes. Diagnostic, not a grade: where the mechanisms hold and where the connections break. Gate to Phase 3.
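Stations L17 and L18 describe temperature as a knob on a distribution and entropy as average surprise. A small sketch of both, assuming a hypothetical four-token vocabulary with made-up scores (`logits`); the function names are illustrative, not from any particular library.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn raw scores into a distribution; temperature rescales the scores first."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability; result is unchanged
    p = np.exp(z)
    return p / p.sum()

def entropy_bits(p):
    """Average surprise of a distribution, in bits."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(nz * np.log2(nz)).sum())

def surprise_bits(p, outcome):
    """Surprise of one observed outcome: -log2 of the probability it was given."""
    return float(-np.log2(p[outcome]))

# Hypothetical scores for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
rng = np.random.default_rng(0)

for t in (0.5, 1.0, 2.0):
    p = softmax(logits, temperature=t)
    draws = rng.choice(len(p), size=1000, p=p)      # sampling as the draw
    counts = np.bincount(draws, minlength=len(p))   # how often each token came up
    print(f"T={t}: p={np.round(p, 3)} entropy={entropy_bits(p):.3f} bits counts={counts}")
```

Low temperature sharpens the distribution (lower entropy, less diverse draws); high temperature flattens it. Averaging `surprise_bits` over observed tokens is the cross-entropy that L18 names.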
The maths toolkit

Each primitive, and where it shows up next.

Phase 2 only teaches maths that has a later job. The table below maps each primitive to its first downstream use.

L# | Primitive | Where it earns its place
L11 | Vector | Representation as a point in space (Phase 4, every architecture); retrieval (Phase 6).
L12 | Distance · cosine similarity · dot product | Attention scores (L40), cosine similarity in retrieval (L59), classifier heads (L47, L61).
L13 | Dimensionality · feature spaces · capacity | Embedding size budget (L59); VRAM accounting (L25); the curse and the blessing of high dimensions.
L14 | Matrix · linear transform | Every layer is a matmul (L26, L35, L43). Tensor cores exist because of this.
L15 | Projection · subspace · information selection | Attention as selection (L40); feature extraction and latent subspaces (L43, L59); dimensionality reduction and PCA (deep dive). Lossy compression that keeps task-relevant structure.
L16 | Probability · uncertainty · distribution | Every classifier and language-model output is a distribution; confidence and calibration (L65); uncertainty estimation.
L17 | Sampling · temperature · outcome space | Next-token decoding and generative models sample from distributions (L40, L43); temperature tunes output diversity; Monte Carlo and uncertainty estimation.
L18 | Entropy · surprise · cross-entropy | Cross-entropy is the default loss in pretraining (L49), classification (L47), evaluation (L65); entropy ties prediction to compression.
L19 | Gradient · loss · optimisation landscape | The training loop everywhere: backprop (L36), SGD/Adam (L49), and the failure modes of optimisation.
L20 | Parallelism · throughput · SIMD | Why GPUs took over (L23); the four ways to shard a model (L33); throughput-bound training.
L21 | Compute · scaling · diminishing returns | Formal scaling laws (L51); the compute-data-parameter triangle (L50); why efficiency and hardware matter.
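The L14 and L15 rows can be made concrete in a few lines of numpy. This is a sketch under stated assumptions: the rotation angle, the direction `u`, and the input vectors are arbitrary illustrative values.

```python
import numpy as np

# L14: a matrix as a thing that bends space. Here, a 90-degree rotation of the plane.
theta = np.pi / 2
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
v = np.array([1.0, 0.0])
rotated = rotate @ v  # the x-axis arrow now points along the y-axis

# L15: a projection keeps some information and discards the rest.
# Projecting onto the line spanned by a unit vector u uses P = u u^T.
u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
P = np.outer(u, u)

x = np.array([3.0, 1.0, 5.0])
kept = P @ x           # the part of x that lies along u
discarded = x - kept   # everything the projection threw away

print("rotated  :", np.round(rotated, 6))
print("kept     :", kept)
print("discarded:", discarded)
print("kept . discarded =", round(float(kept @ discarded), 12))
```

Two properties worth noticing: projecting twice changes nothing (P @ P equals P), and the discarded part is orthogonal to the kept part. That is what "lossy compression" means geometrically.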
Phase 2 themes

Four ideas the wall reinforces.

Theme · 1

Geometry enables generalisation

Once representations live in a vector space, similar things sit near each other and meaningful differences become directions. That spatial structure is what makes generalisation to unseen inputs possible at all.

Theme · 2

Representations become spaces

The Phase 1 word "representation" becomes a concrete object: a point in a 768- or 4096-dimensional space. The shape of the space is the shape of what the system can express.

Theme · 3

Optimisation as movement

Learning becomes a downhill walk on the loss surface. The negative gradient gives the direction; the learning rate sets the step size; the trajectory is the training run. The picture survives high dimensions intact.

Theme · 4

Meaning as spatial structure

Closeness is similarity. Angles encode meaning. Clusters reveal categories the system found on its own. The reader leaves the wall with a working intuition for how meaning sits inside a model.

Contrast pair (retention engineering)

Compute-bound vs memory-bound. Hold both.

L20 and L21 set up the contrast that becomes the spine of Phase 3. A workload is either limited by how fast the chip can compute (compute-bound) or by how fast data can move into the chip (memory-bound). The roofline model in L28 is the visual that makes the distinction operational.

Phase 2 sketches the cost model abstractly. Phase 3 names which specific hardware decisions push a workload into one regime or the other. The contrast is the kind of paired mechanism the course surfaces explicitly so the reader can distinguish them on sight.
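The contrast can be put into arithmetic with the basic roofline relation: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative: the chip numbers are hypothetical, and the byte count assumes each matrix is read or written exactly once, which real kernels only approximate.

```python
# Hypothetical chip, for illustration only (not a real product's spec sheet).
PEAK_FLOPS = 100e12      # 100 TFLOP/s of compute
MEM_BANDWIDTH = 1e12     # 1 TB/s of memory bandwidth

def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline: throughput is capped by compute or by data movement, whichever is lower."""
    return min(peak_flops, mem_bandwidth * intensity)

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes 2*m*k*n FLOPs and that A, B, and the output each cross
    the memory bus exactly once (an idealised best case).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def regime(intensity, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Compare intensity to the ridge point where the two limits meet."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

for name, shape in [("large matmul", (4096, 4096, 4096)),
                    ("matrix-vector", (4096, 4096, 1))]:
    ai = matmul_intensity(*shape)
    tput = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, ai)
    print(f"{name}: intensity={ai:.1f} FLOP/byte, "
          f"attainable={tput / 1e12:.1f} TFLOP/s, {regime(ai)}")
```

Same operation, two regimes: the large matmul reuses every loaded value many times, while the matrix-vector product touches each weight once and spends its time waiting on memory.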

Core laws established in Phase 2

What lands here · what recurs later

  • Geometry enables generalisation. Established here with vectors, dot products, and dimensionality (L11–L13). Recurs in attention (L40, L41), embeddings in practice (L59), and the geometry of representation learning across Phase 4.
  • Optimisation shapes capability. Gradients and landscapes (L18–L19) make the mechanism concrete. Recurs as scaling laws (L51), as RLHF (L53), as the diagnostic frame for training failure (S5, C5).
  • Constraints shape systems. Parallelism and compute scaling (L20–L21) name the cost side of every later architecture choice. Recurs through the whole of Phase 3 (the substrate) and Phase 5 (the cost of training).
  • Representation shapes computation. Reinforced from L7 by making representation a measurable geometric object. The shape of the embedding space becomes the shape of what the model can compute over.
Bridge to Phase 3

The maths runs on silicon.

Phase 2 closes with a cost model: matmul is the dominant operation, gradients are computed at every step, parallelism is how the operations get done in finite time. None of those facts yet name a chip.
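A minimal sketch of that cost model on a single linear layer, assuming a toy regression task with made-up sizes and a mean-squared-error loss. The point is only that the forward pass and the gradient are both matmuls, and that both run at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 4, 3          # hypothetical toy sizes

X = rng.normal(size=(batch, d_in))    # inputs
W_true = rng.normal(size=(d_in, d_out))
T = X @ W_true                        # targets produced by a hidden "true" layer
W = np.zeros((d_in, d_out))           # the weights being learned

def training_step(W, learning_rate=0.1):
    """One step: forward matmul, loss, backward matmul, update."""
    Y = X @ W                                   # forward pass: a matmul
    err = Y - T
    loss = float((err ** 2).mean())             # mean squared error
    grad_W = (2.0 / err.size) * (X.T @ err)     # gradient of the loss: another matmul
    return W - learning_rate * grad_W, loss

losses = []
for _ in range(200):
    W, step_loss = training_step(W)
    losses.append(step_loss)

# Multiply-adds in the forward matmul, counted as 2 FLOPs each.
flops_forward = 2 * batch * d_in * d_out

print("first loss:", round(losses[0], 4), "last loss:", round(losses[-1], 6))
print("forward FLOPs per step:", flops_forward)
```

The backward matmul here costs the same 2 × batch × d_in × d_out FLOPs as the forward one. Scale the three sizes up by a few thousand each and the need for parallel hardware follows from the arithmetic.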

Phase 3 names the chip. CPU, GPU, VRAM, tensor cores, memory hierarchies, the roofline. The same matmul that Phase 2 sketches on the wall becomes the operation tensor cores were designed to run, in the precision the silicon can sustain, against the bandwidth the memory hierarchy provides. The reader walks through the heavy door into the server bay knowing what the substrate has to be good at.

S2 reads the wall as one picture; C2 gates the move. If C2 doesn't stick, you walk the wall again before crossing through to the server bay.