Phase 2 taught the mathematics. Phase 3 teaches the physical substrate that makes those mathematical ideas practical. The question of the phase is simple: what kind of machine must exist to make Phase 2 possible?
Lessons: L22–L33 + S3 + C3
Time: ~5 weeks
Builds: B5 roofline profile, B6 quantisation lab (+ optional B16 TinyML on an MCU)
Compute-spectrum lens: dominant here
Core law established here: hardware shapes architecture
The transformation
From mathematical abstractions to physical reality.
Every idea from the whiteboard wall has a physical cost. A vector is numbers that occupy memory. A matrix becomes a matrix multiplication that runs on specific silicon. A gradient is an enormous amount of arithmetic. Optimisation is a workload that consumes hardware for weeks. Scaling is infrastructure: racks, interconnects, power, and cooling.
So the mathematics of Phase 2 doesn't float free. It has to be realised in physical systems, and those systems have limits. Memory has a size and a speed. Moving data costs time and energy. A chip can only do so much arithmetic per second, and only if it can be fed fast enough. Phase 3 is where those limits stop being footnotes and start shaping everything.
This is not a buyer's guide and not a product comparison. The point is systems intuition: the instinct to look at a model and ask where it hits the memory wall, what fraction of the chip it actually uses, and why it is shaped to fit the hardware it runs on.
Phase 3 in one line
The mathematics of Phase 2 sets what is possible in principle. The physical cost of moving, storing, and transforming information sets what is possible in practice. Every abstraction eventually becomes electricity moving through hardware.
Hardware as the substrate
Two pictures the rest of the course assumes.
The two figures below are the physical scaffolding Phases 4 through 7B rest on. Left: a single accelerator, where the real cost is moving data, not doing arithmetic. Right: the cluster, where the model is just one component of a system made of interconnects, storage, power, and cooling.
Fig A · The modern AI accelerator. A wall of compute and tensor-core units (cheap and plentiful) is fed from large but slower HBM/VRAM across a memory bus that is the real bottleneck. The on-chip cache is tiny and fast; the hierarchy below it trades speed for size. The phase's first durable picture: computation is cheap, moving data is expensive, and most AI work is bandwidth-bound.
Fig B · The AI cluster. Many accelerators are gathered into nodes, wired together by interconnects and a network spine, fed by shared storage, and surrounded by power and cooling. At frontier scale the chip is only one part: interconnect bandwidth, storage, megawatts of power, and the cooling to remove the resulting heat are all first-class engineering constraints. The unit of frontier compute is the datacentre, not the chip.
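The claim in Fig A, that moving data rather than doing arithmetic is usually the limit, can be made concrete with a roofline calculation. The sketch below is illustrative only: the peak throughput and bandwidth figures are hypothetical round numbers rather than any real chip, and the byte count assumes each matrix is read or written exactly once.

```python
# Minimal roofline sketch. Hardware numbers are hypothetical, picked for round arithmetic.

PEAK_FLOPS = 100e12   # assumed compute roof: 100 TFLOP/s
BANDWIDTH = 1e12      # assumed memory bandwidth: 1 TB/s


def matmul_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Idealised: both inputs are read once and the output is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity: float,
                     peak: float = PEAK_FLOPS,
                     bandwidth: float = BANDWIDTH) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak, bandwidth * intensity)


def regime(intensity: float,
           peak: float = PEAK_FLOPS,
           bandwidth: float = BANDWIDTH) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak / bandwidth  # FLOPs per byte needed to saturate compute
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    shapes = {
        "matrix-vector (1 x 4096 x 4096)": (1, 4096, 4096),
        "square matmul (4096 x 4096 x 4096)": (4096, 4096, 4096),
    }
    for name, (m, n, k) in shapes.items():
        ai = matmul_intensity(m, n, k)
        print(f"{name}: {ai:.1f} FLOP/byte, {regime(ai)}, "
              f"bound {attainable_flops(ai) / 1e12:.1f} TFLOP/s")
```

Under these assumptions the matrix-vector product sits far below the ridge point and is limited by bandwidth, while the large square matmul clears it. The same two functions can be pointed at any kernel once its FLOP and byte counts are known.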
The 12 stations
The server bay, around the cold aisle.
Each station is one piece of the machine, anchored to a physical object you can picture. Walked clockwise, they read as a single argument: memory hierarchy as the spine, the roofline as the constraint surface, and matmul-shaped silicon as the response.
Each primitive, and where it becomes load-bearing later.
Phase 3 only teaches hardware that has a later job. The table maps each primitive to where it constrains a decision in a future phase.
L# | Primitive | Where it becomes load-bearing
L22 | CPU | The control plane of every system, and the baseline GPUs are measured against (L23). Where serial, branchy work stays.
L23 | GPU | Every training and inference run (Phases 5 and 6). Why matmul-shaped architectures won (Phase 4).
L24 | SIMT execution | Why attention and matmul map cleanly onto the hardware (L43); the cost model behind batching (L66).
L25 | VRAM · bandwidth | The hard cap on model size and context length; the KV-cache budget at inference (L66).
L26 | Tensor cores | L14's matrix multiply realised in silicon; the unit of training throughput and the reason transformers fit the chip (L43).
L27 | Memory hierarchy | Where weights, activations, and the KV cache live (L66); the spine of every performance decision.
L28 | Roofline · latency | The diagnostic for compute-bound vs memory-bound, used to read any kernel or run in Phases 5 and 6.
L29 | Accelerators (TPU/NPU) | Designed-for-one-workload silicon; the vendor-neutral view that recurs in MoE routing and inference accelerators.
L30 | Edge inference | The edge tier of the compute spectrum: distilled and quantised models (L56, L32) that run on a phone.
L31 | Interconnects | The bandwidth that bounds distributed training (L33) and multi-machine inference; why clusters are wired the way they are.
L32 | Quantisation | The lever that moves a model down the compute spectrum (L56 distillation, L66 inference): precision vs quality.
L33 | Clusters · infrastructure | How frontier training is sharded (Phase 5); the four parallelisms, plus the storage, power, and cooling a datacentre needs.
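The VRAM and memory-hierarchy rows in the table come down to simple byte counting. As a rough sketch, weight memory is parameters times bytes per parameter, and the KV cache stores one key and one value vector per token, per layer. The model dimensions below are hypothetical, not taken from any model named in the course, and the estimate ignores activations, optimiser state, and framework overhead.

```python
# Back-of-envelope memory budget. Model dimensions are hypothetical.


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Bytes needed to hold the weights at a given precision."""
    return n_params * bits_per_param // 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_element: int = 2) -> int:
    """Bytes for the KV cache: the factor of 2 covers keys plus values."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumed 7B-parameter model
    for bits in (16, 8, 4):
        gb = weight_bytes(n_params, bits) / 1e9
        print(f"weights at {bits}-bit: {gb:.1f} GB")

    # Assumed shape: 32 layers, 32 KV heads of dimension 128, 16-bit cache.
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                           seq_len=4096, batch_size=1)
    print(f"KV cache at 4096 tokens: {cache / 1024**3:.1f} GiB")
```

Two levers fall out of the arithmetic: halving the precision halves the weight footprint, and the cache grows linearly with both context length and batch size.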
Phase 3 themes
Four ideas the server bay reinforces.
Theme · 1
Constraints shape systems
The dominant law of the phase. Memory size, bandwidth, power, and cooling are not background details; they are the boundaries every architecture and training run lives inside. The shape of modern AI is downstream of these limits.
Theme · 2
Data movement dominates cost
Computation is cheap and getting cheaper; moving data between memory and compute is the expensive part. Many AI workloads, token-by-token inference in particular, are bandwidth-limited rather than compute-limited, which is why the memory hierarchy and the roofline matter more than raw arithmetic counts.
Theme · 3
Parallelism creates capability
A single chip is a wall of parallel units; a cluster is a wall of chips. Capability at scale comes from doing the same arithmetic in enormous parallel breadth, which is exactly why the hardware is built the way it is (a callback to L20).
Theme · 4
Infrastructure determines scale
At the frontier the model is one component of a system that includes interconnects, storage, power, and cooling. Power and cooling become first-class engineering constraints, and the real unit of frontier compute is the datacentre, not the chip.
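To see why interconnect bandwidth counts as a first-class constraint, consider one common case that the text does not single out: data-parallel training in which every device must end each step with the same summed gradients. Assuming a ring all-reduce and counting only the bandwidth term (latency and overlap with compute are ignored), each device sends and receives roughly 2(N-1)/N times the gradient size. The payload and link speed below are hypothetical.

```python
# Bandwidth-only estimate for a ring all-reduce. All figures are hypothetical.


def ring_allreduce_seconds(payload_bytes: float, n_devices: int,
                           link_bytes_per_s: float) -> float:
    """Time to all-reduce `payload_bytes` across `n_devices` on a ring.

    Each device transfers 2 * (N - 1) / N of the payload: a reduce-scatter
    pass followed by an all-gather pass. Per-message latency is ignored.
    """
    if n_devices < 2:
        return 0.0  # a single device has nothing to exchange
    traffic_factor = 2 * (n_devices - 1) / n_devices
    return traffic_factor * payload_bytes / link_bytes_per_s


if __name__ == "__main__":
    gradients = 14e9   # assumed: 7B parameters at 2 bytes each
    link = 100e9       # assumed: 100 GB/s per-device link
    for n in (2, 8, 64):
        t = ring_allreduce_seconds(gradients, n, link)
        print(f"{n} devices: {t:.3f} s per all-reduce")
```

In this model the cost per step barely changes as devices are added, since the traffic factor approaches 2, so the link speed rather than the device count sets the floor on synchronisation time.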
Core laws established in Phase 3
What lands here · what recurs later
Hardware shapes architecture. Established here. The matmul-shaped transformer, KV-cache-aware attention variants, and mixture-of-experts routing are all responses to silicon. Recurs across all of Phase 4 (every architecture reads as a hardware response), Phase 5 (the cost of training), and Phase 6 (inference and deployment).
Constraints shape systems. Reinforced and made physical. Threaded through Phase 2 as a cost model, it becomes concrete here as memory, bandwidth, power, and cooling, and recurs as the limiting factor in every later engineering decision.
Representation shapes computation. Callback to L7 and Phase 2. The representation is data that lives in memory and moves across a bus; its size and precision (L25, L32) are hardware decisions as much as modelling ones.
The compute spectrum. Sharpened here. The same mechanisms run from a microcontroller (L30) to a datacentre (L33); what changes is the constraint set, and quantisation (L32) is the lens that makes the spectrum traversable.
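The precision-versus-quality trade behind quantisation can be seen in a few lines. The sketch below uses symmetric per-tensor int8 quantisation with a single scale, which is one simple scheme among several and not necessarily the one the B6 lab uses; the function names and sample values are illustrative.

```python
# Symmetric per-tensor int8 quantisation, standard library only.
# One simple scheme chosen for illustration; others exist.


def quantise_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantised]


if __name__ == "__main__":
    weights = [0.82, -1.0, 0.031, 0.0, -0.47]  # made-up sample values
    q, scale = quantise_int8(weights)
    restored = dequantise(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("int8 codes:", q)
    print(f"scale: {scale:.6f}, worst-case error: {worst:.6f}")
```

Each value now costs one byte instead of two or four, and the rounding error per value is at most half the scale. Whether that error is acceptable depends on the model and the task, which is the quality side of the trade.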
Bridge to Phase 4
The machines built inside the room.
You now have the foundations (Phase 1), the mathematics (Phase 2), and the hardware (Phase 3). The machine room is no longer mysterious: you can see what the silicon is good at, why moving data costs more than computing, and why a cluster looks the way it does.
The next question is what structures run on top of all that. Phase 4 takes up neural architectures: the perceptron, the multilayer network, convolutional and recurrent nets, attention, and the transformer. Each one will read as a response to a prior limit and to the hardware underneath it, which is exactly the lens this phase built. Having seen the machine room, we now examine the machines built inside it.