Phase 3 · the server bay · 12 stations · S3 + C3

The machine room.

Phase 2 taught the mathematics. Phase 3 teaches the physical substrate that makes those mathematical ideas practical. The question of the phase is simple: what kind of machine must exist to make Phase 2 possible?

Lessons: L22–L33 + S3 + C3
Time: ~5 weeks
Builds: B5 roofline profile, B6 quantisation lab (+ optional B16 TinyML on an MCU)
Compute-spectrum lens: dominant here
Core law established here: hardware shapes architecture

The transformation

From mathematical abstractions to physical reality.

Every idea from the whiteboard wall has a physical cost. A vector is a set of numbers that occupy memory. A matrix becomes a matrix multiplication that runs on specific silicon. A gradient is an enormous amount of arithmetic. Optimisation is a workload that consumes hardware for weeks. Scaling is infrastructure: racks, interconnects, power, and cooling.

So the mathematics of Phase 2 doesn't float free. It has to be realised in physical systems, and those systems have limits. Memory has a size and a speed. Moving data costs time and energy. A chip can only do so much arithmetic per second, and only if it can be fed fast enough. Phase 3 is where those limits stop being footnotes and start shaping everything.
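To make "memory has a size and a speed" concrete, here is a minimal sketch of the arithmetic. The parameter count and the bandwidth figure are hypothetical round numbers chosen for illustration, not measurements of any real model or chip; the byte widths are the standard sizes of each number format.

```python
# Bytes per stored value for a few common number formats.
BYTES_PER_VALUE = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: int, dtype: str) -> float:
    """Memory needed just to hold the weights, in bytes."""
    return num_params * BYTES_PER_VALUE[dtype]


def seconds_to_stream(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read every byte once from memory."""
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    params = 7_000_000_000   # hypothetical model size (assumption)
    bandwidth = 1e12         # hypothetical memory bandwidth, 1 TB/s (assumption)
    for dtype in BYTES_PER_VALUE:
        size = weight_bytes(params, dtype)
        t_ms = seconds_to_stream(size, bandwidth) * 1e3
        print(f"{dtype}: {size / 1e9:.1f} GB, {t_ms:.1f} ms per full read")
```

The second function is the one worth dwelling on: if every weight must be read once per step, no amount of arithmetic throughput can make that step faster than the memory system allows.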

This is not a buyer's guide and not a product comparison. The point is systems intuition: the instinct to look at a model and ask where it hits the memory wall, what fraction of the chip it actually uses, and why it is shaped to fit the hardware it runs on.

Phase 3 in one line

The mathematics of Phase 2 sets what is possible in principle. The physical cost of moving, storing, and transforming information sets what is possible in practice. Every abstraction eventually becomes electricity moving through hardware.

Hardware as the substrate

Two pictures the rest of the course assumes.

The two figures below are the physical scaffolding Phases 4 through 7B rest on. Left: a single accelerator, where the real cost is moving data, not doing arithmetic. Right: the cluster, where the model is just one component of a system made of interconnects, storage, power, and cooling.

[Figure A: the modern AI accelerator. Diagram of an accelerator die showing compute units and tensor cores, on-chip SRAM / cache, HBM / VRAM, and the memory bus between them, with the hierarchy registers → SRAM → HBM → DRAM → disk.]
Fig A · The modern AI accelerator. A wall of compute and tensor-core units (cheap and plentiful) is fed from large but slower HBM/VRAM across a memory bus that is the real bottleneck. The on-chip cache is tiny and fast; the hierarchy below it trades speed for size. The phase's first durable picture: computation is cheap, moving data is expensive, and most AI work is bandwidth-bound.
[Figure B: the AI cluster. Diagram of four nodes, each holding several accelerators, joined by a network interconnect, with shared storage for datasets and checkpoints, a power feed, and cooling.]
Fig B · The AI cluster. Many accelerators are gathered into nodes, wired together by interconnects and a network spine, fed by shared storage, and surrounded by power and cooling. At frontier scale the chip is only one part: interconnect bandwidth, storage, megawatts of power, and the cooling to remove the resulting heat are all first-class engineering constraints. The unit of frontier compute is the datacentre, not the chip.
The 12 stations

The server bay, around the cold aisle.

Each station is one piece of the machine, anchored to a physical object you can picture. Walked clockwise, they read as a single argument: memory hierarchy as the spine, the roofline as the constraint surface, and matmul-shaped silicon as the response.

L22 The old CPU on the shelf · the general-purpose CPU. A serial workhorse with deep pipelines and caches: fast at one varied task, weak at bulk parallel arithmetic.
L23 The GPU board on the rack · why GPUs exist. Throughput over latency: thousands of simple units that made matmul-heavy AI trainable at all.
L24 The warp and lane diagram · the GPU execution model. SIMT, lanes, and warps: one instruction running across many data values at once.
L25 The VRAM stick · VRAM and the memory wall. Capacity and bandwidth: why model size is capped by memory, not compute, and why feeding the chip is the hard part.
L26 The tensor core schematic · tensor cores and matmul engines. Silicon shaped for the one operation (L14's matrix multiply) that dominates training and inference.
L27 The memory hierarchy ladder · memory hierarchies. Registers, SRAM, HBM, DRAM, disk: each rung roughly 10× slower than the one above, which decides where data should live.
L28 The roofline poster · the roofline model. Compute-bound on one side, memory-bound on the other: how to read which limit a workload actually hits.
L29 The TPU pod model · TPUs and other accelerators. Systolic arrays and designed-for-one-workload silicon: what changes when a chip is built for a single operation.
L30 The NPU in the phone · NPUs and edge inference. AI on phones, laptops, and embedded boards: milliwatts and kilobytes, a different shape of chip.
L31 The NVLink ribbon · interconnects and networking. The bandwidth between chips, which matters as much as the bandwidth inside them: NVLink, InfiniBand, PCIe, and why it bounds cluster scale.
L32 The quantisation scale · quantisation. FP32 down to int4: what you give up at each step, the lever that makes the whole compute spectrum traversable.
L33 The cluster diagram on the back wall · distributed compute patterns. Data, tensor, pipeline, and expert parallel: how a model is split across many chips, with the storage, power, and cooling around them.
S3 Synthesis · the cold-aisle walk. The 12 stations as one loop: memory hierarchy as the spine, the roofline as the constraint surface, matmul-shaped silicon as the response.
C3 Calibration · hardware constraint analysis. Arithmetic intensity, precision damage, parallel-strategy comparison. Gate to Phase 4.
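The roofline station (L28) reduces to one line of arithmetic: attainable throughput is the smaller of the chip's peak compute and what the memory system can feed it. The sketch below is illustrative only. The function names are made up for this example, and the peak and bandwidth figures are hypothetical round numbers, not the specification of any real accelerator.

```python
def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity).

    peak_flops  -- peak arithmetic rate, FLOP/s
    bandwidth   -- memory bandwidth, bytes/s
    intensity   -- arithmetic intensity of the workload, FLOP per byte moved
    """
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def classify(peak_flops: float, bandwidth: float, intensity: float) -> str:
    """Which limit a workload with this intensity hits."""
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    peak = 100e12   # hypothetical: 100 TFLOP/s (assumption)
    bw = 1e12       # hypothetical: 1 TB/s (assumption)
    for intensity in (1, 10, 100, 1000):
        bound = attainable_flops(peak, bw, intensity)
        print(intensity, classify(peak, bw, intensity), f"{bound / 1e12:.0f} TFLOP/s")
```

With these assumed numbers the ridge sits at 100 FLOP per byte: a workload below that intensity leaves compute units idle however fast they are.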
The hardware toolkit

Each primitive, and where it becomes load-bearing later.

Phase 3 only teaches hardware that has a later job. The table maps each primitive to where it constrains a decision in a future phase.

L# | Primitive | Where it becomes load-bearing
L22 | CPU | The control plane of every system, and the baseline GPUs are measured against (L23). Where serial, branchy work stays.
L23 | GPU | Every training and inference run (Phases 5 and 6). Why matmul-shaped architectures won (Phase 4).
L24 | SIMT execution | Why attention and matmul map cleanly onto the hardware (L43); the cost model behind batching (L66).
L25 | VRAM · bandwidth | The hard cap on model size and context length; the KV-cache budget at inference (L66).
L26 | Tensor cores | L14's matrix multiply realised in silicon; the unit of training throughput and the reason transformers fit the chip (L43).
L27 | Memory hierarchy | Where weights, activations, and the KV cache live (L66); the spine of every performance decision.
L28 | Roofline · latency | The diagnostic for compute-bound vs memory-bound, used to read any kernel or run in Phases 5 and 6.
L29 | Accelerators (TPU/NPU) | Designed-for-one-workload silicon; the vendor-neutral view that recurs in MoE routing and inference accelerators.
L30 | Edge inference | The edge tier of the compute spectrum: distilled and quantised models (L56, L32) that run on a phone.
L31 | Interconnects | The bandwidth that bounds distributed training (L33) and multi-machine inference; why clusters are wired the way they are.
L32 | Quantisation | The lever that moves a model down the compute spectrum (L56 distillation, L66 inference): precision vs quality.
L33 | Clusters · infrastructure | How frontier training is sharded (Phase 5); the four parallelisms, plus the storage, power, and cooling a datacentre needs.
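The VRAM row's "KV-cache budget" can be estimated with the usual back-of-envelope formula: one key tensor and one value tensor per layer, each holding one vector per head per token. This is a sketch under stated assumptions. The model shape below is hypothetical, and real systems add overheads (allocator padding, paging) that this ignores.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int) -> int:
    """Size of the key/value cache in bytes.

    The leading factor of 2 counts keys and values separately.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


def max_context_tokens(budget_bytes: int, layers: int, kv_heads: int,
                       head_dim: int, batch: int, bytes_per_value: int) -> int:
    """Longest context that fits in a given memory budget."""
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1, batch, bytes_per_value)
    return budget_bytes // per_token


if __name__ == "__main__":
    # Hypothetical model shape (assumption): 32 layers, 32 KV heads of width 128, fp16.
    shape = dict(layers=32, kv_heads=32, head_dim=128, batch=1, bytes_per_value=2)
    size = kv_cache_bytes(seq_len=4096, **shape)
    print(f"4096 tokens: {size / 2**30:.1f} GiB")
    print("tokens that fit in 8 GiB:", max_context_tokens(8 * 2**30, **shape))
```

Because the cache grows linearly with both context length and batch size, it competes directly with the weights for the same memory, which is why the table calls VRAM a hard cap on context length.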
Phase 3 themes

Four ideas the server bay reinforces.

Theme · 1

Constraints shape systems

The dominant law of the phase. Memory size, bandwidth, power, and cooling are not background details; they are the boundaries every architecture and training run lives inside. The shape of modern AI is downstream of these limits.

Theme · 2

Data movement dominates cost

Computation is cheap and getting cheaper; moving data between memory and compute is the expensive part. Most AI workloads are bandwidth-limited, not compute-limited, which is why the memory hierarchy and the roofline matter more than raw arithmetic counts.
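One way to see the claim is to count arithmetic against bytes for a single matrix multiply. The sketch below uses the idealised model in which each matrix is read or written exactly once; real kernels move more data than this, so treat the result as an upper bound on arithmetic intensity. The shapes are hypothetical.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per term."""
    return 2 * m * k * n


def matmul_bytes(m: int, k: int, n: int, bytes_per_value: int) -> int:
    """Idealised traffic: read A and B once, write C once (assumption)."""
    return (m * k + k * n + m * n) * bytes_per_value


def arithmetic_intensity(m: int, k: int, n: int, bytes_per_value: int = 2) -> float:
    """FLOPs performed per byte moved."""
    return matmul_flops(m, k, n) / matmul_bytes(m, k, n, bytes_per_value)


if __name__ == "__main__":
    # m plays the role of batch size against a hypothetical 4096 x 4096 weight matrix.
    for m in (1, 16, 256):
        print(f"m={m}: {arithmetic_intensity(m, 4096, 4096):.1f} FLOP/byte")
```

At m = 1 the weight matrix is fetched to do about one FLOP per byte; at larger m the same fetch is amortised over many rows. That is the arithmetic behind batching: the data movement stays nearly fixed while the useful work grows.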

Theme · 3

Parallelism creates capability

A single chip is a wall of parallel units; a cluster is a wall of chips. Capability at scale comes from doing the same arithmetic in enormous parallel breadth, which is exactly why the hardware is built the way it is (a callback to L20).

Theme · 4

Infrastructure determines scale

At the frontier the model is one component of a system that includes interconnects, storage, power, and cooling. Power and cooling become first-class engineering constraints, and the real unit of frontier compute is the datacentre, not the chip.
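A rough sense of why interconnect bandwidth is a first-class constraint comes from estimating the cost of synchronising gradients. The sketch below assumes a ring all-reduce, a common choice for data-parallel training that the text does not itself specify, and counts bandwidth only: latency, overlap with compute, and network topology are ignored. All figures are hypothetical.

```python
def ring_allreduce_bytes_per_device(payload_bytes: float, devices: int) -> float:
    """Bytes each device sends in a ring all-reduce.

    The ring runs a reduce-scatter then an all-gather, each moving
    (devices - 1) / devices of the payload per device.
    """
    if devices < 2:
        return 0.0
    return 2 * (devices - 1) / devices * payload_bytes


def allreduce_seconds(payload_bytes: float, devices: int,
                      link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on all-reduce time."""
    return ring_allreduce_bytes_per_device(payload_bytes, devices) / link_bytes_per_s


if __name__ == "__main__":
    grads = 14e9   # hypothetical: 7e9 gradients in fp16 (assumption)
    for link in (25e9, 100e9, 400e9):   # hypothetical link speeds, bytes/s
        t = allreduce_seconds(grads, devices=8, link_bytes_per_s=link)
        print(f"{link / 1e9:.0f} GB/s link: {t:.3f} s per synchronisation")
```

The per-device traffic barely changes as devices are added, so the link speed, not the device count, sets the floor on each synchronisation step.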

Core laws established in Phase 3

What lands here · what recurs later

  • Hardware shapes architecture. Established here. The matmul-shaped transformer, KV-cache-aware attention variants, and mixture-of-experts routing are all responses to silicon. Recurs across all of Phase 4 (every architecture reads as a hardware response), Phase 5 (the cost of training), and Phase 6 (inference and deployment).
  • Constraints shape systems. Reinforced and made physical. Threaded through Phase 2 as a cost model, it becomes concrete here as memory, bandwidth, power, and cooling, and recurs as the limiting factor in every later engineering decision.
  • Representation shapes computation. Callback to L7 and Phase 2. The representation is data that lives in memory and moves across a bus; its size and precision (L25, L32) are hardware decisions as much as modelling ones.
  • The compute spectrum. Sharpened here. The same mechanisms run from a microcontroller (L30) to a datacentre (L33); what changes is the constraint set, and quantisation (L32) is the lever that makes the spectrum traversable.
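The precision-versus-quality trade named in the last bullet can be seen in a few lines. This is a minimal sketch of symmetric, per-tensor quantisation, one simple scheme among many and not necessarily the one the B6 lab uses. The weight values are made up for illustration.

```python
def quantise(values: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantise(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers."""
    return [x * scale for x in q]


def max_abs_error(values: list[float], bits: int) -> float:
    """Worst-case round-trip error at a given bit width."""
    q, scale = quantise(values, bits)
    restored = dequantise(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.75, -0.1]    # made-up values (assumption)
    for bits in (8, 4):
        print(f"{bits}-bit: max error {max_abs_error(weights, bits):.4f}")
```

Each bit removed halves the storage per weight and roughly doubles the rounding step, which is the "what you give up at each step" that L32 measures.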
Bridge to Phase 4

The machines built inside the room.

You now have the foundations (Phase 1), the mathematics (Phase 2), and the hardware (Phase 3). The machine room is no longer mysterious: you can see what the silicon is good at, why moving data costs more than computing, and why a cluster looks the way it does.

The next question is what structures run on top of all that. Phase 4 takes up neural architectures: the perceptron, the multilayer network, convolutional and recurrent nets, attention, and the transformer. Each one will read as a response to a prior limit and to the hardware underneath it, which is exactly the lens this phase built. Now we examine the machines built inside the room.