PHASE 3 · THE SERVER BAY

The general-purpose CPU

Phase 3 · Lesson 1. The old CPU on the shelf, first station in the server bay. ~20 min read + cards + retrieval. Durability tier 1 (bedrock; a CPU trades raw parallel throughput for flexibility and low-latency decisions).

🖥️
Memory palace · Server bay · station 22
The old CPU on the shelf, the first thing you see through the heavy door into the server bay. A worn but dependable machine that runs almost anything you hand it: an operating system, a browser, a database, a game. Lately it sits under a growing pile of matrix calculations, straining. Mapping: the machine = a general-purpose processor; the pile of matrices = AI's arithmetic demand; the strain = the mismatch this phase opens with. Core recall: a CPU is built for flexibility and fast decisions, not bulk parallel arithmetic.
Core idea. A CPU is the universal machine. It runs almost any task you give it and finishes each one fast, which it pays for by spending most of its silicon on flexibility and quick decisions rather than raw arithmetic. That trade makes it superb for general computing and a poor fit for the bulk parallel maths that modern AI runs on.

Why this lesson exists

Phase 2 closed on a demand it couldn't satisfy on its own: modern AI needs an enormous amount of arithmetic, run in parallel, and the cost of it lives in hardware. Phase 3 walks through the heavy door and takes up that hardware. The question for the whole phase is plain. What physical machine actually performs the work the whiteboard wall described?

We start with the CPU, for one reason. It's the machine almost every computer was built around, and the processors AI actually runs on only make sense once you can see what the CPU does well and where it runs out of room. Get the CPU right and the next lesson nearly writes itself.

The machine on the shelf

Picture the server bay you just entered. On a shelf near the door sits an older processor, the kind that's been quietly running things for decades. It's a CPU, a central processing unit, and some version of it has been at the centre of nearly every computer you've used: laptop, phone, the server behind a website, the controller in your car.

It earned that spot by being good at almost everything. Before we can say why AI needed something different, we should be fair about what makes it such a good general machine.

The universal machine

A CPU is a general-purpose machine. Hand it almost any job and it does it: render a web page, run the operating system, recalculate a spreadsheet, answer a database query, drive a game. One piece of silicon, an enormous range of work.

That range is the whole point. The CPU doesn't know in advance what you'll ask of it, so it's built to run arbitrary instructions in arbitrary order, switching from one kind of task to another in nanoseconds. The same chip that just sorted a list can start decompressing an image, then handle a keypress, then talk to the disk.

This flexibility is worth a great deal. It's why one device can run thousands of programs nobody had in mind when the chip was designed.

flowchart LR
  T1["web browser"]:::t --> CPU
  T2["operating system"]:::t --> CPU
  T3["spreadsheet"]:::t --> CPU
  T4["database query"]:::t --> CPU
  T5["video game"]:::t --> CPU["the CPU<br/>one general machine"]:::c
  CPU --> O["runs all of them,<br/>switching in nanoseconds"]:::g
  classDef t fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef c fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef g fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 22.1. The universal machine. Wildly different jobs all run on the same processor, which switches between them in nanoseconds. The CPU isn't told in advance what you'll ask of it, so it's built to run anything. That readiness is the strength this lesson starts from.

The cost of flexibility

Flexibility isn't free. To run any task on demand, a CPU has to spend its resources on being ready for anything.

Real programs are full of decisions. If this value is negative, do one thing; otherwise do another. Those forks are called branches, and a general machine has to handle them fast, often guessing which way a branch will go before it knows for sure, so it doesn't stall waiting. It reorders instructions when that's quicker, tracks what depends on what, and keeps large caches of recent data close so it rarely waits on slow memory.
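To make the guessing concrete, here is a minimal sketch of one classic textbook scheme, the two-bit saturating counter. It's an illustration under assumptions, not how any particular chip works: real predictors are far more elaborate, and the function name and the two branch patterns are invented for this example.

```python
def simulate_two_bit_predictor(outcomes, start_state=2):
    """Return the fraction of branches guessed correctly.

    The counter runs from 0 to 3. States 0 and 1 predict "not taken";
    states 2 and 3 predict "taken".
    """
    state = start_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Nudge the counter toward what actually happened, staying in 0..3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then falls through once, repeated 100 times.
loop_branch = ([True] * 9 + [False]) * 100
# A data-dependent branch that flips every single time.
alternating = [True, False] * 500

loop_accuracy = simulate_two_bit_predictor(loop_branch)
alternating_accuracy = simulate_two_bit_predictor(alternating)
print(f"loop branch: {loop_accuracy:.0%} guessed correctly")
print(f"alternating: {alternating_accuracy:.0%} guessed correctly")
```

With these patterns the loop branch is guessed right 90% of the time and the alternating branch only 50%. Regular decisions are cheap for a CPU to guess; irregular ones are where the stalls come from.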

All of that is silicon spent on versatility. Most of a CPU's transistors go to control, prediction, and memory: the machinery that keeps a few instruction streams moving quickly through unpredictable work. Comparatively little goes to raw arithmetic. For general computing that's the right call. Hold onto it, because it's the exact balance AI will push against.

Latency versus throughput

Here's the distinction the rest of Phase 3 leans on. There are two different things you can ask a machine to be good at, and they pull in different directions.

Latency is how quickly one task finishes. You ask a question, you want the answer now. A CPU is built for low latency: get this one job done, fast.

Throughput is how much total work finishes over time, no matter how long any single piece takes. You have a mountain of work and you care about clearing the whole pile, not about any one item in it.

figure 22.2 · latency vs throughput. Two things a machine can be good at, pulling in different directions. Sports car: 1 passenger, arrives fast (low latency · finishes one task fast = the CPU). Freight train: thousands of tonnes, one trip (high throughput · clears a huge pile over time; slower door to door, enormous total moved).
FIG 22.2. Latency against throughput. The sports car gets one passenger there quickly; that's low latency, and it's the bet the CPU makes. The freight train is slower on any single trip but moves an enormous load at once; that's high throughput. Neither wins in the abstract. The job decides which you want.

The CPU is the sports car. It's tuned to finish each task with the shortest possible delay. That's the right design for software full of decisions, where the next step usually depends on the result of the last and you can't start it until the previous one lands. A courier on a motorbike beats a cargo ship to deliver one envelope. The cargo ship wins the moment you need to move ten thousand containers.
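The same contrast in rough numbers. This sketch uses made-up figures for the two vehicles (the trip times and loads are assumptions chosen for round arithmetic) and treats every trip as one-way.

```python
import math
from dataclasses import dataclass


@dataclass
class Vehicle:
    name: str
    trip_hours: int  # latency: time for one trip, door to door
    load_kg: int     # how much each trip carries

    def throughput_kg_per_hour(self) -> float:
        """Total moved per hour when trips run back to back."""
        return self.load_kg / self.trip_hours

    def hours_to_move(self, kilograms: int) -> int:
        """Time to clear a pile, one full trip at a time."""
        trips = math.ceil(kilograms / self.load_kg)
        return trips * self.trip_hours


# Hypothetical figures, picked only to make the arithmetic easy to follow.
sports_car = Vehicle("sports car", trip_hours=2, load_kg=100)
freight_train = Vehicle("freight train", trip_hours=10, load_kg=5_000_000)

for pile_kg in (100, 5_000_000):
    for vehicle in (sports_car, freight_train):
        hours = vehicle.hours_to_move(pile_kg)
        print(f"{pile_kg:>9,} kg by {vehicle.name}: {hours:,} hours")
```

For the small load the car finishes in 2 hours against the train's 10. For the large one the train still takes 10 hours, while the car needs 50,000 trips. Neither vehicle changed; only the job did.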

Inside the machine

A quick look inside, at the level that matters for intuition. No transistor diagrams.

A CPU is built from a few cores. A core is an independent worker that runs its own stream of instructions. A laptop chip might have 4 to 16 of them; a big server chip has dozens, and the largest pass a hundred. Each core is large and clever, packed with the prediction and reordering machinery from the last section.

Around the cores sit caches: small fast memories that hold recently used data so a core doesn't wait on main memory. They come in levels, a tiny and very fast L1 right next to each core, a larger and slower L2, then a bigger and slower L3 shared across cores. Past all of them is main memory, large but much slower to reach.
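A rough way to see why the levels matter is the standard average-access-time calculation. The cycle counts and hit rates below are illustrative assumptions, not measurements of any real chip, and the function name is made up for this sketch.

```python
def average_access_cycles(levels, memory_cycles):
    """Average cost of one memory access, in cycles.

    levels: list of (access_cycles, hit_rate) pairs, ordered from L1 outward.
    memory_cycles: cost of going all the way to main memory.
    """
    total = float(memory_cycles)
    # Work inward from main memory. Each level costs its own access time,
    # plus the cost of everything beyond it on the accesses it misses.
    for access_cycles, hit_rate in reversed(levels):
        total = access_cycles + (1.0 - hit_rate) * total
    return total


# Assumed numbers: (cycles to access, fraction of accesses that hit).
cache_levels = [(4, 0.95), (12, 0.80), (40, 0.75)]  # L1, L2, L3
main_memory_cycles = 200

with_caches = average_access_cycles(cache_levels, main_memory_cycles)
without_caches = average_access_cycles([], main_memory_cycles)
print(f"with caches:    {with_caches:.1f} cycles per access")
print(f"without caches: {without_caches:.1f} cycles per access")
```

With these assumed numbers the average access costs about 5.5 cycles, against 200 if every access went to main memory. That gap is why so much of the chip is spent on cache.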

Inside each core are the execution units that do the actual work, including the arithmetic units that add and multiply. Notice the proportion: only a small part of the chip is arithmetic. The rest exists to keep a handful of tasks fed and moving fast.

figure 22.3 · inside the machine. A few powerful cores, wrapped in cache and control logic. CPU package: four cores, each with an ALU and an L1 cache plus control, branch prediction, and out-of-order logic (most of the area); a shared L2 / L3 cache (larger, slower); a slow link out to main memory (DRAM: large, slow). Most of the silicon keeps a few streams fast (control, prediction, cache). Only a small part is arithmetic.
FIG 22.3. Inside a CPU, at the level that matters for intuition. A handful of large cores, each with a little arithmetic and a tiny fast L1, share bigger and slower L2/L3 cache, and reach out to large but slow main memory. The proportions are the point: most of the chip exists to keep a few tasks moving quickly, not to do bulk arithmetic.

That layout is the physical form of the latency bet: a small number of powerful cores, wrapped in cache and control logic, all arranged to finish each task quickly.

Where AI creates pressure

Now bring back the whiteboard wall. Phase 2 showed that modern AI is, underneath, a colossal amount of simple arithmetic. A vector is a list of numbers (L11). A matrix multiply (L14) is thousands or millions of multiply-and-adds. Training runs the optimisation loop (L19) over those operations across trillions of examples, and compute scaling (L21) only makes the pile bigger.

Two things stand out about that work. There's a staggering quantity of it. And most of it is independent: the multiplications inside one matrix multiply don't depend on each other, so in principle they could all happen at once.
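A small sketch of what "independent" means here, in plain Python with nested lists. It's written for clarity, not speed, and the function names are invented for the example.

```python
import random


def output_element(a, b, row, col):
    """One output element: a sum of products over a row of a and a column of b."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b):
    rows, cols = len(a), len(b[0])
    # Each (row, col) reads only the inputs, never another output,
    # so these calls could run in any order, or all at once.
    return [[output_element(a, b, r, c) for c in range(cols)] for r in range(rows)]


def multiply_add_count(m, k, n):
    """Multiply-adds needed for an (m x k) times (k x n) product."""
    return m * k * n


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(a, b)
print(product)

# Filling in the outputs in a scrambled order gives the same answer.
cells = [(r, c) for r in range(2) for c in range(2)]
random.shuffle(cells)
scrambled = [[0, 0], [0, 0]]
for r, c in cells:
    scrambled[r][c] = output_element(a, b, r, c)
print(scrambled == product)

print(f"{multiply_add_count(1024, 1024, 1024):,} multiply-adds for 1024 x 1024 matrices")
```

The order of the work doesn't change the result, which is the property that lets it be spread across many arithmetic units. The count grows fast: two 1024 × 1024 matrices already need over a billion multiply-adds.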

That's a throughput problem, and it's the kind of work a CPU isn't shaped for. A CPU has a few fast cores, so it chews through that mountain of arithmetic a small slice at a time. It does the maths correctly. It just can't do enough of it at once to keep up. Most of the chip, the prediction and caching that make it a great general machine, does little for a flat pile of identical multiplications.

figure 22.4 · the mismatch. A flat pile of identical arithmetic, met by a few fast cores. One matrix multiply (L14): millions of independent multiply-adds. CPU: 4 fast lanes; work waiting, cleared a slice at a time. The maths is done correctly, just not fast enough at once.
FIG 22.4. Why a CPU strains on AI work. A single matrix multiply is a huge field of identical, independent operations. A CPU has a few fast lanes, so it clears that field a small slice at a time and a backlog builds. The arithmetic is correct; there just isn't enough of it happening at once. Most of the chip, the part that makes it a great general machine, has nothing to do here.

The wall appears

Push that mismatch far enough and you hit a wall.

As AI workloads grew, the amount of parallel arithmetic stopped being something a handful of fast cores could absorb. You can add a few more cores, run wider arithmetic instructions, buy a faster chip. Each helps a little. But the distance between "a few powerful cores" and "billions of independent multiplications" is too wide to close by making the CPU more of what it already is.

At that point a different question becomes worth asking. What if you gave up some of that flexibility and some of that per-task speed, and spent the silicon the other way: thousands of simple arithmetic units running the same operation across mountains of data at once? You'd lose the sports car. You'd gain the freight train.
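A back-of-envelope sketch of that trade. Every number here is a hypothetical round figure, not a description of any real chip, and the estimate assumes perfectly parallel work with every unit kept busy, which real machines never quite manage.

```python
def peak_ops_per_second(units, ops_per_unit_per_cycle, clock_hz):
    """Upper bound on multiply-adds per second if every unit stays busy."""
    return units * ops_per_unit_per_cycle * clock_hz


# Hypothetical "few fast cores": 8 cores, wide arithmetic instructions, high clock.
few_fast = peak_ops_per_second(
    units=8, ops_per_unit_per_cycle=16, clock_hz=4_000_000_000
)
# Hypothetical "many simple units": thousands of small units, lower clock.
many_simple = peak_ops_per_second(
    units=8192, ops_per_unit_per_cycle=2, clock_hz=1_500_000_000
)

pile = 10**15  # an assumed pile of independent multiply-adds

print(f"few fast cores:    {pile / few_fast:,.0f} s")
print(f"many simple units: {pile / many_simple:,.0f} s")
print(f"ratio: {many_simple / few_fast:.0f}x")
```

Under these assumptions the many-unit design clears the pile about 48 times faster, even though each of its units is slower and simpler than a CPU core.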

That machine exists. Building it, and seeing why it's shaped the way it is, is the next lesson.

flowchart LR
  G["general computing<br/>varied, branchy tasks"]:::a --> D["rising arithmetic<br/>demand"]:::b
  D --> AI["AI workloads<br/>matrices, gradients at scale"]:::b
  AI --> P["pressure a few fast<br/>cores can't absorb"]:::w
  P --> N["a throughput machine<br/>becomes attractive<br/>(next lesson)"]:::g
  classDef a fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef b fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef w fill:#1d2230,stroke:#f87171,color:#e6e8ee;
  classDef g fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 22.5. The pressure that built. General computing made the CPU the universal machine. Then AI turned arithmetic into the dominant cost, at a scale a few powerful cores can't keep up with. That gap is the opening for a different kind of machine, which the next lesson takes up.

Retrieval practice

Write before you reveal. Trace the mechanism; don't summarise.

L22 Why is a CPU called a general-purpose machine, and what does it trade for that?
A CPU is general-purpose because it can run almost any task without being designed for that task ahead of time. It executes arbitrary instructions in arbitrary order and switches between completely different kinds of work in nanoseconds, which is why one chip can run an operating system, a browser, a database, and a game. The trade is where the silicon goes. To stay ready for anything and to finish each task quickly, a CPU spends most of its transistors on control, branch prediction, out-of-order execution, and large caches: the machinery that keeps a few unpredictable instruction streams moving fast. Comparatively little of the chip is raw arithmetic. So the price of flexibility and low latency is that a CPU can't do very much arithmetic at the same time. For general computing that's the right call, because real software is full of decisions and dependencies. For a flat pile of identical maths it's the wrong shape, which is the tension the rest of the phase builds on.
L22 Explain latency versus throughput with a physical analogy, and say which one the CPU is built for and why.
Latency is how quickly a single task finishes; throughput is how much total work gets done per unit of time, regardless of how long any one task takes. The picture is a sports car against a freight train. The sports car carries one passenger and arrives fast, so it's low latency. The freight train is slower on any single trip but moves thousands of tonnes at once, so it's high throughput. Neither is better in the abstract: a courier beats a cargo ship to deliver one envelope, and the cargo ship wins the moment you need ten thousand containers moved. The CPU is the sports car. It's tuned to finish each task with the shortest possible delay, because most software is a chain of decisions where the next step depends on the result of the last, and you can't start it until the previous one lands. Low latency on that serial, branchy work is exactly what you want, which is why the CPU is organised around it: a few powerful cores wrapped in caches and prediction logic, all arranged to get one task done fast.
↩ L20 Interleaved. Phase 2 called AI training a throughput problem (L20). Connect that to why a CPU, with its few fast cores, struggles with it.
L20 made the point that AI training is about chewing through an enormous pile of matrix arithmetic, almost all of it independent, so the goal is total work done, not finishing any single operation fastest. That's throughput. Parallelism is what turns "more hardware" into "more usable throughput," because independent work can run at the same time across many units. A CPU is built the other way. It has a few large cores, each optimised to finish one task with the lowest possible latency, and most of its silicon goes to the prediction and caching that serve serial, branchy code. Point that machine at a flat field of billions of identical multiplies and most of it is wasted: there are only a handful of lanes, so the field gets cleared a small slice at a time while a backlog builds. The CPU is doing exactly what it was designed to do; that design is just the wrong shape for this work. It pours its effort into the thing AI doesn't need here (low latency on one task) and skimps on the thing it does need (huge arithmetic throughput in parallel). That mismatch is the whole reason the phase goes looking for a different machine.
↩ L14 Interleaved. Why is the matrix multiply (L14) the operation that exposes the CPU's limits so cleanly?
A matrix multiply (L14) is the workload that has every property a CPU handles badly and none it handles well. It's enormous: even a single layer can be millions of multiply-and-add operations, and a model runs huge numbers of these. It's uniform: the same operation repeats across the whole matrix, with no branches and no decisions, so the prediction and out-of-order machinery a CPU spends most of its silicon on has nothing to do. And it's independent: each output element is its own sum of products that doesn't depend on the others, so in principle they could all be computed at once. That combination is the definition of a throughput problem. A CPU meets it with a few latency-optimised cores, so it grinds through a tiny slice at a time even though the work is begging to be done in parallel. Because the matrix multiply is also the dominant operation in modern AI (representation, transformation, and the optimisation loop all reduce to it), the mismatch isn't a corner case. It's the main event, which is why it points so directly at building a machine made of many simple arithmetic units instead of a few clever ones.
↳ L23 Forward. From everything here, predict the shape of a machine that would handle AI's arithmetic better than a CPU. What would you give up, and what would you gain?
If the work is a huge pile of independent, identical arithmetic, you'd build a machine that does the opposite of the CPU's bet. Instead of a few large cores optimised to finish one task fast, you'd pack in thousands of small, simple units that all run the same operation across different data at the same time. You'd spend the silicon on arithmetic, not on branch prediction and deep caches, because there are almost no branches to predict here. You'd accept that any single task finishes more slowly (higher latency) in exchange for clearing a vast amount of work per unit time (far higher throughput). In the analogy, you'd trade the sports car for the freight train. What you give up is generality and low-latency response: this machine would be poor at varied, branchy, decision-heavy code, the things the CPU is great at, and it would lean on the CPU to run the program around it. What you gain is the ability to keep up with billions of multiply-adds at once, which is exactly what training and inference demand. That throughput machine is the GPU, and the next lesson takes up why it exists and how it makes that trade.

Next station

That's the first station in the server bay. You can now see the CPU clearly: a general-purpose, low-latency machine that's superb at varied work and badly shaped for a flat mountain of parallel arithmetic. The next station is the GPU board on the rack, the machine built to make exactly the trade this lesson ended on, throughput over latency. L23 explains why GPUs exist.