PHASE 2 · THE WHITEBOARD WALL

Parallelism

Lesson 20. Tenth station on the whiteboard wall. ~23 min read + cards + retrieval. Durability tier 1 (bedrock; modern AI runs on doing the same simple arithmetic many times at once).

🏭

Memory palace · Whiteboard wall · station 20

The conveyor belt, past the slope and valley from L19. A huge belt carries thousands of identical boxes. One worker handles them one at a time; beside them a warehouse holds thousands of workers, each taking a box as it passes, all working at once. The belt never stops. Mapping: one worker = serial execution; the warehouse = parallel execution. Modern AI works because the warehouse exists.

Core idea. The AI revolution did not come from a new kind of mathematics. It came from running familiar mathematics at enormous scale, and that scale is only reachable because the same operation can be done many times at once. Parallelism is the bridge from elegant theory to practical scale.

Why optimisation created a hardware problem

L19 left the training loop looking almost too simple: measure the loss, follow the slope downhill, repeat. The algorithm is not the hard part. The quantity is.

A frontier model has hundreds of billions of parameters, so each gradient is a direction in hundreds of billions of numbers. That gradient is averaged over enormous data, often trillions of tokens. And one step barely moves you, so a run takes hundreds of thousands to millions of steps. None of the individual arithmetic is difficult; there is just an astronomical amount of it. The bottleneck stopped being the maths and became the sheer volume of identical operations, which is a hardware problem, not a theory problem.
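To get a feel for the volume, here is a rough back-of-envelope sketch. Every constant is an illustrative assumption rather than a figure from this lesson: 100 billion parameters, one trillion tokens, the commonly used rule of thumb of roughly 6 arithmetic operations per parameter per token, and a hypothetical serial worker that manages a billion operations per second.

```python
# Back-of-envelope estimate of how much arithmetic a training run needs.
# All constants below are illustrative assumptions, not measured values.

PARAMS = 100e9                 # assumed model size: 100 billion parameters
TOKENS = 1e12                  # assumed training data: one trillion tokens
OPS_PER_PARAM_PER_TOKEN = 6    # rough rule of thumb for forward + backward pass
SERIAL_OPS_PER_SECOND = 1e9    # hypothetical single worker, one op at a time

SECONDS_PER_YEAR = 365 * 24 * 3600


def total_training_ops(params, tokens, ops_per_param_per_token=OPS_PER_PARAM_PER_TOKEN):
    """Approximate total arithmetic operations for one training run."""
    return ops_per_param_per_token * params * tokens


def serial_years(total_ops, ops_per_second):
    """Years needed if every operation ran one after another."""
    return total_ops / ops_per_second / SECONDS_PER_YEAR


ops = total_training_ops(PARAMS, TOKENS)
years = serial_years(ops, SERIAL_OPS_PER_SECOND)
print(f"{ops:.1e} operations, about {years:,.0f} years on one serial worker")
```

Under these assumptions the run needs about 6e23 operations, which is on the order of nineteen million years for one serial worker. The exact numbers matter much less than the shape: easy arithmetic, impossible quantity.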

Serial versus parallel work

Two ways to get a pile of work done. Serial means one thing after another: finish the first, then start the second. Parallel means many things at the same time, on many workers at once.

The everyday version is a supermarket. One cashier serving a long queue is serial; the total time is the sum of every customer. Open twenty lanes and twenty customers check out at once; the queue clears in a fraction of the time. The work per customer is unchanged. What changed is how much happens simultaneously.

flowchart LR
  A["task A"]:::s --> B["task B"]:::s --> C["task C"]:::s --> D["task D"]:::s
  classDef s fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;

FIG 20.1. Serial execution. One task after another on a single worker. Total time is the sum of the parts; nothing starts until the previous thing finishes.

flowchart LR
  S(("start")):::n --> A["task A"]:::p
  S --> B["task B"]:::p
  S --> C["task C"]:::p
  S --> D["task D"]:::p
  A --> E(("done")):::n
  B --> E
  C --> E
  D --> E
  classDef p fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
  classDef n fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;

FIG 20.2. Parallel execution. Independent tasks run at the same time on separate workers. If they truly don't depend on each other, the total time is close to the time of a single task, not the sum.
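The two figures can be turned into a small timing model. This is a minimal sketch that assumes the tasks are fully independent and that handing a task to a worker costs nothing; the function names and the checkout numbers are made up for illustration.

```python
import heapq


def serial_time(durations):
    """One worker: total time is the sum of every task."""
    return sum(durations)


def parallel_time(durations, workers):
    """Independent tasks handed, in order, to whichever worker frees up first.

    Returns the moment the last worker finishes.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)          # earliest-free worker takes the task
        heapq.heappush(free_at, start + duration)
    return max(free_at)


checkout = [3.0] * 20   # hypothetical: 20 customers, 3 minutes each
print(serial_time(checkout))          # one cashier: 60.0 minutes
print(parallel_time(checkout, 4))     # four lanes: 15.0 minutes
print(parallel_time(checkout, 20))    # twenty lanes: 3.0 minutes
```

With twenty lanes the total equals the time of a single customer, which is the "close to the time of a single task" case from FIG 20.2. The work per customer never changed.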

When parallelism works, and when it doesn't

Splitting work across many workers only helps when the pieces are independent. Grading a thousand exams parallelises perfectly: each exam is its own job, and a hundred graders finish roughly a hundred times faster. Processing a thousand photos is the same story.

Some jobs refuse to split. A chain of steps in which each one needs the result of the one before has to run in order, no matter how many workers you have. You cannot pour the foundation, frame the walls, and roof the house at the same time; the sequence is fixed. These are dependency constraints, and they are the hard limit on parallelism.

The practical consequence is worth stating plainly, because it matters for the next lesson. If even a small fraction of a job is unavoidably serial, that fraction caps how much faster more workers can make it. Add workers and the time spent on the parallel part shrinks toward zero, but the serial part stays the same length, so the speedup flattens out. More hardware does not buy unlimited speed; it buys speed up to the point the serial part allows.
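This cap is usually written as Amdahl's law: if a fraction s of the job is serial and the rest splits perfectly across n workers, the speedup is 1 / (s + (1 - s) / n). The sketch below assumes that idealised split with no communication cost; the 5% serial fraction is an arbitrary example value.

```python
def amdahl_speedup(serial_fraction, workers):
    """Idealised speedup when only the non-serial part benefits from more workers."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("need at least one worker")
    parallel_fraction = 1.0 - serial_fraction
    # Serial part is unchanged; parallel part is divided among the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job in which 5% of the work must run in order.
for n in (1, 10, 100, 1000, 1_000_000):
    print(f"{n:>9} workers -> {amdahl_speedup(0.05, n):6.2f}x")
```

With a 5% serial fraction the speedup climbs quickly at first and then stalls just below 1 / 0.05 = 20, however many workers are added. That ceiling is the flattening the paragraph describes.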

Why AI parallelises so well

Here is the insight the whole lesson turns on. The dominant operation in a neural network is matrix multiplication (L14), and a large matrix multiply is built from an enormous number of small arithmetic operations that mostly do not depend on each other. Each output number is its own little sum of products, and thousands of them can be computed at the same time without any one waiting on another.

That makes AI workloads unusually friendly to parallel hardware. The thing models spend almost all their time doing, multiplying big matrices during both training and inference, is close to the ideal case from the previous section: huge amounts of independent work. This was not pure luck; the architectures that won (L14) were the ones whose core operation parallelises, because those were the ones the available hardware could run fast.
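The independence is easy to see in code. In this minimal sketch each output entry is computed by its own function call that reads only the two input matrices, so the calls can be handed to a pool of workers in any order. The helper names are hypothetical, and the thread pool is there to show independence, not speed: in standard CPython, threads do not run this kind of arithmetic any faster.

```python
from concurrent.futures import ThreadPoolExecutor


def output_entry(a, b, i, j):
    """Entry (i, j) of a @ b: one sum of products that reads only the inputs."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_independent(a, b, max_workers=4):
    """Multiply two matrices by computing every output entry as a separate job."""
    rows, cols = len(a), len(b[0])
    coords = [(i, j) for i in range(rows) for j in range(cols)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # No job needs another job's result, so they can run in any order.
        values = list(pool.map(lambda ij: output_entry(a, b, *ij), coords))
    result = [[0] * cols for _ in range(rows)]
    for (i, j), value in zip(coords, values):
        result[i][j] = value
    return result


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_independent(a, b))   # [[19, 22], [43, 50]]
```

A 2 by 2 product has four independent entries. A product of two 10,000 by 10,000 matrices has a hundred million, which is the kind of pile parallel hardware is built for.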

FIG 20.3. Two ways to build a processor. A CPU spends its transistor budget on a few sophisticated cores that are excellent at varied, branchy, one-at-a-time work. A GPU spends the same budget on thousands of simpler cores that do the same operation across many numbers at once. Neither is universally better; they are bets on different workloads, and matrix-heavy AI is the workload the GPU's bet wins.

CPU intuition

A CPU is built around a small number of very capable cores. Each one is sophisticated: it can branch (do different things depending on a condition), reorder work, juggle many different kinds of task, and switch contexts quickly. That makes a CPU the right tool for running an operating system, handling logic full of decisions, and doing varied general-purpose work where the next step often depends on the last.

The design bet is flexibility and speed on a single thread of work: get one complicated task done quickly. That bet is exactly right for most of computing, and it's why CPUs run everything. It's just not the bet that matters most when the job is the same arithmetic repeated a billion times.

GPU intuition

A GPU makes the opposite bet. Instead of a few clever cores, it has thousands of simpler ones, and they are happiest doing the same operation across many different pieces of data at the same time. That pattern, one instruction applied to many data values at once, is the core idea of SIMD (single instruction, multiple data), and it is what a GPU is built to do at scale.

Matrix and vector operations fit this perfectly: the same multiply-and-add, applied across thousands of numbers, with no branching and few decisions. GPUs were created to render graphics, which is exactly this kind of repeated arithmetic over millions of pixels, and AI inherited them because training is the same shape of work. When people say AI "loves" GPUs, the mechanism is just this: the model's dominant operation is a wall of independent arithmetic, and the GPU is a wall of workers that do independent arithmetic.

Data parallelism

The first way to spread a training run across many workers is data parallelism. Take one model, place an identical copy on each worker, and give each copy a different batch of data. Each copy computes a gradient on its own batch at the same time, and the gradients are then combined into a single update that's applied to every copy, keeping them in sync. More copies means more data processed per unit of time.

flowchart LR
  D1["batch 1"]:::d --> M1[["model copy"]]:::m --> G1["gradient"]:::g
  D2["batch 2"]:::d --> M2[["model copy"]]:::m --> G2["gradient"]:::g
  D3["batch 3"]:::d --> M3[["model copy"]]:::m --> G3["gradient"]:::g
  G1 --> C[["combine"]]:::c
  G2 --> C
  G3 --> C --> U["one updated model<br/>copies kept in sync"]:::good
  classDef d fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef m fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef g fill:#1d2230,stroke:#9aa3b2,color:#e6e8ee;
  classDef c fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 20.4. Data parallelism. The same model, copied across workers, each chewing through a different batch at once. The gradients are combined into one update so the copies stay identical. Combining the gradients takes communication between workers, which is one of the costs that keeps the speedup from being free.
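A toy version of the figure, as a sketch rather than a real training setup. It assumes a one-parameter model y_hat = w * x, a mean-squared-error loss, equal-sized shards, and "combine" meaning a plain average; the loop over shards stands in for workers that would really run at the same time. All names and numbers are illustrative.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, learning_rate):
    """One update: every copy computes a gradient on its shard, then they are combined."""
    # In a real system each of these runs on a different worker at once.
    grads = [gradient(w, shard) for shard in shards]
    combined = sum(grads) / len(grads)       # the "combine" box in the figure
    # The same update is applied to every copy, so all copies stay identical.
    return w - learning_rate * combined


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # points on y = 2x
shards = [data[:2], data[2:]]                              # two workers, two points each

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, learning_rate=0.05)
print(round(w, 4))   # 2.0
```

Because the shards are the same size, averaging the per-shard gradients gives exactly the gradient of the whole batch, so the copies take the same step a single worker would have taken on all the data. The averaging is also where the communication cost from the caption comes in: every worker has to send its gradient somewhere before anyone can update.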

Model parallelism

Sometimes the model itself is too large to fit on a single machine. Then you use model parallelism: split the model into pieces and place each piece on a different machine. An input flows through part one on machine A, whose output feeds part two on machine B, and so on. The machines work on different parts of the same forward and backward pass.

flowchart LR
  I["input"]:::vec --> P1["model part 1<br/>machine A"]:::m --> P2["model part 2<br/>machine B"]:::m --> P3["model part 3<br/>machine C"]:::m --> O["output"]:::good
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef m fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 20.5. Model parallelism. When a model is too big for one machine, its pieces live on different machines and pass their results along the chain. Note the dependency: part two can't start until part one hands over its output, so this kind of split carries a built-in serial element that pure data parallelism avoids.
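The dependency in the figure can be sketched with three stand-in functions. Everything here is hypothetical: the three "parts" are trivial placeholders for slices of a real model, and the per-stage costs are invented numbers used only to show how the times add up.

```python
def part_1(values):
    """Stand-in for the slice of the model that lives on machine A."""
    return [v * 2 for v in values]


def part_2(values):
    """Stand-in for the slice on machine B."""
    return [v + 1 for v in values]


def part_3(values):
    """Stand-in for the slice on machine C."""
    return sum(values)


# (machine name, stage function, assumed cost in milliseconds)
PIPELINE = [
    ("machine A", part_1, 4.0),
    ("machine B", part_2, 4.0),
    ("machine C", part_3, 4.0),
]


def forward(values, pipeline):
    """Push one input through the stages in order; return (output, elapsed_ms)."""
    elapsed_ms = 0.0
    for _machine, stage, cost_ms in pipeline:
        values = stage(values)     # each stage needs the previous stage's output
        elapsed_ms += cost_ms      # so the stage costs add up instead of overlapping
    return values, elapsed_ms


output, elapsed_ms = forward([1, 2, 3], PIPELINE)
print(output, elapsed_ms)   # 15 12.0
```

Three machines are involved, yet a single input still takes the sum of the three stage times. In this simple form the split buys capacity (a model that would not fit on one machine) rather than speed for any one input.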

Data parallelism and model parallelism are the two basic moves, and real frontier training combines them. The details (how the pieces are split, how the machines talk) belong to the hardware phase. The intuition is enough here: you can scale by running more copies, by splitting the model, or both.

Throughput versus latency

Two different things can mean "fast," and AI usually cares about the second one. Latency is how long a single task takes from start to finish. Throughput is how much total work gets done per unit of time. They are not the same, and optimising for one can cost the other.

A CPU core gives low latency: hand it one task and it finishes quickly. A GPU gives high throughput: hand it ten thousand identical tasks and it gets through the pile far faster than a CPU could, even though any single one of them might take longer than it would on the CPU. Training cares about chewing through trillions of tokens, which is a throughput problem, so the throughput machine wins. This is the real reason AI hardware evolved around throughput: the workload is a huge pile of identical work, and finishing the pile is what matters, not finishing any one piece quickly.
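A small numeric sketch makes the trade concrete. The two device profiles are invented for illustration, not measurements of real hardware: a "CPU" that finishes a task in 1 ms but handles 8 at a time, and a "GPU" that needs 5 ms per task but handles 10,000 at a time.

```python
import math


def time_to_finish(tasks, latency_ms, width):
    """Time to clear `tasks` identical tasks on a device that runs `width` at once.

    Tasks are processed in waves; each wave takes one task latency.
    """
    waves = math.ceil(tasks / width)
    return waves * latency_ms


# Hypothetical device profiles, chosen only to illustrate the trade-off.
CPU = {"latency_ms": 1.0, "width": 8}
GPU = {"latency_ms": 5.0, "width": 10_000}

for tasks in (1, 10_000):
    cpu_ms = time_to_finish(tasks, **CPU)
    gpu_ms = time_to_finish(tasks, **GPU)
    print(f"{tasks:>6} tasks: CPU {cpu_ms:8.1f} ms, GPU {gpu_ms:8.1f} ms")
```

For a single task the low-latency device wins, 1 ms against 5 ms. For ten thousand tasks the wide device finishes in one 5 ms wave while the narrow one needs 1,250 waves. Same devices, opposite winners, depending on whether the job is one task or a pile.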

The hidden engine of scaling

Put the chain together. Optimisation (L19) demands an enormous amount of arithmetic. Almost all of that arithmetic is independent matrix work (L14), which means it can be done in parallel. Parallel hardware does it at a scale a single serial worker never could. And that scale is what makes a model with hundreds of billions of parameters trainable at all.

flowchart LR
  D["data<br/>trillions of tokens"]:::vec --> MM["matrix operations<br/>mostly independent"]:::m --> PH[["parallel hardware<br/>many workers at once"]]:::p --> F["training at scale<br/>that would otherwise be impossible"]:::good
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef m fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef p fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
  classDef good fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 20.6. The training pipeline, in one line. Data drives matrix operations; those operations are independent enough to run on parallel hardware; parallel hardware makes the scale practical. Remove the parallelism and modern foundation models would not exist, because the same work on a serial machine would take lifetimes.

What this lesson does and doesn't claim

Three things to keep precise, because they're easy to overstate. GPUs are not always faster: for a single small, branchy, decision-heavy task, a CPU's low latency wins, and a GPU sitting idle between bursts of identical work is wasted silicon. Not everything parallelises: any job with a chain of dependencies has a serial backbone that more workers cannot shorten. And parallelism does not remove scaling limits: it lets you bring more hardware to bear, but communication between workers, the cost of combining results, and the unavoidable serial fraction all mean that doubling the hardware rarely doubles the speed.

That last point is the door into the next lesson. Parallelism answers "how do we use more hardware at all." It does not answer "how much do we actually gain when we do."
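As a first taste of why the gain is not proportional, here is a deliberately crude cost model. It is an assumption-laden toy, not a description of any real cluster: one training step has a fixed serial part, a part that divides evenly among the workers, and a communication cost that is assumed to grow linearly with the number of extra workers. All three constants are invented.

```python
def step_time(workers, serial=1.0, parallel=100.0, comm_per_extra_worker=0.01):
    """Toy time for one step: serial part + divided parallel part + communication.

    The linear communication term is an assumption made for illustration.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    communication = comm_per_extra_worker * (workers - 1)
    return serial + parallel / workers + communication


def speedup(workers, **costs):
    """How many times faster than a single worker under the same toy costs."""
    return step_time(1, **costs) / step_time(workers, **costs)


for n in (1, 10, 100, 1000):
    print(f"{n:>5} workers -> {speedup(n):6.2f}x")
```

In this toy, ten workers give a bit over nine times the speed, a hundred give roughly thirty-four times, and a thousand fall back to about nine times because the communication term has swamped the saving. The particular numbers mean nothing; the shape (rise, flatten, and possibly fall) is the point.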

compression · what to carry forward

Serial work is one thing after another; parallel work is many independent things at once. Parallelism helps only when the pieces don't depend on each other.
A chain of dependencies has a serial backbone that more workers can't shorten, and even a small serial fraction caps the achievable speedup.
AI parallelises unusually well because its dominant operation, matrix multiplication (L14), is a huge number of independent arithmetic operations.
A CPU is a few powerful, flexible cores (good at varied, branchy work, low latency); a GPU is thousands of simple cores doing the same operation on many numbers at once (SIMD, high throughput).
Data parallelism: one model, many copies, different batches, gradients combined. Model parallelism: split a too-large model across machines.
Latency is how fast one task finishes; throughput is how much total work gets done per unit of time. AI training is a throughput problem, which is the bet GPUs make.
Parallelism is the bridge from elegant theory to practical scale; without it, foundation models would not be trainable.
Parallelism lets you use more hardware; it does not guarantee proportional gains, because of dependencies, communication, and the serial fraction.

What you should be able to do now

Define serial and parallel execution and say when each is forced on you.
Explain why dependency chains limit parallelism, and why a serial fraction caps speedup.
Explain why matrix multiplication is unusually parallel-friendly, tying back to L14.
Contrast CPU and GPU design philosophies, and say which workloads each suits.
Describe data parallelism and model parallelism in plain terms, without implementation detail.
Distinguish throughput from latency and say why AI training is a throughput problem.
State, precisely, the limits of parallelism (no universal GPU win, no infinite scaling).

Retrieval practice

Write before you reveal. Trace mechanism; don't summarise.

L20 Explain serial versus parallel execution, and say what decides whether a job can be parallelised at all.

Serial execution does one thing after another on a single worker: the total time is the sum of the parts, and nothing starts until the previous step finishes. Parallel execution runs many pieces at the same time on separate workers: if the pieces are independent, the total time drops toward the time of a single piece rather than the sum. The supermarket version is one cashier (serial) versus twenty lanes (parallel); the work per customer is unchanged, but far more happens at once. What decides whether a job can be parallelised is dependency. If the pieces are independent (grade a thousand exams, process a thousand photos), you can hand them to many workers and finish roughly N times faster. If each step needs the result of the one before it (a strict sequence, like pour foundation, then frame, then roof), the order is fixed and more workers can't help. Most real jobs are a mix, and the serial part is the limiter: even a small fraction that must run in order caps how much faster extra workers can make the whole thing, because that part stays the same length no matter how many workers you add.

L20 Explain why matrix multiplication parallelises so well, and why that matters for AI specifically.

A large matrix multiply is built from an enormous number of small arithmetic operations, and most of them don't depend on each other. Each entry of the output is its own independent sum of products: computing it needs only the input numbers, not the value of any other output entry. So thousands of those little sums can be computed at the same time, on thousands of workers, with nobody waiting on anyone else. That's the ideal case for parallel hardware: huge amounts of independent work and almost no dependency chain. It matters for AI specifically because matrix multiplication is the dominant operation in a neural network (L14): training and inference spend almost all their time multiplying big matrices. So the thing models do most is exactly the thing that parallelises best, which is why throughput hardware like GPUs can train them at all. It also explains a piece of history: the architectures that won were matmul-heavy, partly because matmul was the operation the available parallel hardware could run fast. The maths and the machine reinforced each other.

↩ L19 Interleaved. How does optimisation, from L19, create the need for parallelism?

Optimisation is the training loop from L19: compute the gradient of the loss, take a small step downhill, repeat. The loop is simple, but its scale is enormous. The gradient is a direction in hundreds of billions of parameters, it's averaged over trillions of tokens, and progress comes one small step at a time, so a run takes hundreds of thousands to millions of steps. None of the arithmetic is hard; there is just an astronomical amount of it, and almost all of it is the same operation (matrix multiplication) repeated. When the bottleneck is volume of identical, independent work rather than the difficulty of any one operation, the way to go faster is to do many copies at once rather than invent a cleverer step. That's exactly what parallelism provides. So optimisation creates the need: it specifies a colossal pile of independent arithmetic, and parallel hardware is what makes finishing that pile possible in a reasonable time. The downhill walk explains how a model improves; parallelism explains how you can afford to take billions of those steps. Without it, the same optimisation on a serial machine would take lifetimes.

↳ ahead Forward. Parallelism lets you throw more hardware at training. Predict why adding more hardware might not produce a proportional gain.

Because not all of the work is perfectly independent, and the parts that aren't set a ceiling. Three reasons. First, the serial fraction: any piece of the job that must run in order doesn't get faster with more workers, so once the parallel part is spread thin, the serial part dominates and the speedup flattens. Second, communication overhead: workers have to coordinate, and in data parallelism the gradients from every copy must be gathered and combined each step; that talking-to-each-other cost grows as you add workers and eventually eats the benefit. Third, diminishing returns in the work itself: even if you could run twice as fast, it's a separate question whether a model trained with twice the compute is twice as good, and the answer is usually no. So doubling the hardware rarely doubles the result; you get gains, but they taper. That gap between "more hardware" and "proportional gain" is exactly what the next lesson is about: compute scaling asks, if we double the hardware (or the compute, or the data), what do we actually get back, and the honest answer is less than double.

L20 A colleague says "GPUs are just faster than CPUs, full stop." Correct them, using throughput and latency.

It depends entirely on the workload, and the words to make that precise are latency and throughput. Latency is how long a single task takes start to finish; throughput is how much total work gets done per unit of time. A CPU has a few powerful, flexible cores optimised for low latency: hand it one complicated, branchy, decision-heavy task and it finishes quickly, and it's the right tool for operating systems and general logic where the next step depends on the last. A GPU has thousands of simple cores optimised for throughput: hand it a huge pile of identical, independent arithmetic and it clears the pile far faster than a CPU could, even though any single one of those operations might take longer than it would on the CPU. So a GPU is not "just faster." For one small serial task it can be slower, and a GPU fed branchy, dependency-heavy work sits mostly idle and wastes silicon. The GPU wins precisely when the job is bulk parallel arithmetic, which is what AI training is (trillions of tokens of matrix work). The correct statement is that GPUs win on throughput-bound, parallel workloads and lose on latency-bound, serial ones; AI training happens to be the former.

Next station

You now have the reason hardware sits at the centre of modern AI: optimisation demands a mountain of independent arithmetic, and parallelism is how that mountain gets climbed in finite time. Parallelism tells you that you can use more hardware. It doesn't tell you how much you gain when you do. Doubling the machines does not double the result, and understanding why, and what the returns to scale actually look like, is the question that closes the phase. L21 puts the lever and machine on the wall.
