Two AI systems can share the exact same architecture, sit on the same silicon, and run the same kernels, yet produce wildly different behaviour because they were trained with different learning signals. An ImageNet classifier and a CLIP-style vision-language model can use identical backbones; what separates them is what the loss function asked the optimiser to make true.
Modern AI lumps together 4 distinct training regimes: supervised, unsupervised, self-supervised, reinforcement. Each one exposes the optimiser to a different shape of gradient. Choose a different shape, get a different optimised state.
Supervised learning is the cleanest case. Each training input comes with a target. Predict, measure error, update. Dense, immediate, unambiguous. ImageNet's million-plus labelled images train a classifier to match predicted labels to true ones, with every example contributing a clean gradient each time it is seen.
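The predict, measure, update loop can be sketched in a few lines. This is a minimal illustration, not an ImageNet-scale recipe: the linear model, the five labelled pairs, the learning rate, the step count, and the name `train_linear` are all assumptions chosen for brevity.

```python
# Toy supervised learning: fit y = w*x + b to labelled pairs with plain
# gradient descent on mean squared error. All values are illustrative.

def train_linear(pairs, lr=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in pairs:
            err = (w * x + b) - y       # predict, then measure error against the label
            grad_w += 2 * err * x / n   # d(MSE)/dw contribution of this example
            grad_b += 2 * err / n       # d(MSE)/db contribution of this example
        w -= lr * grad_w                # update
        b -= lr * grad_b
    return w, b

# Labelled examples generated from y = 2x + 1 (the rule the labels encode).
labelled = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = train_linear(labelled)
print(f"w={w:.3f}, b={b:.3f}")
```

Every labelled pair contributes to the gradient on every pass, which is what makes the signal dense; the model can only recover the rule because the labels already encode it.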
The cost is labels. ImageNet itself was a multi-year, thousands-of-annotators undertaking. Medical-grade segmentation costs tens of dollars per image. Anything that needs expert judgement bottlenecks at labelling long before it bottlenecks at compute.
Failure case: distribution shift. The 2010-spam classifier from L3 learned the labels it had; the labels didn't include what spammers later invented.
Unsupervised learning has no labels. The signal comes from the data's own structure: clusters, modes, low-dimensional summaries. The win is that anything you can collect, you can train on.
The cost is alignment with what you wanted. K-means on a million product reviews finds clusters, but those clusters might be "long vs short", "English vs Spanish", or "five-star vs one-star". The objective doesn't know which axis matters. It minimises what you wrote down, and what you wrote down rarely matches what you actually want.
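A small sketch makes the wrong-axis problem concrete. It assumes a hypothetical set of eight reviews described by two raw features (word count and star rating) and a bare-bones k-means with a deterministic "first k points" initialisation; none of this comes from a real dataset or library.

```python
# Bare-bones k-means on (review_length_in_words, star_rating) pairs.
# The data and the initialisation scheme are illustrative assumptions.

def kmeans(points, k, iters=20):
    centroids = [points[i] for i in range(k)]  # deterministic init: first k points
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid (squared distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

reviews = [(20, 1), (25, 5), (30, 1), (22, 5),
           (400, 1), (410, 5), (390, 1), (405, 5)]
assignment, centroids = kmeans(reviews, k=2)

clusters = {c: [r for r, a in zip(reviews, assignment) if a == c] for c in range(2)}
stars_per_cluster = {c: sorted({stars for _, stars in members})
                     for c, members in clusters.items()}
print(assignment)          # which cluster each review landed in
print(stars_per_cluster)   # star ratings present in each cluster
```

Both clusters end up containing one-star and five-star reviews. The objective was minimised faithfully, but along the length axis, because unscaled word counts dominate the squared distance. Rescaling the features would change which axis wins, which is exactly the point: the choice is yours, not the objective's.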
Historically, unsupervised methods alone produced weaker representations than supervised baselines. Useful for preprocessing, anomaly detection, and exploration. Not the engine of modern AI.
Failure case: clusters perfectly tight by the loss function and meaningless for the downstream task.
Self-supervised learning is the synthesis. Take an enormous unlabelled corpus. Make the system predict parts of the input from other parts. Every example contributes a dense supervised-style gradient, but the labels were already in the data.
Next-token prediction is the dominant variant: a GPT-style model takes long text and predicts each next token from the tokens before it. A 2-trillion-token corpus yields on the order of 10^12 per-token prediction targets with no annotator cost. Masked language modelling and contrastive vision-language training are sibling variants on the same idea: free supervision from data structure.
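How the labels fall out of the data can be shown directly. The sketch below assumes whitespace splitting as a stand-in for a real tokeniser, and `next_token_examples` is a hypothetical helper name; the only point is that a sequence of n tokens yields n - 1 (context, target) pairs without anyone annotating anything.

```python
# Build next-token training pairs from raw text. Whitespace splitting is an
# illustrative stand-in for a real subword tokeniser.

def next_token_examples(tokens):
    """Each prefix of the sequence is an input; the token that follows is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(f"{' '.join(context):<20} -> {target}")
```

In practice a model scores all positions of a sequence in one forward pass rather than materialising each prefix, but the supervision it receives is the same set of pairs.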
The win is that the labelling bottleneck disappears. Scaling becomes compute-bounded and bandwidth-bounded instead of label-bounded. This is the mechanical reason self-supervised learning rewrote the field around 2018-2020.
The cost is that the proxy task may not match the downstream task. A model excellent at predicting next tokens isn't automatically excellent at answering questions truthfully or following instructions. The whole post-training stack (fine-tuning, RLHF) exists to translate proxy capability into target capability.
Failure case: shortcuts. CLIP latched onto text rendered inside images because reading the word "dog" printed in a picture is easier than learning what a dog looks like. The gradient went where the structure was easiest, not where the structure was deepest.
Reinforcement learning changes the shape entirely. No labels and no data already in the world. The system (now called an agent) takes actions in an environment and occasionally receives a reward. The signal is sparse, delayed, and shaped by the agent's own choices about what to do next.
An AlphaGo policy network playing Go gets one reward at the end of a roughly 200-move game: win or lose. That single bit has to be propagated back across the trajectory, attributing credit to every decision. This is credit assignment, and it's why RL is hard.
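One common way to spread a terminal reward back over a trajectory is the discounted return, where each step is credited with the reward that follows it, shrunk by a factor `gamma` per step of delay. The sketch below is a generic illustration under that assumption, not a description of how AlphaGo itself was trained; `gamma = 0.99` and the 200-step episode are illustrative values.

```python
# Discounted returns: G_t = r_t + gamma * G_{t+1}, computed backwards
# from the end of the episode. gamma is an illustrative choice.

def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 200-step episode with a single reward of 1.0 on the final step (a win).
rewards = [0.0] * 199 + [1.0]
returns = discounted_returns(rewards)

print(f"credit at final move: {returns[-1]:.3f}")
print(f"credit at first move: {returns[0]:.3f}")
```

Every move receives some credit, but the rule has no way of knowing which moves actually mattered. It only knows how far each one was from the outcome, which is why credit assignment stays hard.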
Exploration is the second axis. A supervised model is shown what to learn from. An RL agent has to discover what to learn from by trying things. Too much exploration and you learn nothing; too much exploitation of current strategy and you miss better ones. The trade-off is tuned per problem.
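The trade-off shows up even in the simplest setting, a multi-armed bandit. The sketch below uses epsilon-greedy action selection, one simple exploration scheme among many; the three arm probabilities, the step count, the seed, and the name `run_bandit` are all assumptions made for illustration.

```python
import random

# Epsilon-greedy on a Bernoulli bandit: with probability epsilon pick a random
# arm (explore), otherwise pick the arm with the best estimate so far (exploit).

def run_bandit(true_means, epsilon, steps=5000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))                           # explore
        else:
            arm = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total / steps, counts

arms = [0.2, 0.5, 0.8]  # hypothetical win probabilities; the last arm is best
greedy_avg, greedy_counts = run_bandit(arms, epsilon=0.0)
mixed_avg, mixed_counts = run_bandit(arms, epsilon=0.1)

print("pure exploitation:", greedy_counts, round(greedy_avg, 3))
print("epsilon = 0.1:    ", mixed_counts, round(mixed_avg, 3))
```

With `epsilon=0.0` the agent never leaves the first arm it tried: its estimate can never fall below the untouched arms' initial zero, so the better arms are never sampled. A small amount of exploration is enough to find them, at the price of some deliberately suboptimal pulls.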
The win is that RL fits when the right action can only be discovered, not labelled: games, robot locomotion, dialogue policies, RLHF post-training of language models. The cost is sample efficiency. A naive RL agent in a complex environment may need billions of interactions to reach competence comparable to a supervised vision model trained on millions of examples. Modern RL leans on simulation: reproduce the environment in software and run it far faster than real time, on the order of 1000×.
Failure case: reward hacking. The agent finds a way to maximise the reward signal that doesn't match what the designer intended. Vacuum robots told to "cover floor area" can spin in place. RLHF models told to "be helpful" can become sycophantic. The optimiser is doing exactly what the signal asked.
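The vacuum example can be reduced to a toy. The sketch assumes a hypothetical, mis-specified reward that pays +1 for every cell swept per step, revisits included, while the designer actually cares about distinct cells cleaned; both functions and both paths are invented for illustration.

```python
# Toy reward hacking: the proxy reward and the intended objective disagree.

def proxy_reward(path):
    """Mis-specified signal: +1 for every swept cell, revisits included."""
    return len(path)

def intended_coverage(path):
    """What the designer meant: number of distinct cells cleaned."""
    return len(set(path))

sweep = [(x, 0) for x in range(10)]  # drives across ten distinct cells
spin = [(0, 0)] * 10                 # rotates in place for ten steps

for name, path in [("sweep", sweep), ("spin", spin)]:
    print(f"{name}: proxy={proxy_reward(path)}, intended={intended_coverage(path)}")
```

Both behaviours earn the same proxy reward, so an optimiser that sees only the proxy has no reason to prefer sweeping, and spinning is cheaper to discover.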
Each paradigm produces a different shape of gradient information per training step.
Self-supervised is the densest: every token labels the position before it. A 2T-token corpus generates 2T supervised-style targets without paying an annotator for any of them. Supervised is dense but bounded: 1M labels give roughly 1M gradient signals per epoch. RL is sparse: one scalar at the end of an episode, distributed back across hundreds of decisions. Unsupervised is dense but informationally weak: it tells you which group a point belongs to, not what about the group matters.
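The density comparison is simple arithmetic, worked through below with the figures already used in this lesson: a 2T-token corpus, 1M labels, and the roughly 200-move Go game with one terminal reward. Treating one token as one target and one label as one signal is a simplifying assumption.

```python
# Back-of-envelope signal density per paradigm, using the lesson's own figures.

corpus_tokens = 2 * 10**12       # self-supervised: about one target per token
labelled_examples = 10**6        # supervised: one target per label, per epoch
decisions_per_episode = 200      # RL: moves in one game of Go (approximate)
rewards_per_episode = 1          # RL: a single win/lose scalar

ratio = corpus_tokens // labelled_examples
rl_signals_per_decision = rewards_per_episode / decisions_per_episode

print(f"self-supervised targets per pass: {corpus_tokens:.0e}")
print(f"supervised targets per epoch:     {labelled_examples:.0e}")
print(f"self-supervised / supervised:     {ratio:,}x")
print(f"RL reward signals per decision:   {rl_signals_per_decision}")
```

A single pass over the corpus carries about two million times as many targets as an epoch over the labelled set, while the RL agent gets a two-hundredth of a reward signal per decision it makes.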
The systems view of why the field looks the way it does.
Before 2017, supervised learning dominated. Labelling was the bottleneck, and architectures evolved to extract maximum signal from limited labels.
From 2017 to 2020, self-supervised learning at scale broke that bottleneck. Compute and data quality became the constraints. Models exploded in size. The transformer won partly on merit, partly because it parallelises well enough to absorb the data flood.
From 2020 onward, RL post-training closed the loop. Self-supervision builds the base representations. Supervised fine-tuning shapes them toward useful behaviour. RL optimises against the actual objective. Modern LLM pipelines use 3 of the 4 paradigms.
Synthetic data sits at the intersection. When real data is expensive (medical, robotics) or sensitive (legal, defence), generated examples let you pay compute instead of labelling cost. The trade-off is whether the synthetic distribution matches reality closely enough to transfer.
Each paradigm has a different relationship with silicon. Self-supervised maps cleanly to GPU throughput. Long sequences, dense per-token loss, contiguous matmuls. The hardware stays busy. At scale, training tends to be bandwidth-limited at the interconnect rather than compute-limited at the cores.
RL often underutilises hardware. Environment steps are sequential, frequently CPU-bound, capped by simulator speed. The GPU sits idle waiting for experience. The systems fix is hundreds or thousands of parallel simulators feeding batched experience to the GPU at throughput. Half of modern RL infrastructure is plumbing for this.
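The batching idea can be sketched without any RL library. Everything here is hypothetical: `CounterEnv` is a trivial stand-in for a simulator, `batched_policy` stands in for one GPU forward pass, and the environments are stepped in a plain loop where a real system would run them in separate processes or on separate machines.

```python
# Many environments, one batched policy call per step.

class CounterEnv:
    """Toy environment: the state is a step counter; episodes last `horizon` steps."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done else 0.0
        obs = self.reset() if done else self.t  # auto-reset at episode end
        return obs, reward, done

def batched_policy(observations):
    """Stand-in for a single forward pass over the whole batch of observations."""
    return [0 for _ in observations]

def collect(num_envs, transitions_needed):
    envs = [CounterEnv() for _ in range(num_envs)]
    obs = [env.reset() for env in envs]
    transitions, policy_calls = [], 0
    while len(transitions) < transitions_needed:
        actions = batched_policy(obs)   # one call serves every environment
        policy_calls += 1
        results = [env.step(a) for env, a in zip(envs, actions)]
        for o, a, (next_o, r, done) in zip(obs, actions, results):
            transitions.append((o, a, r, next_o, done))
        obs = [next_o for next_o, _, _ in results]
    return transitions, policy_calls

_, calls_single = collect(num_envs=1, transitions_needed=1024)
batch, calls_parallel = collect(num_envs=64, transitions_needed=1024)
print(calls_single, calls_parallel, len(batch))
```

The same 1024 transitions need 1024 policy calls with one environment and 16 with sixty-four, so each call to the accelerator does sixty-four times as much useful work.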
Simulation throughput became its own infrastructure problem. A faster simulator (fewer CPU cycles per environment step) is load-bearing in any RL stack. Compiler optimisation for simulation code, JIT-compiled physics engines, GPU-accelerated rasterisation: all forced by RL's hardware-utilisation profile.
Data pipelines became serious engineering for self-supervised training. Feeding a 70B-parameter model with 2T tokens at training rate is a non-trivial I/O and preprocessing problem. The pipeline is bandwidth-shaped and the cluster's storage layer can become the rate-limiter.
Figure 5.1 puts the 4 paradigms on the same page. Each panel shows the input the optimiser sees, the signal it gets back, the density of that signal, and the shape of the resulting behaviour. The contrast makes the meta-claim visible: the optimiser can only move toward information the signal exposes.
In L1's loop, the optimisation step is what learns; the objective determines what it minimises. Different signal, different optimised state. In L2's terms, every paradigm is doing prediction-and-compression on something: supervised predicts labels from inputs; self-supervised predicts parts of input from other parts; RL predicts which actions lead to reward; unsupervised compresses the data distribution itself.
In L3's terms, generalisation depends on whether the regularities the objective rewards match the regularities the deployment task needs. In L4's terms, emergence is signal-shaped: in-context learning and chain-of-thought are products of next-token prediction's particular signal shape, while emergence in RL has a different texture (strategy formation, sudden exploration jumps, sometimes brittle because the discovery happened along one self-determined trajectory).
The optimiser follows the signal. The signal is shaped by the paradigm. The paradigm is shaped by what feedback the world makes available and what you choose to collect or generate.
When you meet a new AI system, the first useful question is: what was the learning signal? The answer tells you what kinds of structure are encodable, what capabilities to expect, and where the failure modes will live.
The toolbox sits on the bench. Four drawers, four signal shapes, four families of representations.
Lesson 6 sits at the maze on the wall above the bench (station 6) and treats reinforcement learning as a foundational paradigm of intelligence in its own right. State, action, reward, policy: the load-bearing pieces of any sequential decision problem, and the substrate that L52 (RLHF) and L70 (reasoning models) will build on later.