PHASE 1 · FOUNDATIONS OF INTELLIGENCE
05 / 78

Learning paradigms and optimisation signals

Lesson 5. Phase 1: Foundations of intelligence. ~24 min read + cards + retrieval. Durability tier 1 (bedrock).

🧰
Memory palace · Bench · station 5
The toolbox. Four drawers (supervised, unsupervised, self-supervised, reinforcement), each holding a different shape of optimisation signal.
Core idea. Optimisation can only carve structures the learning signal exposes, so each learning paradigm builds a different family of representations depending on what feedback the world makes available.

Why this lesson exists

Two AI systems can share the exact same architecture, sit on the same silicon, and run the same kernels, yet produce wildly different behaviour because they were trained with different learning signals. An ImageNet classifier and a CLIP-style vision-language model can use identical backbones; what separates them is what the loss function asked the optimiser to make true.

Modern AI lumps together 4 distinct training regimes: supervised, unsupervised, self-supervised, reinforcement. Each one exposes the optimiser to a different shape of gradient. Choose a different shape, get a different optimised state.

Supervised learning

Supervised learning is the cleanest case. Each training input comes with a target. Predict, measure error, update. Dense, immediate, unambiguous. ImageNet's million-plus labelled images train a classifier to match predicted labels to true ones, with every example contributing a clean gradient each time it is sampled.
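
The predict, measure, update loop can be sketched in a few lines. This is a minimal illustration, not how an ImageNet classifier is trained: it assumes a one-feature logistic model, and the toy data, the learning rate, and the name `train_logistic` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, lr=0.5, epochs=200):
    """Fit p(y=1 | x) = sigmoid(w*x + b) with per-example gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(w * x + b)   # predict
            err = p - y              # measure: gradient of cross-entropy w.r.t. the logit
            w -= lr * err * x        # update
            b -= lr * err
    return w, b

# Hypothetical toy data: one feature, label 1 when the feature is positive.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = train_logistic(data)
predictions = [int(sigmoid(w * x + b) > 0.5) for x, _ in data]
```

Every example carries its own target, so every pass through the inner loop produces a usable gradient. That is the "dense, immediate" property in miniature.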

The cost is labels. ImageNet itself was a multi-year, thousands-of-annotators undertaking. Medical-grade segmentation costs tens of dollars per image. Anything that needs expert judgement bottlenecks at labelling long before it bottlenecks at compute.

Failure case: distribution shift. The 2010-spam classifier from L3 learned the labels it had; the labels didn't include what spammers later invented.

Unsupervised learning

Unsupervised learning has no labels. The signal comes from the data's own structure: clusters, modes, low-dimensional summaries. The win is that anything you can collect, you can train on.

The cost is alignment with what you wanted. K-means on a million product reviews finds clusters, but those clusters might be "long vs short", "English vs Spanish", or "five-star vs one-star". The objective doesn't know which axis matters. It minimises what you wrote down, and what you wrote down rarely matches what you actually want.
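
A small sketch makes the point concrete. It assumes each review is reduced to two hypothetical features, length in words and star rating, with no feature scaling. The `kmeans` helper is a plain Lloyd's-algorithm loop written for this example, not a library call.

```python
def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(iters):
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(pt, centroids[k])))
            for pt in points
        ]
        updated = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid where it was
        centroids = updated
    return labels, centroids

# Hypothetical reviews as (length in words, star rating); no scaling applied.
reviews = [(20, 1), (25, 5), (30, 1), (35, 5), (480, 1), (490, 5), (500, 1), (510, 5)]
labels, centroids = kmeans(reviews, centroids=[reviews[0], reviews[-1]])
stars_in_first_cluster = sorted({stars for (_, stars), lab in zip(reviews, labels) if lab == 0})
```

The clusters split short reviews from long ones, and each cluster mixes one-star and five-star reviews. Length dominates the squared distance, so the objective is satisfied without ever touching sentiment.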

Historically, unsupervised methods alone produced weaker representations than supervised baselines. Useful for preprocessing, anomaly detection, and exploration. Not the engine of modern AI.

Failure case: clusters perfectly tight by the loss function and meaningless for the downstream task.

Self-supervised learning

Self-supervised learning is the synthesis. Take an enormous unlabelled corpus. Make the system predict parts of the input from other parts. Every example contributes a dense supervised-style gradient, but the labels were already in the data.

Next-token prediction is the dominant variant: a GPT-style model takes long text and predicts each next token from the preceding ones. A 2 trillion-token corpus supplies on the order of 10^12 per-token prediction targets with no annotator cost. Masked language modelling and contrastive vision-language training are sibling variants on the same idea: free supervision from data structure.
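
To see where the free labels come from, here is a minimal sketch of turning raw text into training pairs. It assumes whitespace tokenisation for readability (real systems use subword tokenisers), and `next_token_pairs` is a name invented for this example.

```python
def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) pairs, one per predicted position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting is a simplifying assumption, not a real tokeniser.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

first_context, first_target = pairs[0]
last_context, last_target = pairs[-1]
```

A six-token sentence yields five supervised-style examples, and no one had to annotate anything: the target at each position is simply the token that was already there.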

The win is that the labelling bottleneck disappears. Scaling becomes compute-bounded and bandwidth-bounded instead of label-bounded. This is the mechanical reason self-supervised learning rewrote the field around 2018-2020.

The cost is that the proxy task may not match the downstream task. A model excellent at predicting next tokens isn't automatically excellent at answering questions truthfully or following instructions. The whole post-training stack (fine-tuning, RLHF) exists to translate proxy capability into target capability.

Failure case: shortcuts. CLIP latched onto OCR-style text inside images because reading the word "dog" printed in an image is easier than learning what a dog looks like. The gradient went where the structure was easiest, not where the structure was deepest.

Reinforcement learning

Reinforcement learning changes the shape entirely. No labels and no data already in the world. The system (now called an agent) takes actions in an environment and occasionally receives a reward. The signal is sparse, delayed, and shaped by the agent's own choices about what to do next.

An AlphaGo policy network playing Go gets one reward at the end of a roughly 200-move game: win or lose. That single bit has to be propagated back across the trajectory, attributing credit to every decision. This is credit assignment, and it's a large part of why RL is hard.
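
One common way to spread a terminal reward back over a trajectory is the discounted return. The sketch below assumes a discount factor of 0.99 and a single reward of 1 on the final move. It illustrates the general idea and is not a description of how AlphaGo itself was trained.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    g = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

# Hypothetical 200-move game: no feedback until a win (+1) on the final move.
rewards = [0.0] * 199 + [1.0]
returns = returns_to_go(rewards)
```

Here the first move receives about 0.135 of the final reward and the last move receives all of it. Every move gets some credit, but nothing in the signal says which moves actually mattered. That gap is the credit assignment problem.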

Exploration is the second axis. A supervised model is shown what to learn from. An RL agent has to discover what to learn from by trying things. Too much exploration and you learn nothing; too much exploitation of current strategy and you miss better ones. The trade-off is tuned per problem.
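
The trade-off shows up even in a three-armed bandit. The sketch below uses epsilon-greedy selection, one simple exploration scheme among many. The fixed arm rewards, step count, and seed are hypothetical choices made so the behaviour is easy to follow.

```python
import random

def run_bandit(arm_rewards, epsilon, steps=1000, seed=0):
    """Epsilon-greedy: explore a random arm with probability epsilon, else exploit."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_rewards)
    counts = [0] * len(arm_rewards)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_rewards))                           # explore
        else:
            arm = max(range(len(arm_rewards)), key=lambda a: estimates[a])  # exploit
        r = arm_rewards[arm]
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean
        total += r
    best = max(range(len(arm_rewards)), key=lambda a: estimates[a])
    return total, best

arms = [0.2, 0.5, 0.8]  # hypothetical fixed payoffs; arm 2 is the best
greedy_total, greedy_best = run_bandit(arms, epsilon=0.0)
mixed_total, mixed_best = run_bandit(arms, epsilon=0.1)
random_total, random_best = run_bandit(arms, epsilon=1.0)
```

With `epsilon=0.0` the agent tries arm 0 first, sees a positive payoff, and never looks elsewhere. With `epsilon=1.0` it learns which arm is best but never uses that knowledge. The intermediate setting discovers arm 2 and then spends most of its time there.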

The win is that RL fits when the right action can only be discovered, not labelled: games, robot locomotion, dialogue policies, RLHF post-training of language models. The cost is sample efficiency. A naive RL agent in a complex environment may need billions of interactions to reach competence comparable to a supervised vision model trained on millions of examples. Modern RL leans on simulation: reproduce the environment in software and train far faster than wall-clock time, with speed-ups on the order of 1000× in favourable cases.

Failure case: reward hacking. The agent finds a way to maximise the reward signal that doesn't match what the designer intended. Vacuum robots told to "cover floor area" can spin in place. RLHF models told to "be helpful" can become sycophantic. The optimiser is doing exactly what the signal asked.
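
The vacuum example can be made concrete with a toy reward. Everything in the sketch below is hypothetical: the reward pays one point per step spent on the floor and charges a small energy cost per move, which is one plausible way a "cover floor area" signal could be mis-specified.

```python
def proxy_reward(path, move_cost=0.1):
    """Hypothetical mis-specified reward: +1 per step on the floor, minus energy per move."""
    steps_on_floor = len(path)
    moves = sum(1 for a, b in zip(path, path[1:]) if a != b)
    return steps_on_floor - move_cost * moves

def intended_coverage(path):
    """What the designer actually wanted: number of distinct cells cleaned."""
    return len(set(path))

# Two 20-step behaviours on a 5 x 4 grid.
spin_in_place = [(0, 0)] * 20
sweep_room = [
    (x, y)
    for y in range(4)
    for x in (range(5) if y % 2 == 0 else reversed(range(5)))  # serpentine sweep
]

spin_score = proxy_reward(spin_in_place)
sweep_score = proxy_reward(sweep_room)
```

Spinning scores higher on the proxy while cleaning a single cell, and the sweep cleans all twenty cells yet scores lower. An optimiser that prefers spinning is not malfunctioning. It is reading the signal correctly.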

Signal density

Each paradigm produces a different shape of gradient information per training step.

Self-supervised is the densest: every token serves as the label for the prediction made at the position before it. A 2T-token corpus generates roughly 2T supervised-style training targets without paying for any of them. Supervised is dense but bounded: 1M labels give roughly 1M per-example signals per epoch. RL is sparse: one scalar at the end of an episode, distributed back across hundreds of decisions. Unsupervised is dense but informationally weak: it tells you which group a point belongs to, not what about the group matters.
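
The comparison is easier to hold in mind as arithmetic. The figures below are the ones used in this lesson (a 2T-token corpus, 1M labels, a 200-move game), and the variable names are invented for the sketch.

```python
corpus_tokens = 2 * 10**12   # self-supervised: roughly one target per token
labelled_images = 10**6      # supervised: one target per labelled example
moves_per_game = 200         # RL: one terminal reward per episode

targets_self_supervised = corpus_tokens
targets_supervised = labelled_images
rewards_per_decision_rl = 1 / moves_per_game

# How many more targets one pass over the corpus gives than one epoch of labels.
ratio = targets_self_supervised // targets_supervised
```

One pass over the corpus supplies two million times as many targets as one epoch over the labelled set, while the RL agent receives one reward for every 200 decisions it makes.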

Data economics

The systems view of why the field looks the way it does.

Pre-2017, supervised learning dominated. Labelling was the bottleneck. Architectures evolved to extract maximum signal from limited labels.

2017-2020, self-supervised learning at scale broke that bottleneck. Compute and data quality became the constraints. Models exploded in size. The transformer won partly on merit, partly because it parallelises well enough to absorb the data flood.

2020 onward, RL post-training closed the loop. Self-supervision builds the base representations. Supervised fine-tuning shapes them toward useful behaviour. RL optimises against the actual objective. Modern LLM pipelines use 3 of the 4 paradigms.

Synthetic data sits at the intersection. When real data is expensive (medical, robotics) or sensitive (legal, defence), generated examples let you pay compute instead of labelling cost. The trade-off is whether the synthetic distribution matches reality closely enough to transfer.

Hardware interaction

Each paradigm has a different relationship with silicon. Self-supervised maps cleanly to GPU throughput. Long sequences, dense per-token loss, contiguous matmuls. The hardware stays busy. At scale, training is often bandwidth-limited at the interconnect rather than compute-limited at the cores.

RL often underutilises hardware. Environment steps are sequential, frequently CPU-bound, capped by simulator speed. The GPU sits idle waiting for experience. The systems fix is hundreds or thousands of parallel simulators feeding batched experience to the GPU at throughput. Half of modern RL infrastructure is plumbing for this.
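
The batching pattern can be sketched without any RL library. The sketch assumes a hypothetical toy environment, `CounterEnv`, and a stand-in `batched_policy`. In a real stack the environments would run in parallel processes or on the accelerator, whereas here a plain loop stands in for them.

```python
class CounterEnv:
    """Hypothetical toy environment: the state counts up and the episode ends at 5."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action
        done = self.state >= 5
        reward = 1.0 if done else 0.0
        if done:
            self.state = 0  # auto-reset so the simulator never sits idle
        return self.state, reward, done

def batched_policy(observations):
    """Stand-in for one batched forward pass: returns one action per observation."""
    return [1 for _ in observations]

def collect(envs, steps):
    """Step every simulator in lockstep, calling the policy once per step."""
    obs = [0] * len(envs)
    transitions = []
    for _ in range(steps):
        actions = batched_policy(obs)  # one policy call serves all simulators
        results = [env.step(a) for env, a in zip(envs, actions)]
        transitions.extend(results)
        obs = [o for o, _, _ in results]
    return transitions

envs = [CounterEnv() for _ in range(8)]
batch = collect(envs, steps=10)
total_reward = sum(r for _, r, _ in batch)
```

Ten policy calls yield eighty transitions. Adding simulators raises the experience gathered per policy call without raising the number of calls, which is the property the plumbing exists to exploit.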

Simulation throughput became its own infrastructure problem. A faster simulator (fewer CPU cycles per environment step) is load-bearing in any RL stack. Compiler optimisation for simulation code, JIT-compiled physics engines, GPU-accelerated rasterisation: all forced by RL's hardware-utilisation profile.

Data pipelines became serious engineering for self-supervised training. Feeding a 70B-parameter model with 2T tokens at training rate is a non-trivial I/O and preprocessing problem. The pipeline is bandwidth-shaped and the cluster's storage layer can become the rate-limiter.

Four signals, side by side

Figure 5.1 puts the 4 paradigms on the same page. Each panel shows the input the optimiser sees, the signal it gets back, the density of that signal, and the shape of the resulting behaviour. The contrast makes the meta-claim visible: the optimiser can only move toward information the signal exposes.

[Figure 5.1. Four panels: SUPERVISED (labelled pairs), UNSUPERVISED (structure only), SELF-SUPERVISED (predict held-out parts), REINFORCEMENT (sparse delayed reward). Each panel shows input, signal, density, feedback timing, and behaviour.]
FIG 5.1. Four paradigms side by side. Same template: input, signal, density, feedback timing, behaviour shape. Filled density dots mark dense signal; open dots mark sparse. The optimiser can only carve structures the signal exposes; each panel shows what each paradigm exposes.

The L1 to L4 view

In L1's loop, the optimisation step is what learns; the objective determines what it minimises. Different signal, different optimised state. In L2's terms, every paradigm is doing prediction-and-compression on something: supervised predicts labels from inputs; self-supervised predicts parts of input from other parts; RL predicts which actions lead to reward; unsupervised compresses the data distribution itself.

In L3's terms, generalisation depends on whether the regularities the objective rewards match the regularities the deployment task needs. In L4's terms, emergence is signal-shaped: in-context learning and chain-of-thought are products of next-token prediction's particular signal shape, while emergence in RL has a different texture (strategy formation, sudden exploration jumps, sometimes brittle because the discovery happened along one self-determined trajectory).

The takeaway

The optimiser follows the signal. The signal is shaped by the paradigm. The paradigm is shaped by what feedback the world makes available and what you choose to collect or generate.

When you meet a new AI system, the first useful question is: what was the learning signal? The answer tells you what kinds of structure are encodable, what capabilities to expect, and where the failure modes will live.

The toolbox sits on the bench. Four drawers, four signal shapes, four families of representations.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L5 You want to build a system that detects fraudulent credit-card transactions in real time. Compare how you'd approach this with (a) supervised learning, (b) self-supervised learning, and (c) reinforcement learning. For each, describe the signal you'd give the optimiser, the practical bottleneck, and one failure mode the chosen paradigm would expose.
(a) Supervised. Collect transactions labelled fraud / not-fraud by investigators or chargeback records. The signal is dense per example (every transaction is a labelled pair). Bottleneck: labels are scarce and noisy (some fraud is never detected; some labelled fraud is a false alarm). Failure mode: distribution shift when fraudsters change tactics, because the model only learned the regularities present in past labels.
(b) Self-supervised. Train a model to predict held-out fields of a transaction from the others (predict the merchant from card-id, time, amount; predict the next transaction from history). The signal is dense per transaction and free. Bottleneck: the proxy task is "predict normal transaction structure," which is one step removed from "detect fraud." You then build a downstream fraud head using a smaller labelled set, treating high prediction error as a fraud signal. Failure mode: shortcuts in the proxy that don't generalise (e.g. the model learns merchant codes as identity giveaways and fails on novel merchants).
(c) Reinforcement learning. Treat the system as an agent that takes actions (allow, hold for review, block) and gets reward later (correct block: positive; false block annoying a legitimate customer: negative). The signal is sparse and delayed (reward arrives only after the chargeback period). Bottleneck: real-world exploration is hostile (every "try blocking this" is a customer-impact decision). You'd need a simulator of the customer-fraudster ecosystem to train at scale. Failure mode: reward hacking; the agent learns to never block, because blocking has visible negative consequences and missed fraud has diffuse ones.
Taken together, the 3 suggest why production fraud systems often combine self-supervised pretraining with supervised heads, with RL used sparingly inside controlled feedback loops.
L5 A team trains a vision-language model with contrastive self-supervised learning on internet image-caption pairs and gets strong performance on benchmarks. Their colleague trains the same architecture with supervised classification on a curated medical-image dataset and gets strong in-domain performance but poor transfer. Without using any maths, explain mechanistically why the 2 systems end up with such different transfer profiles, in terms of what each optimisation signal exposed to the model.
The contrastive self-supervised model received a signal that rewarded representations matching across many image-text pairs spanning hundreds of millions of examples and thousands of domains. The optimiser was pushed toward features that survive distribution shift, because the training distribution was the variation; nothing in the loss function privileged any specific domain. The resulting representations encode broadly useful structure (object identity, scene type, text-image correspondence) because that's what made the contrastive loss low across the heterogeneous training distribution. The supervised medical model received a signal that rewarded predicting specific medical labels on a narrow curated dataset. The optimiser was pushed toward features that discriminate within that specific distribution; nothing in the loss function rewarded representations that would survive a different imaging modality, hospital, or patient population. The resulting representations encode high-resolution within-distribution structure (the specific cues that separate condition A from condition B in that scanner) and ignore everything else. Transfer comes from breadth of signal in training. The contrastive model got breadth for free from the proxy task and the corpus. The supervised model got depth without breadth because the labels paid for narrow precision instead.
↳ L6 (Forward interleave to L6, sequential decision making and reward.) Take the RL fraud-detection sketch from Q1. Without using any RL maths, identify which structural features of the problem make it a sequential decision-making problem (rather than a classification problem), what role the notion of "state" plays, how reward delay affects the learning signal, and what changes if the system has to act on a transaction in 50 milliseconds rather than after seeing the next month of transactions.
The classification framing treats each transaction in isolation: input → label. The sequential framing treats the system as making a series of decisions over time, each of which both produces an immediate effect and changes the situation in which the next decision will be made. The "state" is everything the agent needs to know to decide the next action: current transaction features, recent transaction history for this card, current fraud-rate trends, the agent's own recent block decisions and their outcomes. State carries the past forward in a compact form so the current action can be conditioned on it. Reward delay is the wedge between action and feedback: blocking a transaction now produces no immediate reward; the chargeback evidence (or customer complaint) arrives days later. The agent has to commit to a policy without knowing for tens or hundreds of decisions whether it was right. Credit assignment across that delay is the hard part. The 50 ms latency change is the real-time constraint: the policy has to be cheap to evaluate at inference time, which limits how much state can be processed and how complex the policy can be. RL training is offline (or in simulation) and may be expensive; the trained policy is what runs in 50 ms. This sets up L6, which formalises state, action, reward, and policy and treats the trade-offs between immediate and long-run consequences as a first-class problem.

Next station

Lesson 6 sits at the maze on the wall above the bench (station 6) and treats reinforcement learning as a foundational paradigm of intelligence in its own right. State, action, reward, policy: the load-bearing pieces of any sequential decision problem, and the substrate that L52 (RLHF) and L70 (reasoning models) will build on later.