A system that only memorises cannot survive contact with a changing world. The training set is always finite; the world the system meets afterwards is not. Without the ability to extract regularities that transfer, every novel input becomes either a near-match to something in storage (lucky) or a guess (random).
Modern AI's leap from "impressive demos" to "useful tools" hinged on this distinction. Systems whose internal representations capture reusable structure work outside the distribution they were trained on. Systems that just store examples don't.
A lookup table is the limit case of memorisation. Given an input that exactly matches an entry, it returns the stored output. Given anything else, it fails. Useful for hash tables. Useless for intelligence.
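A minimal sketch of that limit case, assuming a made-up task where the output happens to be the sum of two inputs. The names `training_pairs`, `memoriser`, and `generaliser` are illustrative, not from any library.

```python
# Memorisation: a table of the (input, output) pairs seen so far.
training_pairs = {(1, 2): 3, (2, 5): 7, (10, 4): 14}

def memoriser(a, b):
    """Return the stored answer, or None when the input was never seen."""
    return training_pairs.get((a, b))

def generaliser(a, b):
    """The regularity the stored examples share: the output is the sum."""
    return a + b

print(memoriser(2, 5), generaliser(2, 5))  # seen input: 7 7
print(memoriser(3, 9), generaliser(3, 9))  # unseen input: None 12
```

The table is exact on what it has stored and silent on everything else. The rule is one line long and covers inputs the table never held.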
A lookup-table-shaped system can still look impressive briefly. ELIZA, the 1960s chatbot, was essentially template matching. It looked like understanding because conversation has heavy structure and templates can match its surface. Push slightly outside the patterns and the system collapses.
The distinction this lesson installs: memorisation stores examples; generalisation extracts the regularities those examples share. The regularities are what transfer. The examples don't.
A system processing data falls into one of three modes:
Modern AI cares about mode 3.
Chess openings. A novice memorises a few opening lines: Ruy Lopez moves 1-8. They play well in those openings and freeze when the opponent deviates on move 4. A strong player has memorised some openings and extracted abstractions: control the centre, develop pieces before pawns, watch the diagonals. The abstractions handle the move-4 deviation; the memorised lines don't.
Spam filters. Early spam filters were keyword blacklists. See "Viagra" → mark as spam. Spammers adjusted with "V1agra" → reset the blacklist. The arms race ran until filters built on character-level patterns, syntactic features, and sender reputation. The abstract features catch spam regardless of which specific spelling tricks the spammers use this week. The blacklist was memorisation; the modern filter is generalisation.
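The blacklist half of that arms race can be sketched in a few lines. This is a toy, not how production filters work: the `LEET` substitution table is a hand-written, hypothetical stand-in for the character-level features a real filter would learn, and both function names are made up for illustration.

```python
import re

BLACKLIST = {"viagra"}

# Hypothetical table of common character swaps (an assumption for this sketch).
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})

def tokens(message):
    """Lower-case the message and split it into word-like tokens."""
    return re.findall(r"[a-z0-9@$]+", message.lower())

def blacklist_filter(message):
    """Memorisation: flag only tokens that match a stored spelling exactly."""
    return any(tok in BLACKLIST for tok in tokens(message))

def normalising_filter(message):
    """One step towards generalisation: undo known swaps before matching."""
    return any(tok.translate(LEET) in BLACKLIST for tok in tokens(message))

for text in ["Buy Viagra now", "Buy V1agra now", "Meeting at 10"]:
    print(text, blacklist_filter(text), normalising_filter(text))
```

The normalising filter only covers the swaps in its table, so a new trick still gets through. That gap is the reason the text points to learned features and sender reputation rather than longer hand-written lists.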
Image classifiers. A network trained on ImageNet learns to recognise cats. If it only memorised the training cats, it would fail on any cat photographed from a new angle, in new lighting, in a new pose. The fact that modern CNNs and vision transformers handle new cats well is empirical evidence that they have, somewhere in their weights, an abstract notion of "cat-shape" that survives the surface variations.
AlphaGo and AlphaZero. The original AlphaGo learned in part from human games (some memorisation of human opening strategies). AlphaZero learned from self-play with zero human games and ended up stronger. Why? Because AlphaZero's network had to generalise from the structure of the game itself, not from the surface patterns humans happened to play. With no human games to memorise, it built abstractions about position value and move choice that transferred to positions humans had never reached.
Branch predictors and cache prefetchers. You met these in L2. They generalise too. A branch predictor doesn't have a stored entry for every conditional in every program; there are far more programs than predictor table entries. It uses small histories and aliasing to apply learned patterns across new programs. The predictor that does well across the SPEC benchmark suite is the one whose internal abstractions transfer to programs it was never tuned on. Hardware has been doing generalisation in tiny budgets for decades.
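To make the "small histories and aliasing" point concrete, here is a sketch of one common textbook scheme: a table of 2-bit saturating counters indexed by the low bits of the branch address. It is an assumption that this matches what L2 described; the class name, table size, and the trace are all hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low address bits."""

    def __init__(self, table_size=16):
        self.table_size = table_size
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * table_size

    def _index(self, pc):
        # Many branch addresses alias to the same counter.
        return pc % self.table_size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor, trace):
    """Predict each branch, then train on its real outcome."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)

# Hypothetical trace: a loop branch at 0x40, taken nine times then not taken.
loop_trace = [(0x40, i % 10 != 9) for i in range(1000)]

predictor = TwoBitPredictor()
print(accuracy(predictor, loop_trace))
# 0x50 was never seen, but it shares a counter with 0x40 and inherits its pattern.
print(predictor.predict(0x50))
```

Sixteen counters cannot store a program. They store a habit ("branches that land here are usually taken") and apply it to any address that maps to the same slot, which helps when the habit transfers and hurts when it does not.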
LLMs. A frontier language model trained on trillions of tokens does both things. Some prompts produce near-verbatim memorised content from training. Most prompts produce novel sequences that didn't exist in the training data but match its style, syntax, and reasoning patterns. The interesting half is the second. Memorisation is the failure mode that researchers actively try to detect; generalisation is the goal.
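One crude way to look for verbatim memorisation is to count how many of an output's word n-grams also appear in the training text. The sketch below is a toy illustration of that idea, not a description of how any lab actually audits a model; `verbatim_overlap`, the window size, and the tiny corpus are all assumptions.

```python
def ngrams(words, n):
    """All runs of n consecutive words, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output, corpus, n=4):
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    out = ngrams(output.split(), n)
    if not out:
        return 0.0  # output shorter than n words
    return len(out & ngrams(corpus.split(), n)) / len(out)

corpus = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
novel = "a lazy dog jumps over the quick brown cat"

print(verbatim_overlap(copied, corpus))
print(verbatim_overlap(novel, corpus))
```

The second sentence reuses the corpus vocabulary and grammar without repeating any four-word run, which is the small-scale version of "novel sequences that match the style".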
L2 said compression and prediction are two views of one operation. Generalisation is the same operation with a transfer requirement attached. A system that compresses well on training data has found regularities. Whether those regularities transfer depends on whether they reflect the actual underlying structure of the data or just the structure of the specific training sample.
A short description that explains all the training data but fails on new data is overfitting. A short description that explains training data and extrapolates well to new data is generalisation. The shorter the explanation, all else equal, the more likely it's the second kind. This is the deep reason Occam's razor works in practice for machine learning.
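A small sketch of the short-versus-long description contrast, under made-up assumptions: the data follow y = 2x + 1 with a fixed list of small perturbations standing in for noise, the "long description" is a table of every training point answered by nearest neighbour, and the "short description" is a two-number least-squares line. All names are illustrative.

```python
# Hypothetical data: y = 2x + 1 plus fixed perturbations (deterministic "noise").
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
train = [(float(x), 2.0 * x + 1.0 + n) for x, n in zip(range(10), noise)]
# Unseen inputs halfway between training inputs, with noise-free targets.
test = [(x + 0.5, 2.0 * (x + 0.5) + 1.0) for x in range(9)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def table(x):
    """Long description: keep every point, answer with the nearest stored one."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

slope, intercept = fit_line(train)

def line(x):
    """Short description: two numbers."""
    return slope * x + intercept

def mse(predict, points):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

print("train error  table:", mse(table, train), " line:", mse(line, train))
print("test error   table:", mse(table, test), " line:", mse(line, test))
```

The table reproduces the training data exactly, noise included, and pays for it on new inputs. The line gets the training data slightly wrong and the new inputs nearly right, because two numbers have no room to store the noise.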
The choice of model architecture and training procedure imposes an inductive bias: an assumption about which regularities are likely. A CNN's architecture has spatial locality built in. Nearby pixels are related; weights are shared across positions; translation equivariance is structural. This bias makes CNNs excellent at images and poor at things where spatial locality doesn't hold. Transformers have a much weaker bias (any token can attend to any other), which is why they require vastly more data and compute, and why they generalise across many domains once given that data.
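Weight sharing and translation equivariance can be checked directly on a one-dimensional toy. The sketch below assumes a hand-picked two-tap edge kernel and a short list as the "image"; `conv1d` and `shift_right` are illustrative helpers, not a library API.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: the same weights are reused at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, steps):
    """Shift a list right by `steps`, padding with zeros on the left."""
    return [0] * steps + values[:len(values) - steps]

edge_kernel = [-1, 1]  # hypothetical edge detector
signal = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]

out = conv1d(signal, edge_kernel)
out_of_shifted = conv1d(shift_right(signal, 3), edge_kernel)

print(out)
print(out_of_shifted)
# Shifting the input shifts the output by the same amount.
print(out_of_shifted == shift_right(out, 3))
```

The kernel never learned "an edge at position 1" and "an edge at position 4" as separate facts. One pair of weights covers every position, which is the head start the architecture hands the model. The equality holds here because the shifted pattern stays away from the boundary.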
Inductive bias is the difference between "the system found a pattern because we gave it a head start" and "the system found a pattern from scratch." The right bias for the problem makes generalisation cheap; the wrong bias makes it nearly impossible.
A model trained on one distribution often fails on another, even when the underlying task is the same. A robot policy trained in simulation often fails on the real robot because the simulator's pixels and physics are different from reality, even if the task ("pick up the cube") is the same. A medical AI trained on one hospital's scans often fails on another's because the imaging machines differ. A spam filter trained on 2010 spam fails on 2025 spam because spammers adapted.
Call this a failure of assumption rather than a failure of model. The model did what it was trained to do: minimise error on its training distribution. The mismatch is between that distribution and the world the model later met. This phenomenon is called distribution shift, and many production AI failures are some form of it.
The fixes are some combination of: more diverse training data (more distributions in training), better inductive bias (so the model relies on structure that's stable across distributions), or explicit adaptation at test time. The principle: a model is only as general as the variation it has actually seen, plus whatever inductive bias bridges it to variations it hasn't.
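A toy version of the two-hospital story, as a sketch under stated assumptions: a single made-up feature, a threshold classifier, and a second "scanner" that offsets every reading by a constant. The adaptation shown (re-centring each batch on its own mean) is just one simple form of test-time adaptation, and it quietly assumes both batches have the same class balance.

```python
def fit_threshold(samples):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, label in samples if label]
    neg = [x for x, label in samples if not label]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, samples):
    return sum((x > threshold) == label for x, label in samples) / len(samples)

def centre(samples):
    """Subtract the batch mean from every reading (assumes similar class balance)."""
    mean = sum(x for x, _ in samples) / len(samples)
    return [(x - mean, label) for x, label in samples]

# Hypothetical task: negatives sit around 1, positives around 3.
train = [(0.8, False), (1.0, False), (1.2, False),
         (2.8, True), (3.0, True), (3.2, True)]
# Same task, different "scanner": every reading is offset by +2.5.
shifted = [(x + 2.5, label) for x, label in train]

threshold = fit_threshold(train)
print(accuracy(threshold, train))    # familiar distribution
print(accuracy(threshold, shifted))  # shifted distribution

centred_threshold = fit_threshold(centre(train))
print(accuracy(centred_threshold, centre(shifted)))  # shifted, after adaptation
```

The classifier is unchanged between the first two lines; only the inputs moved. Re-centring works here because the offset was the only thing that changed, which is exactly the kind of structure a good inductive bias or adaptation step has to guess correctly.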
Figure 3.1 contrasts the three modes against the same kind of test: a familiar input and a shifted input. The contrast makes the cost of memorisation under variation visible.
In the L1 loop, the representation is where generalisation happens. The whole point of building an internal representation is that it captures the regularities you want to transfer. The optimisation step shapes the representation; the objective function determines which regularities get captured.
A bad objective (for example, "minimise error on training data, end of story") will produce a representation that memorises. A good objective plus the right inductive bias produces a representation that abstracts. This is why the field cares so much about training objectives, regularisation, dropout, weight decay, data augmentation, and the rest of the training toolkit. Each is an attempt to keep the representation from collapsing into memorisation when it could be doing abstraction.
Modern AI became powerful when systems stopped merely storing examples and began building transferable internal representations. The shift wasn't a new algorithm. It was the realisation that the same prediction objective, applied to enough data with the right inductive bias and enough compute, produces a model whose representations abstract rather than memorise. Everything in Phase 2 onwards (gradients, optimisation landscapes, loss functions, scaling laws) is the apparatus that makes this happen reliably.
The blank page taught you that prediction is the engine. The folded map teaches you that the engine only matters if what it builds inside survives the journey out of the training distribution.
Lesson 4 sits at the dust on the bench (station 4) and looks at the strangest fact in this section. Capabilities don't always appear smoothly as systems scale; sometimes more becomes different. That's emergence, and the honest treatment of it is L4.