Generalisation, memorisation, and abstraction

Lesson 3. Phase 1: Foundations of intelligence. ~24 min read + cards + retrieval. Durability tier 1 (bedrock).

🗺️

Memory palace · Bench · station 3

The folded map. A compressed representation of terrain that survives travelling through unfamiliar places.

Core idea. Generalisation is the ability to exploit underlying structure and transfer it to unseen cases. Memorisation stores specific examples without extracting the deeper regularities that would survive contact with new data.

Why this lesson exists

A system that only memorises cannot survive contact with a changing world. The training set is always finite; the world the system meets afterwards is not. Without the ability to extract regularities that transfer, every novel input becomes either a near-match to something in storage (lucky) or a guess (random).

Modern AI's leap from "impressive demos" to "useful tools" hinged on this distinction. Systems whose internal representations capture reusable structure can keep working outside the distribution they were trained on. Systems that just store examples cannot.

The lookup-table limit

A lookup table is the limit case of memorisation. Given an input that exactly matches an entry, it returns the stored output. Given anything else, it fails. Useful for caching. Useless for intelligence.
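A minimal sketch of that limit case, assuming a toy "training set" of four stored addition facts. The names `LOOKUP`, `lookup_add`, and `rule_add` are hypothetical and exist only for this illustration.

```python
# Hypothetical toy data: four stored question/answer pairs.
LOOKUP = {
    (2, 3): 5,
    (10, 4): 14,
    (7, 7): 14,
    (0, 9): 9,
}

def lookup_add(a, b):
    """Memoriser: returns the stored answer, or None when the input is new."""
    return LOOKUP.get((a, b))

def rule_add(a, b):
    """Generaliser: applies the regularity that the stored pairs share."""
    return a + b

seen = (2, 3)      # exact match to a stored entry
unseen = (3, 2)    # same structure, different surface form

print(lookup_add(*seen), rule_add(*seen))
print(lookup_add(*unseen), rule_add(*unseen))
```

The memoriser answers the stored pair and returns nothing for the swapped pair, even though the two inputs differ only in order. The rule handles both because it never depended on the specific entries.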

A lookup-table-shaped system can still look impressive briefly. ELIZA, the 1960s chatbot, was essentially keyword and template matching. It looked like understanding because conversation is heavily structured and templates can reproduce its surface. Push slightly outside the patterns and the illusion collapses.

The distinction this lesson installs: memorisation stores examples; generalisation extracts the regularities those examples share. The regularities are what transfer. The examples don't.

Three modes

A system processing data falls into one of three modes:

1. Stores examples. Each training input sits in memory, intact. New inputs get a nearest-match treatment. Works for problems where the test set is the training set.
2. Compresses patterns. Recurring fragments get short codes. The system finds repeated motifs but doesn't extract why they recur. Works for many compression tasks and for prediction inside the training distribution.
3. Builds reusable abstractions. The system captures the underlying generators of the data, not just the surface patterns. Abstract representations survive variation in surface form. New inputs that share the underlying structure but not the surface form get handled correctly.

Modern AI cares about mode 3.

Examples once you see the pattern

Chess openings. A novice memorises a few opening lines: Ruy Lopez moves 1-8. They play well in those openings and freeze when the opponent deviates on move 4. A strong player has memorised some openings and extracted abstractions: control the centre, develop pieces before pawns, watch the diagonals. The abstractions handle the move-4 deviation; the memorised lines don't.

Spam filters. Early spam filters were keyword blacklists. See "Viagra" → mark as spam. Spammers switched to "V1agra", and the blacklist had to be updated all over again. The arms race ran until filters were built on character-level patterns, syntactic features, and sender reputation. The abstract features catch spam regardless of which specific spelling tricks the spammers use this week. The blacklist was memorisation; the modern filter is generalisation.
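The blacklist arms race can be sketched in a few lines. This is a toy, not how production filters work: the `BLACKLIST` set, the `LOOKALIKES` substitution table, and both function names are assumptions made up for the example, and a single normalisation step stands in for the much richer features real filters use.

```python
BLACKLIST = {"viagra"}  # hypothetical keyword list

# Assumed look-alike substitutions; chosen by hand for this illustration.
LOOKALIKES = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})

def blacklist_filter(message):
    """Memorisation: flags a message only if a stored string appears verbatim."""
    return any(word in BLACKLIST for word in message.lower().split())

def normalised_filter(message):
    """One step toward abstraction: undo common disguises before matching."""
    cleaned = message.lower().translate(LOOKALIKES)
    # Keep letters and spaces only, so padding like "v.i.a.g.r.a" collapses too.
    letters = "".join(ch for ch in cleaned if ch.isalpha() or ch.isspace())
    return any(word in BLACKLIST for word in letters.split())

for text in ("Buy Viagra now", "Buy V1agra now", "Buy v.i.a.g.r.a now", "Lunch at 1pm?"):
    print(text, blacklist_filter(text), normalised_filter(text))
```

The normalised filter still depends on a keyword, so it is only a small move away from memorisation. The point is the direction: the less a feature depends on exact spelling, the fewer disguises defeat it.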

Image classifiers. A network trained on ImageNet learns to recognise cats. If it only memorised the training cats, it would fail on any cat photographed from a new angle, in new lighting, in a new pose. The fact that modern CNNs and vision transformers handle new cats well is empirical evidence that they have, somewhere in their weights, an abstract notion of "cat-shape" that survives the surface variations.

AlphaGo and AlphaZero. The original AlphaGo was bootstrapped in part from human expert games, so some of what it learned was imitation of human play. AlphaZero learned from self-play with zero human games and ended up stronger. One plausible reading: with no human games to imitate, AlphaZero's network had to generalise from the structure of the game itself, not from the surface patterns humans happened to play. It built abstractions about position value that transferred to positions humans had never reached.

Branch predictors and cache prefetchers. You met these in L2. They generalise too. A branch predictor doesn't have a stored entry for every conditional in every program; there are far more branches than predictor table entries. It keeps short branch histories in small tables that many branches share (aliasing), so patterns learned on one stretch of code get applied to branches it has never seen. The predictor design that does well across the SPEC benchmark suite is the one whose internal abstractions transfer to programs it was never tuned on. Hardware has been doing generalisation on tiny budgets for decades.
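A minimal sketch of the idea, using the textbook scheme of 2-bit saturating counters indexed by branch address modulo table size. The table size, the addresses, and the loop trace are all hypothetical, and real predictors are considerably more elaborate.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters shared by all branches that alias to an entry."""

    def __init__(self, table_size=4):
        self.table_size = table_size
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * table_size

    def _index(self, branch_address):
        return branch_address % self.table_size

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor, trace):
    """Predict each branch, then train on the real outcome, as hardware does."""
    hits = 0
    for address, taken in trace:
        hits += predictor.predict(address) == taken
        predictor.update(address, taken)
    return hits / len(trace)

# Hypothetical trace: a loop branch at 0x40, taken 9 times then not taken, repeated.
loop_trace = [(0x40, i % 10 != 9) for i in range(100)]

predictor = TwoBitPredictor()
print(accuracy(predictor, loop_trace))
# 0x44 has never been seen, but it aliases to the same entry as 0x40.
print(predictor.predict(0x44))
```

The predictor stores no entry for the branch at `0x44`, yet it predicts "taken" because that branch shares a counter with the loop branch. When the two branches behave alike, the sharing is free generalisation; when they behave differently, the same sharing causes mispredictions.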

LLMs. A frontier language model trained on trillions of tokens does both things. Some prompts produce near-verbatim memorised content from training. Most prompts produce novel sequences that didn't exist in the training data but match its style, syntax, and reasoning patterns. The interesting half is the second. Memorisation is the failure mode that researchers actively try to detect; generalisation is the goal.
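One crude way to picture memorisation detection is verbatim n-gram overlap between a model output and the training text. The sketch below is an assumption for illustration, not a description of how any lab audits its models; `ngrams`, `verbatim_overlap`, and the tiny corpus are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(output, corpus, n=5):
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    out = ngrams(output.split(), n)
    if not out:
        return 0.0
    seen = ngrams(corpus.split(), n)
    return len(out & seen) / len(out)

# Hypothetical one-sentence "training corpus".
corpus = "the quick brown fox jumps over the lazy dog near the river bank"

regurgitated = "the quick brown fox jumps over the lazy dog"
novel = "a lazy fox near the bank jumps over brown dogs"

print(verbatim_overlap(regurgitated, corpus))
print(verbatim_overlap(novel, corpus))
```

A score near 1 suggests the output was copied; a score near 0 says only that it was not copied word for word. Paraphrased memorisation would slip past a check this simple.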

Abstraction as compression of regularities

L2 said compression and prediction are two views of one operation. Generalisation is the same operation with a transfer requirement attached. A system that compresses well on training data has found regularities. Whether those regularities transfer depends on whether they reflect the actual underlying structure of the data or just the structure of the specific training sample.

A description that fits all the training data but fails on new data is overfitting; in practice it tends to be a long one that encodes the quirks of the sample. A short description that explains the training data and also holds up on new data is generalisation. The shorter the explanation, all else equal, the more likely it's the second kind. This is one reason Occam's razor works in practice for machine learning.
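One standard way to make "shorter explanation" concrete is the two-part code from the minimum description length view. The symbols here are assumptions introduced for this note, not notation from the lesson: H is a hypothesis (the model), D is the training data, and L(·) is a length in bits.

```latex
L_{\text{total}}(D) \;=\; \underbrace{L(H)}_{\text{bits to describe the model}} \;+\; \underbrace{L(D \mid H)}_{\text{bits to describe what the model gets wrong}}
```

A pure memoriser makes the second term vanish by inflating the first: the model is the data. An abstraction keeps both terms small, which is only possible when the data really has structure. Treat this as a heuristic for preferring hypotheses, not a guarantee that the shortest one will transfer.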

Inductive bias

The choice of model architecture and training procedure imposes an inductive bias: an assumption about which regularities are likely. A CNN's architecture has spatial locality built in. Nearby pixels are related; weights are shared across positions; translation equivariance is structural. This bias makes CNNs excellent at images and poor at things where spatial locality doesn't hold. Transformers have a much weaker bias (any token can attend to any other), which is one reason they tend to need far more data and compute, and why they generalise across many domains once given that data.
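Translation equivariance from weight sharing can be checked directly in one dimension. This is a minimal sketch in plain Python: the kernel, the signal, and the helper names are hypothetical, and the property holds here only away from the edges, where zero padding and truncation don't interfere.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: the same kernel weights are applied at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, steps):
    """Shift with zero padding on the left; values pushed off the right end are dropped."""
    return [0] * steps + values[:len(values) - steps]

edge_kernel = [-1, 1]                     # hypothetical edge detector
signal = [0, 0, 1, 1, 1, 0, 0, 0]

response = conv1d(signal, edge_kernel)
response_to_shifted = conv1d(shift_right(signal, 2), edge_kernel)

print(response)
print(response_to_shifted)
# Shifting the input and then convolving matches convolving and then shifting.
print(response_to_shifted == shift_right(response, 2))
```

The layer never had to learn that an edge at position 3 is the same thing as an edge at position 1. The architecture assumes it, which is exactly the head start the paragraph above describes.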

Inductive bias is the difference between "the system found a pattern because we gave it a head start" and "the system found a pattern from scratch." The right bias for the problem makes generalisation cheap; the wrong bias makes it nearly impossible.

Distribution shift

A model trained on one distribution often fails on another, even when the underlying task is the same. A robot policy trained in simulation often fails on the real robot because the simulator's pixels and physics are different from reality, even if the task ("pick up the cube") is the same. A medical AI trained on one hospital's scans often fails on another's because the imaging machines differ. A spam filter trained on 2010 spam fails on 2025 spam because spammers adapted.

Call this a failure of assumption rather than a failure of model. The model did what it was trained to do: minimise error on its training distribution. The mismatch is between that distribution and the world the model later met. This is the phenomenon called distribution shift, and many production AI failures are some form of it.

The fixes are some combination of: more diverse training data (more distributions in training), better inductive bias (so the model relies on structure that's stable across distributions), or explicit adaptation at test time. The principle: a model is only as general as the variation it has actually seen, plus whatever inductive bias bridges it to variations it hasn't.
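A toy version of the two-hospital case, with one of the fixes applied. Everything here is assumed for illustration: the readings, the constant scanner offset of 1.5, and the function names. The adaptation step (centring each site's readings on that site's own mean) needs no labels, but it only works because both sites are assumed to have the same mix of classes.

```python
from statistics import mean

def fit_threshold(values, labels):
    """Place the decision threshold midway between the two class means."""
    pos = mean(v for v, y in zip(values, labels) if y == 1)
    neg = mean(v for v, y in zip(values, labels) if y == 0)
    return (pos + neg) / 2

def accuracy(values, labels, threshold):
    return mean(int((v > threshold) == (y == 1)) for v, y in zip(values, labels))

def centre(values):
    """Unlabelled adaptation: subtract the batch's own mean."""
    m = mean(values)
    return [v - m for v in values]

labels = [0, 0, 0, 1, 1, 1]
site_a = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]       # training hospital
site_b = [v + 1.5 for v in site_a]            # same patients' signal, offset scanner

threshold = fit_threshold(site_a, labels)
print(accuracy(site_a, labels, threshold))    # in-distribution
print(accuracy(site_b, labels, threshold))    # under shift

centred_threshold = fit_threshold(centre(site_a), labels)
print(accuracy(centre(site_b), labels, centred_threshold))  # after adaptation
```

The task never changed between sites, only the scale of the readings. The raw threshold encoded a detail of site A's scanner; the centred one encodes the gap between the classes, which is the part that stays stable.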

Three modes side by side

Figure 3.1 contrasts the three modes against the same kind of test: a familiar input and a shifted input. The contrast makes the cost of memorisation under variation visible.

FIG 3.1. Three modes side by side. Same familiar input and same shifted input applied to each system. Memorisation gets a lucky surface match in-distribution and fails under shift. Compression handles in-distribution well but its clusters don't generalise to shifted inputs. Abstraction handles both because the rule is over learned features, not over surface forms.
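The same comparison can be run as a toy experiment. This is a sketch under stated assumptions, not the figure's actual setup: the hidden generator is taken to be y = 2x + 1, training inputs are 0 to 9, the bucket width of 2 is arbitrary, and the function names are hypothetical.

```python
from statistics import mean

train_x = list(range(10))
train_y = [2 * x + 1 for x in train_x]    # hidden generator: y = 2x + 1

# Mode 1: store examples.
table = dict(zip(train_x, train_y))

def memorise(x):
    return table.get(x)                   # None for anything not stored

# Mode 2: compress patterns into bucket averages.
buckets = {}
for x, y in zip(train_x, train_y):
    buckets.setdefault(x // 2, []).append(y)
codes = {b: mean(ys) for b, ys in buckets.items()}

def compress(x):
    nearest = min(codes, key=lambda b: abs(b - x // 2))
    return codes[nearest]                 # falls back to the closest known bucket

# Mode 3: fit the generator itself by least squares.
mx, my = mean(train_x), mean(train_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx

def abstract(x):
    return slope * x + intercept

for x in (4, 100):                        # familiar input, shifted input
    print(x, 2 * x + 1, memorise(x), compress(x), abstract(x))
```

On the familiar input all three are exact or close. On the shifted input the table has nothing, the buckets return the nearest stored average, and only the fitted line lands on the true value. The line wins because a linear model happens to match the generator, which is inductive bias doing its work.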

Back to the loop

In the L1 loop, the representation is where generalisation happens. The whole point of building an internal representation is that it captures the regularities you want to transfer. The optimisation step shapes the representation; the objective function determines which regularities get captured.

A bad objective (for example, "minimise error on training data, end of story") will produce a representation that memorises. A good objective plus the right inductive bias produces a representation that abstracts. This is why the field cares so much about training objectives, dropout, weight decay, data augmentation, and the rest of the regularisation toolkit. Each is an attempt to keep the representation from collapsing into memorisation when it could be doing abstraction.
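Weight decay is the easiest of these to write down. In its L2-penalty form the objective becomes the expression below. The symbols are assumptions introduced for this note: θ for the parameters, f_θ for the model, ℓ for the per-example loss, n for the number of training examples, and λ for the penalty strength.

```latex
\min_{\theta} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_{\theta}(x_i),\, y_i\bigr) \;+\; \lambda \,\lVert \theta \rVert_2^2
```

Setting λ = 0 recovers "minimise error on training data, end of story". A positive λ charges the model for every bit of weight it spends, so fitting an idiosyncratic training example has to pay for itself in reduced error.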

The takeaway

Modern AI became powerful when systems stopped merely storing examples and began building transferable internal representations. The shift wasn't a new algorithm. It was the realisation that the same prediction objective, applied to enough data with the right inductive bias and enough compute, produces a model whose representations abstract rather than memorise. Everything in Phase 2 onwards (gradients, optimisation landscapes, loss functions, scaling laws) is the apparatus that makes this happen reliably.

The blank page taught you that prediction is the engine. The folded map teaches you that the engine only matters if what it builds inside survives the journey out of the training distribution.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L3 Compare 3 systems trying to handle a new input: (a) a lookup-table-based spam filter trained on 2010 spam, (b) a statistical n-gram model trained on the same data, (c) a modern neural classifier trained on the same data. The new input is a 2025 spam message with novel phrasing. Describe what each system does and which one (if any) handles the new input well, and why.

(a) The lookup table compares the new message to stored 2010 examples. Surface match is near-zero (new phrasing, new spam tactics), so the filter marks the message as not-spam (or relies on weak default behaviour). It fails because it has no mechanism to extract why spam looks like spam beyond the strings it has seen.

(b) The n-gram model has compressed 2010 spam into statistics over short word sequences. Specific 2010 phrases won't appear, but some general structural patterns (capitals, suspicious links, certain syntactic markers) may still apply if the language hasn't drifted too far. Better than the lookup table, worse than (c), because its abstractions are shallow.

(c) The neural classifier has learned higher-level abstract features: sender reputation patterns, semantic incoherence between subject and body, urgency markers, link suspicion. Many of these remain valid 15 years later because they reflect spam's adversarial structure, not its 2010-specific surface. The classifier generalises across the distribution shift to the extent its features captured the actual mechanism of spam rather than its 2010 disguise.

The hierarchy (memorisation → compression → abstraction) maps directly to how each system handles distribution shift.

L3 A robotics team trains a manipulation policy entirely in simulation and observes that it fails when deployed on the real robot. From the L3 framing (memorisation, generalisation, inductive bias, distribution shift), what are 3 distinct fixes they could try, and what each fix would do to the system's representation?

(1) Domain randomisation: vary the simulator's textures, lighting, friction coefficients, and camera positions during training so the policy sees a wider distribution. The representation is forced to ignore details that vary across simulations and rely on features that stay stable. Effectively, this turns simulation variability into part of what the policy must generalise over.

(2) Stronger inductive bias toward physically meaningful features: architectures that work in coordinate frames, or that have built-in invariance to viewpoint, force the representation to encode what's physically constant rather than what's pixel-level present. The bias bridges sim and real where the data alone wouldn't.

(3) Test-time adaptation: fine-tune the policy briefly on real-robot data, even unlabelled, so the representation adjusts to the actual deployment distribution. This acknowledges that sim-to-real has a real gap and that asking pure generalisation to bridge it is too much.

All 3 are bridges between the training distribution and the deployment distribution; the choice depends on how much real-robot data is available and how much compute can be spent.
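For fix (1), the sampling step of domain randomisation might look like the sketch below. The parameter names and ranges are hypothetical placeholders; real values depend on the robot and the simulator, and no particular simulator API is assumed.

```python
import random

# Hypothetical parameter ranges, invented for this illustration.
RANGES = {
    "friction": (0.4, 1.2),
    "light_intensity": (0.3, 1.5),
    "camera_yaw_deg": (-10.0, 10.0),
}

def sample_sim_params(rng):
    """Draw one randomised simulator configuration, uniform within each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = random.Random(0)                    # seeded so runs are repeatable
episodes = [sample_sim_params(rng) for _ in range(3)]
for params in episodes:
    print(params)
```

Each training episode would run under a fresh draw, so no single friction value or lighting level is reliable enough for the policy to lean on.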

↳ L4 (Forward interleave to L4, emergence.) A neural net trained on next-token prediction shows essentially zero ability to do 3-digit arithmetic for the first months of training; then, somewhere around a particular scale, it starts answering correctly. The capability didn't exist; then it did. From your understanding of memorisation and generalisation, suggest at least 2 mechanically plausible explanations for what changed during that transition, and what additional information you would want before forming a confident view.

Plausible mechanisms:

(1) The representation crossed a threshold where addition-relevant abstractions (positional value, carry operations, digit-by-digit alignment) became encodable in the available capacity. Below the threshold, the model could only memorise specific arithmetic examples present in training; above it, an abstract algorithm became implementable in the weights, and the model started generalising to unseen number combinations.

(2) The optimisation found a basin in the loss landscape corresponding to a more general algorithm only once enough gradient information had accumulated; before that, simpler memorising heuristics had lower loss for the available compute.

(3) The capability was always nascent in the representation but masked by noise until enough updates sharpened the relevant pathways.

To form a confident view, ask for: evaluation on out-of-distribution number ranges (does it actually generalise, or just appear to?), interpretability probes of the relevant attention heads or MLP blocks during the transition, and ablation studies that toggle specific training data subsets to see which were necessary. The phenomenon is real and the topic is called emergence; L4 treats it head-on.

Next station

Lesson 4 sits at the dust on the bench (station 4) and looks at the strangest fact in this section. Capabilities don't always appear smoothly as systems scale; sometimes more becomes different. That's emergence, and the honest treatment of it is L4.
