Pattern, prediction, compression

Lesson 2. Phase 1: Foundations of intelligence. ~22 min read + cards + retrieval. Durability tier 1 (bedrock).

📄

Memory palace · Bench · station 2

The blank page. Latent structure waiting to be discovered, prediction reducing uncertainty, compression revealing pattern.

Core idea. Prediction and compression are the same problem dressed two ways. Both require an internal model of structure, and that single insight is the engine under modern AI.

Why this lesson exists

Most people treat prediction, compression, and intelligence as separate topics. They aren't. Compressing a file well requires you to model what's likely. Predicting the next token requires you to model what's likely. The two are one operation viewed from opposite ends.

Modern AI is built almost entirely on this single insight: train a system to predict, and a useful internal model of the world falls out as a byproduct. If you've ever wondered why next-token prediction was enough to produce systems that look like they understand things, the answer lives here.

Pattern and randomness

A pattern is something a shorter description can capture. Randomness is something only a full-length description can. That single distinction sits underneath all of modern AI.

Consider 3 strings, each 1 KB long:

ABABABAB... (alternating A and B for 1024 characters)
The first 1024 characters of Pride and Prejudice
1024 bytes from a hardware random number generator

Now ZIP them.

The first compresses to a few dozen bytes. The second compresses to roughly 600. The third stays at about 1024 bytes, or slightly more once the archive's own header overhead is counted. The compression ratio is a measurement of structure. Structure compresses; randomness doesn't.
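A minimal sketch of that experiment, under some loud assumptions: Python's standard zlib module (the DEFLATE algorithm that ZIP uses internally) stands in for a ZIP tool, os.urandom stands in for a hardware random number generator, and a short typed-in English passage stands in for the full 1 KB of Austen. Because the sample is shorter than 1 KB, the absolute sizes will not match the figures quoted in the text; the ordering is the point.

```python
import os
import zlib

# Stand-in English sample (assumption: much shorter than the 1 KB in the text).
ENGLISH = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife. However little known the "
    "feelings or views of such a man may be on his first entering a "
    "neighbourhood, this truth is so well fixed in the minds of the "
    "surrounding families, that he is considered the rightful property of "
    "some one or other of their daughters."
).encode("ascii")

n = len(ENGLISH)  # all three samples share this length
samples = {
    "alternating": (b"AB" * n)[:n],
    "english": ENGLISH,
    "random": os.urandom(n),  # OS randomness, standing in for a hardware RNG
}

# Compressed size in bytes at the highest zlib compression level.
sizes = {name: len(zlib.compress(data, 9)) for name, data in samples.items()}

for name, size in sizes.items():
    print(f"{name:12s} {n} -> {size} bytes (ratio {n / size:.2f}x)")
```

Run it a few times: the random sample changes on every run, and it still never shrinks.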

Now consider 3 predictions:

Predict the next character of ABABAB...
Predict the next word of Austen given the previous 200.
Predict the next byte from the hardware RNG.

The first is trivial. The second is hard but possible; a model that's seen enough English does it well. The third is impossible in the only sense that matters: no method, however clever, beats guessing at random (1 chance in 256 per byte). The same hierarchy. The same reason. Both tasks (compress, predict) lean on the same underlying fact: the string has internal structure, or it doesn't.
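The same hierarchy can be seen with a deliberately crude predictor. The sketch below is an assumption-heavy toy, not how a language model works: it guesses each symbol from the two symbols before it, using nothing but counts of what it has seen so far in the same string. The function name online_accuracy, the context length of 2, the short English sample, and os.urandom as the random source are all illustrative choices.

```python
import os
from collections import Counter, defaultdict

ENGLISH = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife. However little known the "
    "feelings or views of such a man may be on his first entering a "
    "neighbourhood, this truth is so well fixed in the minds of the "
    "surrounding families, that he is considered the rightful property of "
    "some one or other of their daughters."
).encode("ascii")

def online_accuracy(data: bytes, order: int = 2) -> float:
    """Fraction of symbols guessed correctly from the previous `order` symbols."""
    counts = defaultdict(Counter)  # context -> counts of what followed it so far
    hits = 0
    for i in range(order, len(data)):
        context = data[i - order:i]
        seen = counts[context]
        if seen:  # an unseen context counts as a miss
            guess = seen.most_common(1)[0][0]
            hits += guess == data[i]
        counts[context][data[i]] += 1  # learn from what actually came next
    return hits / (len(data) - order)

n = len(ENGLISH)
accuracy = {
    "alternating": online_accuracy((b"AB" * n)[:n]),
    "english": online_accuracy(ENGLISH),
    "random": online_accuracy(os.urandom(n)),  # stand-in for a hardware RNG
}
for name, acc in accuracy.items():
    print(f"{name:12s} {acc:.1%} of next symbols predicted")
```

Even this counting table separates the three cases. A better model would raise the English score; nothing would raise the random one.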

Two views, one operation

Compression and prediction are two views of one operation. To compress well, you need to know what is likely (so you can assign short codes to likely things). To predict well, you need to know what is likely (so you can pick it). The thing in common is "knowing what is likely," which is what an internal model of the data does.

The relationship is formal, not metaphorical. The source coding theorem makes it precise: no lossless code can use fewer bits per symbol, on average, than the entropy of the source, and entropy is exactly a measure of how unpredictable the source is. We'll meet the formal version in Phase 2. For now, hold the intuition: a model that compresses well and a model that predicts well are doing the same work.
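To make "short codes for likely things" concrete, here is a small sketch using Huffman coding, one classic way to build such a code (the lesson does not name a particular method, so treat the choice as an assumption). The symbol counts are hypothetical, and the helper assumes at least two distinct symbols.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over `weights`."""
    # Heap items: (total weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees; every symbol inside gets one bit longer.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical source: how often each symbol appears out of 16.
counts = {"A": 8, "B": 4, "C": 2, "D": 1, "E": 1}
total = sum(counts.values())

lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / total
source_entropy = entropy_bits([c / total for c in counts.values()])

print("code lengths:", lengths)
print("average bits per symbol:", avg_bits)
print("entropy of the source:  ", source_entropy)
```

For this source the likely symbol A gets a 1-bit code and the rare symbols D and E get 4-bit codes. The average comes to 1.875 bits per symbol, which equals the entropy here because every probability is a power of 1/2; a fixed-length code for 5 symbols would need 3 bits. Knowing what is likely is the whole saving.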

Examples once you see the pattern

Video codecs. A 4K movie is mostly redundant: the next frame is mostly the previous frame. Encoders use I-frames (a full picture, sent rarely) and P-frames (a description of what changed). The encoder is predicting that the next frame will look like the last; the size of the P-frame measures how wrong that prediction was. A scene cut produces a huge P-frame (the prediction failed, which is why encoders often spend a fresh I-frame there instead); a still shot produces a tiny one (the prediction succeeded). Compression ratio measures how predictable the video was.
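A toy version of the P-frame idea, heavily simplified: real codecs add motion compensation, transforms, and quantisation, none of which appear here. The sketch assumes a hypothetical 64×64 greyscale frame stored as raw bytes, predicts "next frame equals previous frame", and uses zlib to measure how many bytes the prediction error costs.

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the sketch is repeatable
PIXELS = 64 * 64         # hypothetical tiny greyscale frame, one byte per pixel

def random_frame() -> bytes:
    return bytes(rng.randrange(256) for _ in range(PIXELS))

def residual(prev: bytes, cur: bytes) -> bytes:
    """Per-pixel error of the prediction 'the next frame equals the previous one'."""
    return bytes((c - p) % 256 for p, c in zip(prev, cur))

def coded_size(prev: bytes, cur: bytes) -> int:
    """Bytes needed to store the residual after zlib compression."""
    return len(zlib.compress(residual(prev, cur), 9))

previous = random_frame()

# Still shot: the previous frame with a handful of pixels changed.
still = bytearray(previous)
for _ in range(20):
    still[rng.randrange(PIXELS)] = rng.randrange(256)
still = bytes(still)

# Scene cut: a frame unrelated to the previous one.
cut = random_frame()

still_size = coded_size(previous, still)
cut_size = coded_size(previous, cut)
print(f"still shot: {still_size} bytes, scene cut: {cut_size} bytes")
```

The frame content is noise in both cases, so neither frame compresses well on its own. Only the prediction makes the still shot cheap.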

Predictive text. Your phone keyboard suggests the next word based on what you've typed. It's a small language model running on your phone. When it nails the next word, the prediction was good (and you save a tap). When it fails on a rare name, the prediction was bad. Same operation as a frontier LLM, scaled down to fit in tens of megabytes.

ZIP. A general-purpose compressor (gzip, Brotli, LZMA) is a prediction engine in disguise. It scans the input, learns a small statistical model of what's common, and uses that model to assign short codes to common patterns. The "model" is sometimes as simple as a sliding-window dictionary, but it's still a model of regularities.
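Here is a toy sliding-window coder in the LZ77 family, to show what "a model of regularities" can mean at its simplest. It is a sketch, not the format any real tool writes: gzip's DEFLATE combines this kind of back-reference with Huffman coding and a different token layout. The function names, the triple format, and the window size are illustrative assumptions.

```python
def lz77_encode(data: str, window: int = 32):
    """Greedy toy LZ77: emit (offset, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i (overlap), which is how a
            # repeating run collapses into a single back-reference.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        # "Copy best_len symbols from best_off back, then emit one literal."
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples) -> str:
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])  # copy from `off` symbols back, one at a time
        buf.append(ch)
    return "".join(buf)

tokens = lz77_encode("AB" * 50)
print(len(tokens), "tokens for 100 characters:", tokens)
```

The back-reference is a prediction: "what comes next is what came `offset` symbols ago." On the alternating string that prediction holds almost to the end, so 100 characters collapse into 3 tokens.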

Branch prediction in CPUs. A modern CPU's branch predictor watches the recent history of conditional branches and tries to predict which way each one will go. When it predicts right, the pipeline stays full. When it predicts wrong, the pipeline has to be flushed and refilled, which typically costs on the order of 10-20 cycles. The predictor is a tiny in-silicon model of program structure. Cache prefetching is the same idea applied to memory accesses: predict which lines you'll need, fetch them speculatively. Hardware has been doing prediction for decades because it pays.
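A classic textbook scheme, the 2-bit saturating counter, shows how small such a model can be. The sketch below is a simplification under stated assumptions: one counter for one branch, no branch history, and made-up branch outcomes. Predictors in shipping CPUs are considerably more elaborate.

```python
import random

def two_bit_accuracy(outcomes) -> float:
    """Accuracy of a single 2-bit saturating counter on a sequence of branch outcomes."""
    state = 2  # 0 and 1 predict not-taken; 2 and 3 predict taken
    hits = 0
    for taken in outcomes:
        hits += (state >= 2) == taken
        # Nudge the counter toward what actually happened, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return hits / len(outcomes)

# A loop branch: taken 9 times, then falls through once, repeated 100 times.
loop_branch = ([True] * 9 + [False]) * 100

# A branch with no structure: independent coin flips (fixed seed for repeatability).
rng = random.Random(0)
coin_branch = [rng.random() < 0.5 for _ in range(10_000)]

loop_accuracy = two_bit_accuracy(loop_branch)
coin_accuracy = two_bit_accuracy(coin_branch)
print(f"loop branch: {loop_accuracy:.1%}, coin-flip branch: {coin_accuracy:.1%}")
```

The loop branch is predicted correctly 90% of the time (the only misses are the loop exits). The coin-flip branch hovers around 50%, because there is no structure for the counter to model.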

AlphaZero priors. The engine that outplayed the strongest existing chess, shogi, and Go programs doesn't enumerate every move. It has a policy network whose job is to predict which moves are worth searching. The predictions are priors over a tree search. The policy network is, again, a model of structure: what does a good move look like in this position. Prediction is what makes the search tractable.
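A sketch of how a prior can steer a search, in the spirit of the selection rule AlphaZero-style engines use: each candidate move is scored by its value estimate so far plus an exploration bonus weighted by the policy network's prior. Everything concrete here is an assumption for illustration: the function name puct_select, the constant 1.5, the "+1" under the square root (added so priors matter before any visits), and the move names and numbers.

```python
import math

def puct_select(stats, c_puct=1.5):
    """Pick the move with the highest (value estimate + prior-weighted bonus).

    `stats` maps move -> (prior, visits, mean_value).
    """
    total_visits = sum(visits for _, visits, _ in stats.values())

    def score(item):
        _, (prior, visits, mean_value) = item
        # The bonus is large for moves the policy likes and shrinks as a move is visited.
        bonus = c_puct * prior * math.sqrt(total_visits + 1) / (1 + visits)
        return mean_value + bonus

    return max(stats.items(), key=score)[0]

# Before any search: only the policy network's priors distinguish the moves.
fresh = {"e4": (0.6, 0, 0.0), "d4": (0.3, 0, 0.0), "a3": (0.1, 0, 0.0)}
first_choice = puct_select(fresh)

# After 10 visits to e4 that went badly, the search moves on to the next-best prior.
later = {"e4": (0.6, 10, -0.5), "d4": (0.3, 0, 0.0), "a3": (0.1, 0, 0.0)}
second_choice = puct_select(later)

print(first_choice, second_choice)
```

The low-prior move a3 is never tried in either situation. That is the tractability win: the prediction decides where the search spends its effort.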

The leap modern AI made

Train a big enough model on next-token prediction over enough text, and what comes out is not just a token predictor. It's a system that has, in its weights, an internal representation of:

Syntax (it predicts grammatical tokens after grammatical contexts).
Semantics (it predicts coherent meanings).
World knowledge (it predicts that "Paris" follows "the capital of France is").
Style (it predicts formal tokens after formal contexts).
Reasoning patterns (it predicts the conclusion that follows a premise).

Nobody put those representations into the model. They fell out as side effects of getting good at prediction. This is the key empirical claim modern AI rests on, and the reason "just predict the next token" turned out to be more profound than it looked when the first transformers landed.

Wonder, not mysticism

The careful thing to say. The model has an internal world model in the technical sense: a learned representation of the regularities it has been trained to predict. That isn't consciousness, self-awareness, or understanding in the human sense. It's a tensor of numbers that, when run the way the system was designed to run, produces outputs consistent with the structure of its training data.

The reason this looks like understanding is that human understanding also produces outputs consistent with the structure of human data. Same outputs do not mean same mechanism. We'll come back to this carefully in Phase 7. For now: the surprise is real, the mechanism is nameable, and there's no mysticism in the explanation.

Back to the loop

In the language of L1: prediction error is what the optimisation pushes against. The objective function for a language model says "the right answer is the token that actually appeared next in the training data; reward weights that gave it high probability, punish weights that didn't." That single signal, applied billions of times, drags the system toward modelling the structure of its inputs.

Input goes in; the representation gets built; predictions come out; prediction error feeds back. Same loop you saw in L1, with prediction as the target.
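One turn of that loop can be sketched in a few lines. The objective described above is usually written as cross-entropy: the loss is the negative log of the probability the model gave to the token that actually came next. The sketch makes a loud simplification: it treats the scores (logits) for a 3-token vocabulary as the trainable parameters themselves, where a real model would have weights that produce those scores. The names, vocabulary size, and learning rate are illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss = -log(probability assigned to the token that actually came next)."""
    return -math.log(softmax(logits)[target])

def sgd_step(logits, target, lr=1.0):
    """One gradient step. d(loss)/d(logit_i) = softmax_i - 1 if i is the target, else softmax_i."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]  # no preference among 3 hypothetical tokens
target = 1                # index of the token that actually appeared next

loss_before = cross_entropy(logits, target)
updated = sgd_step(logits, target)
loss_after = cross_entropy(updated, target)

print(f"loss before: {loss_before:.3f}, after one step: {loss_after:.3f}")
print("probabilities after:", [round(p, 3) for p in softmax(updated)])
```

The step raises the score of the token that appeared and lowers the others, so the loss falls. Repeat that over billions of contexts and the only way to keep lowering the loss is to model what makes tokens likely.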

The figure

Figure 2.1 makes the structure-versus-randomness contrast visible: structured data compresses to a fraction of its original size and yields a confident prediction; random data does neither.

FIG 2.1. Structured data (left): compresses ~25× and supports a peaked next-item distribution. Random data (right): doesn't compress and produces a uniform next-item distribution. The compression ratio and the sharpness of the predictive distribution measure the same thing: how much structure is in the data.

Three things to hold onto

Compression and prediction are the same operation. Structure makes both possible; randomness blocks both.
A predictive system has an internal model of the regularities it predicts over. That model is neither magical nor conscious. It's whatever weights the optimisation found.
Modern AI scaled this idea hard. Next-token prediction turned out to be enough of a training signal to extract a useful internal model of language, knowledge, and reasoning patterns.

Prediction sits at the centre of every intelligence system. The optimisation loop, the representation work, the feedback signal: every part of the design pulls toward better prediction. That single fact is the engine under modern AI.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L2 Compare 3 cases for how they respond to being asked to "predict the next item." (a) A predictive system trained on Shakespeare. (b) A 1990s rule-based expert system. (c) Output from a hardware random number generator. For each, describe what the system can predict, what its prediction relies on, and what it cannot predict and why.

(a) The predictive system has learned a statistical model from training data. It predicts the next token using regularities found in Shakespeare's text (vocabulary, syntax, character voice patterns, period style). It fails on inputs whose structure differs from the training data, because the regularities it modelled are absent or different there.
(b) The rule-based expert system can only predict what its hand-coded rules cover. Anything outside the rules gets no prediction or a fallback. It doesn't generalise to inputs the author didn't anticipate.
(c) The hardware RNG produces unpredictable output by design. No system can predict its next byte better than chance because no structure exists in the data.
The 3 cases form a hierarchy: rule-based (predicts only what was written by hand), predictive (predicts whatever the training data taught), random (cannot be predicted at all). The compression view says the same thing: hand-written rules capture only the regularities their author thought to write down, random output has no regularities to capture, and only the learned predictive model extracts the regularities actually present in its data, which is what lets it compress and predict together.

L2 A language model trained only on next-token prediction (no instructions, no labels, no rewards) ends up able to answer questions, summarise text, and follow patterns it has never seen explicitly. Explain mechanically why this happens. Why is the prediction objective alone enough?

The training objective rewards the model for assigning high probability to the token that actually came next. To do that reliably across billions of contexts, the model has to develop internal representations of the regularities that produced those token sequences: syntax (so it knows what tokens are grammatically possible next), semantics (so it knows which tokens are meaningfully consistent with what came before), world knowledge (so it knows that "the capital of France is" should be followed by "Paris"), and reasoning patterns (so it knows that "premise A, premise B, therefore" should be followed by the right conclusion). None of those representations are explicit goals of the training. They are necessary preconditions for prediction quality. The model "answers a question" by predicting the tokens that would follow the question, which is the same operation as predicting the next token of Wikipedia. There's no extra mechanism. The depth of capability tracks the richness of the internal model the prediction objective forced into existence.

↳ L3 (Forward interleave to L3, generalisation.) A model that achieves near-zero error on its training data may still perform badly on data it has never seen. From the prediction-as-compression lens of this lesson, suggest why this could happen, and what kind of internal model would generalise better. Reason from the structure intuition; don't worry about formal answers.

A model that achieves zero error on training data might have done it the wrong way: by memorising specific examples instead of learning the regularities behind them. In the compression view: instead of finding a short description of the data's structure, it stored a long description of the specific data. A long description doesn't transfer to new examples that share the same underlying structure but not the surface form. A model that generalises is one whose internal representation captured the regularities (the "compression"), which then apply to new examples drawn from the same source. The general principle: shorter explanations of the data transfer better than longer ones. Lesson 3 picks this up. The formal version is the bias-variance trade-off; the practical version is everything you do to stop a model from memorising.
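The contrast between a long description and a short one can be made concrete with a toy. The data, the rule behind it (y = 2x + 1), and both "models" below are hypothetical: a lookup table stands in for memorisation, and a least-squares line stands in for a model that found the regularity.

```python
# Hypothetical training data generated by a simple rule: y = 2x + 1.
train = {x: 2 * x + 1 for x in range(10)}

def memoriser(x):
    """Long description: one stored entry per training example."""
    return train.get(x)  # None for anything it has not seen

def fit_line(pairs):
    """Short description: a slope and an intercept, fitted by least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        / sum((x - mean_x) ** 2 for x, _ in pairs)
    )
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(list(train.items()))

def line_model(x):
    return slope * x + intercept

# Both are perfect on the training data.
train_errors = [abs(line_model(x) - y) + abs(memoriser(x) - y) for x, y in train.items()]

# Only one has anything to say about an input it never saw.
unseen = 25
print("memoriser:", memoriser(unseen), " line model:", line_model(unseen))
```

Both reach zero training error. The lookup table needs 10 entries and returns nothing for x = 25; the line needs 2 numbers and answers 51.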

Next station

Lesson 3 stands at the folded map (the next station along the bench). The question: a model that predicts well on data it's seen, does it predict well on data it hasn't? That's generalisation, and the honest answer is more interesting than "sometimes."
