Most people treat prediction, compression, and intelligence as separate topics. They aren't. Compressing a file well requires you to model what's likely. Predicting the next token requires you to model what's likely. Both are the same operation viewed from different ends.
Modern AI is built almost entirely on this single insight: train a system to predict, and a useful internal model of the world falls out as a byproduct. If you've ever wondered why next-token prediction was enough to produce systems that look like they understand things, the answer lives here.
A pattern is something a shorter description can capture. Randomness is something only a full-length description can. That single distinction sits underneath all of modern AI.
Consider 3 strings, each 1 KB long:
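ABABABAB... (alternating A and B for 1024 characters)
A passage of ordinary English prose (1024 characters)
Random characters (1024 characters with no pattern)
Now ZIP them.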
The first compresses to a few dozen bytes. The second compresses to maybe 600 bytes. The third stays at ~1024 bytes. The compression ratio is a measure of structure. Structure compresses; randomness doesn't.
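You can run a version of this experiment yourself. The sketch below uses Python's standard `zlib` module as a stand-in for ZIP (both use DEFLATE). The English sample is made up for illustration and is shorter than 1 KB to keep the listing readable, so compare ratios rather than absolute sizes. The names `ratio`, `structured`, `english`, and `noise` are ours.

```python
import random
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more structure)."""
    return len(zlib.compress(data, 9)) / len(data)

# 1. Pure pattern: alternating A and B, 1024 bytes.
structured = b"AB" * 512

# 2. Ordinary English prose (an invented sample, shorter than 1 KB).
english = (
    b"Most of what we write is predictable. Common words come back again "
    b"and again, letters follow each other in familiar orders, and a reader "
    b"who has seen enough text can often guess what is coming. That is why "
    b"ordinary prose shrinks when you compress it, though never as far as "
    b"a pure pattern does, because every sentence still carries some news."
)

# 3. Pseudo-random bytes from a seeded generator, 1024 bytes.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(1024))

for name, data in [("structured", structured),
                   ("english", english),
                   ("noise", noise)]:
    print(f"{name:10s} {len(data):5d} bytes -> ratio {ratio(data):.2f}")
```

The exact numbers depend on the compressor and the sample, but the ordering should hold: the pattern shrinks to almost nothing, the prose shrinks partway, and the noise does not shrink at all.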
Now consider 3 predictions:
ABABAB...
The first is trivial. The second is hard but possible; a model that's seen enough English does it well. The third is impossible no matter how clever you are. The same hierarchy. The same reason. Both tasks (compress, predict) lean on the same underlying fact: the string has internal structure, or it doesn't.
Compression and prediction are two views of one operation. To compress well, you need to know what is likely (so you can assign short codes to likely things). To predict well, you need to know what is likely (so you can pick it). The thing in common is "knowing what is likely," which is what an internal model of the data does.
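One way to see "short codes for likely things" in numbers: an ideal coder spends about -log2(p) bits on a symbol the model gave probability p. The sketch below scores the ABAB string under two toy models. Both models, and the 0.99 confidence figure, are assumptions chosen for illustration, not anything a real compressor uses.

```python
import math

def code_length_bits(text, predict):
    """Ideal code length: sum of -log2 p(symbol | context) over the text."""
    total = 0.0
    for i, symbol in enumerate(text):
        probs = predict(text[:i])            # the model's guess, given the past
        total += -math.log2(probs[symbol])   # likely symbols cost few bits
    return total

def uniform_model(context):
    """Knows nothing: A and B are always equally likely."""
    return {"A": 0.5, "B": 0.5}

def alternation_model(context):
    """Has learned the pattern: expects the symbol that differs from the last."""
    if not context:
        return {"A": 0.5, "B": 0.5}
    last = context[-1]
    other = "B" if last == "A" else "A"
    return {other: 0.99, last: 0.01}

text = "AB" * 512
uniform_bits = code_length_bits(text, uniform_model)
pattern_bits = code_length_bits(text, alternation_model)
print(f"uniform model:     {uniform_bits:.1f} bits")
print(f"alternation model: {pattern_bits:.1f} bits")
```

The model that knows nothing needs 1024 bits; the model that predicts the alternation needs roughly 16. The better predictor is the better compressor, and nothing else changed.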
The relationship is formal, not metaphorical. The source coding theorem makes it precise: the best possible compression of a data source is bounded by how predictable that source is. We'll meet the formal version in Phase 2. For now, hold the intuition: a model that compresses well and a model that predicts well are doing the same work.
Video codecs. A 4K movie is mostly redundant: the next frame is mostly the previous frame. Encoders use I-frames (a full picture, sent rarely) and P-frames (a description of what changed). The encoder is predicting that the next frame will look like the last; the size of the P-frame measures how wrong that prediction was. A scene cut produces a huge P-frame (the prediction failed); a still shot produces a tiny one (the prediction succeeded). Compression ratio measures how predictable the video was.
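A toy version of the P-frame idea, under heavy assumptions: frames are flat byte strings, the "prediction" is simply the previous frame, and the residual is compressed with `zlib`. Real codecs add motion compensation, transforms, and quantisation, none of which appear here. `p_frame_size` and `random_frame` are hypothetical helper names.

```python
import random
import zlib

def p_frame_size(previous: bytes, current: bytes) -> int:
    """Size of a toy 'P-frame': the compressed difference from the last frame."""
    residual = bytes((c - p) % 256 for p, c in zip(previous, current))
    return len(zlib.compress(residual, 9))

rng = random.Random(0)

def random_frame(n_pixels: int = 4096) -> bytes:
    """An arbitrary picture: n_pixels pseudo-random byte values."""
    return bytes(rng.getrandbits(8) for _ in range(n_pixels))

frame_a = random_frame()
still_shot = frame_a          # nothing moved: the prediction is exactly right
scene_cut = random_frame()    # unrelated picture: the prediction is useless

still_size = p_frame_size(frame_a, still_shot)
cut_size = p_frame_size(frame_a, scene_cut)
print(f"still shot residual: {still_size} bytes")
print(f"scene cut residual:  {cut_size} bytes")
```

When the prediction holds, the residual is all zeros and compresses to almost nothing. When it fails, the residual is as large as a fresh frame.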
Predictive text. Your phone keyboard suggests the next word based on what you've typed. It's a small language model running on your phone. When it nails the next word, the prediction was good (and you save a tap). When it fails on a rare name, the prediction was bad. Same operation as a frontier LLM, scaled down to fit in tens of megabytes.
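The smallest possible version of this is a bigram counter: remember which word followed which, and suggest the most frequent follower. This is a sketch of the idea only; we are not claiming any particular keyboard works this way, and the corpus is invented.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words followed it and how often."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def suggest(model, word):
    """Most frequent follower of `word`, or None if the word was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)

print(suggest(model, "the"))     # "cat" followed "the" most often
print(suggest(model, "zebra"))   # never seen, so no suggestion
```

The failure on an unseen word is the same failure your keyboard has on a rare name: no statistics, no prediction.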
ZIP. A general-purpose compressor (gzip, Brotli, LZMA) is a prediction engine in disguise. It scans the input, learns a small statistical model of what's common, and uses that model to assign short codes to common patterns. The "model" is sometimes as simple as a sliding-window dictionary, but it's still a model of regularities.
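You can watch the "model" matter by handing the compressor some prior knowledge. Python's `zlib` accepts a preset dictionary (`zdict`) that primes the sliding window before any input arrives. The sketch below compresses the same short message with and without one. The message, the dictionary, and the `deflate` helper are made up for illustration.

```python
import zlib

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with DEFLATE, optionally primed with a preset dictionary."""
    if zdict:
        compressor = zlib.compressobj(level=9, zdict=zdict)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()

message = b"prediction and compression are two views of one operation"
prior = b"compression and prediction are two views of one operation"

plain = deflate(message)
primed = deflate(message, prior)
print(f"no prior knowledge: {len(plain)} bytes")
print(f"with a dictionary:  {len(primed)} bytes")

# The receiver needs the same dictionary to undo the compression.
restored = zlib.decompressobj(zdict=prior).decompress(primed)
```

The primed compressor already "expects" most of the message, so it spends far fewer bytes on it. The catch is that the decoder must hold the same model.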
Branch prediction in CPUs. A modern CPU's branch predictor watches the recent history of conditional branches and tries to predict which way each one will go. When it predicts right, the pipeline stays full. When it predicts wrong, you eat a 10-15 cycle stall. The predictor is a tiny in-silicon model of program structure. Cache prefetching is the same idea applied to memory accesses: predict which lines you'll need, fetch them speculatively. Hardware has been doing prediction for 30 years because it pays.
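The classic textbook starting point is a 2-bit saturating counter per branch. The simulation below is a sketch of that scheme only; predictors in real CPUs are far more elaborate, and the two branch patterns here are invented.

```python
import random

def simulate(outcomes):
    """Fraction of branches a 2-bit saturating counter predicts correctly."""
    state = 2                      # states 0-1 predict not-taken, 2-3 predict taken
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += predicted_taken == taken
        # Nudge the counter toward what actually happened, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken nine times, then falls through once, repeated.
loop_branch = ([True] * 9 + [False]) * 100

# A branch that depends on a coin flip: no structure to learn.
rng = random.Random(0)
coin_flips = [rng.random() < 0.5 for _ in range(1000)]

loop_accuracy = simulate(loop_branch)
coin_accuracy = simulate(coin_flips)
print(f"loop branch: {loop_accuracy:.2f}")
print(f"coin flips:  {coin_accuracy:.2f}")
```

The loop branch is predicted correctly 90% of the time (only the exit is missed). The coin-flip branch hovers around chance. Same predictor, different amounts of structure.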
AlphaZero priors. The chess and Go engine that beat the world doesn't enumerate every move. It has a policy network whose job is to predict which moves are worth searching. The predictions are priors over a tree search. The policy network is, again, a model of structure: what does a good move look like in this position. Prediction is what makes the search tractable.
Train a big enough model on next-token prediction over enough text, and what comes out is not just a token predictor. It's a system that has, in its weights, an internal representation of:
Nobody put those representations into the model. They fell out as side effects of getting good at prediction. This is the key empirical claim modern AI rests on, and the reason "just predict the next token" turned out to be more profound than it looked when the first transformers landed.
The careful thing to say. The model has an internal world model in the technical sense: a learned representation of the regularities it has been trained to predict. That isn't consciousness, self-awareness, or understanding in the human sense. It's a tensor of numbers that, when used as intended, produces outputs consistent with the structure of its training data.
The reason this looks like understanding is that human understanding also produces outputs consistent with the structure of human data. Same outputs do not mean same mechanism. We'll come back to this carefully in Phase 7. For now: the surprise is real, the mechanism is nameable, and there's no mysticism in the explanation.
In the language of L1: prediction is the target that optimisation works toward. The objective function for a language model says "the right answer is the token that actually appeared next in the training data; reward weights that gave it high probability, punish weights that didn't." That single signal, applied billions of times, drags the system toward modelling the structure of its inputs.
Input goes in; the representation gets built; predictions come out; prediction error feeds back. Same loop you saw in L1, with prediction as the target.
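Here is that objective at the smallest scale we can manage: one context, a three-word vocabulary, and a loss of -log p(actual next token). This is a hand-rolled sketch, assuming plain gradient descent directly on the scores; the vocabulary, learning rate, and step count are arbitrary, and a real model adjusts weights that produce the scores rather than the scores themselves.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    """Cross-entropy: -log of the probability given to the token that appeared."""
    return -math.log(softmax(logits)[target])

def gradient_step(logits, target, lr=0.5):
    """One descent step; d(loss)/d(logit_i) is p_i, minus 1 for the target."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

vocab = ["mat", "dog", "moon"]
target = vocab.index("mat")      # the token that actually came next
logits = [0.0, 0.0, 0.0]         # the model starts with no opinion

losses = []
for _ in range(20):
    losses.append(loss(logits, target))
    logits = gradient_step(logits, target)

print(f"first loss: {losses[0]:.3f}")
print(f"last loss:  {losses[-1]:.3f}")
```

The loss starts at log(3), the cost of a blind guess among three words, and falls with every step as probability moves onto the token that actually appeared. Scale that up by many orders of magnitude and you have the training signal described above.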
Figure 2.1 makes the structure-versus-randomness contrast visible: structured data compresses to a fraction of its original size and yields a confident prediction; random data does neither.
Prediction sits at the centre of every intelligence system. The optimisation loop, the representation work, the feedback signal: every part of the design pulls toward better prediction. That single fact is the engine under modern AI.
Lesson 3 stands at the folded map (the next station along the bench). The question: does a model that predicts well on data it has seen also predict well on data it hasn't? That's generalisation, and the honest answer is more interesting than "sometimes."