Most of this course has talked about what AI systems do. This lesson is about the form information takes inside them while they do it.
Nothing the system computes touches the world directly. Pixels become tensors. Text becomes integers, then float vectors. Joint angles become coordinate-frame embeddings. A board position becomes a 64-element tensor. What the system actually operates on is a transformed signal, never reality.
That transformation is the encoding. Everything downstream (the optimisation, the loss, the policy, the prediction) runs on the encoded representation, not on the world that produced it. Choose a bad representation and no amount of compute will teach the system what you wanted. Choose a good one and the same algorithm finds the problem close to trivial.
A megapixel RGB image is 3M numbers. A 30-second audio clip at 48 kHz is 1.4M samples. A robot's continuous sensor stream is hundreds of values per millisecond. None of these is a good form to learn from directly: each contains enormous amounts of variation irrelevant to whatever question the system is trying to answer.
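The sizes quoted above are plain arithmetic. A minimal sketch, assuming a megapixel means 1,000,000 pixels with three colour channels and that the audio clip is mono (both are assumptions, not stated in the lesson):

```python
# Raw input sizes, worked out explicitly.
# Assumptions: 1 megapixel = 1,000,000 pixels, RGB (3 channels), mono audio.
pixels = 1_000_000
channels = 3
image_values = pixels * channels               # numbers in one image

sample_rate_hz = 48_000
duration_s = 30
audio_samples = sample_rate_hz * duration_s    # samples in one clip

print(f"image: {image_values:,} values")
print(f"audio: {audio_samples:,} samples")
```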
Take "is there a cat in this image". The raw pixels carry the answer, but they also carry lighting, pose, camera angle, focal length, background clutter, and JPEG compression artefacts. An optimiser fed raw pixels has to discover, through the data, that all of those vary independently of cat-presence. With enough labelled data and a flexible enough model, it can. Without them, it can't.
Now imagine the same problem against intermediate representations: lines, edges, textures, parts, animals. The "is there a cat" question becomes near-trivial against the right intermediate features. This is the idea behind layered representation in deep vision systems: each layer transforms the previous into something closer to what the question cares about.
A good representation preserves the structure relevant to the task and discards everything else. The "everything else" is usually most of the raw signal.
JPEG is the clean analogy. It applies a discrete cosine transform, a close relative of the Fourier transform, to 8×8 blocks of pixels, converting pixel values into frequency components. Most of an image's perceptually relevant information lives in the low-frequency components; high-frequency components carry detail the eye barely sees. JPEG keeps the low frequencies at fine precision and quantises the high frequencies harshly, often to zero, discarding what doesn't matter for human perception. The representation is much smaller and, for a human viewer, nearly as good.
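The "keep the low frequencies" idea can be sketched with a one-dimensional discrete cosine transform, the Fourier-family transform JPEG applies to 8×8 blocks. This is an illustration under simplifying assumptions, not JPEG itself: it works on a single hypothetical row of eight pixel values, and it zeroes the discarded coefficients instead of using JPEG's quantisation tables and entropy coding.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k)
                               for i in range(n)))
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * coeffs[k]
                * math.cos(math.pi / n * (i + 0.5) * k) for k in range(n))
            for i in range(n)]

# Hypothetical row of pixel intensities: a smooth brightness ramp.
row = [100, 104, 108, 112, 116, 120, 124, 128]

coeffs = dct(row)
# Keep only the two lowest-frequency coefficients, drop the other six.
kept = coeffs[:2] + [0.0] * (len(coeffs) - 2)
approx = idct(kept)

energy_kept = sum(c * c for c in kept) / sum(c * c for c in coeffs)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(f"energy kept: {energy_kept:.5f}, worst pixel error: {max_err:.2f}")
```

For a smooth signal like this one, a quarter of the coefficients carries almost all of the energy. A row full of sharp edges would need more coefficients, and that is the detail JPEG chooses to spend fewer bits on.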
The same principle holds in language. A word as a one-hot vector (a 50,000-dimensional vector with a single 1) carries no similarity information: every pair of distinct words is exactly the same distance apart, so nothing about the geometry tells you that "dog" and "cat" are closer than "dog" and "rectangle". A word embedding (a dense vector of, say, 300 dimensions) places semantically similar words at similar points in space. Same word, different representation, completely different operational properties.
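The contrast can be made concrete with cosine similarity. A minimal sketch: the three-word vocabulary and the dense vectors below are hand-picked for illustration, not taken from any trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["dog", "cat", "rectangle"]

# One-hot: each word gets its own axis.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense embeddings (hand-written, for illustration only).
dense = {
    "dog":       [0.9, 0.8, 0.1],
    "cat":       [0.8, 0.9, 0.1],
    "rectangle": [0.1, 0.0, 0.9],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    print(name,
          "dog~cat:", round(cosine(table["dog"], table["cat"]), 3),
          "dog~rectangle:", round(cosine(table["dog"], table["rectangle"]), 3))
```

In the one-hot table both similarities are zero, so the geometry cannot rank "cat" above "rectangle". In the dense table it can.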
Invariance is the technical term for the discarding part. A representation is invariant to some transformation when applying that transformation to the input doesn't change the representation.
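CNN architectures bake in approximate translation invariance. Convolution applies the same filters at every position, so a shifted cat produces shifted feature maps, and pooling then discards most of the remaining position information. Shift a cat in the image and the network's later layers see almost the same activation pattern. Cats remain cats when they move. The invariance reduces the amount of training data needed; the network doesn't have to see every cat in every position.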
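A toy version of that invariance test, in one dimension. This is a sketch under stated assumptions: a single hand-written filter, a hypothetical three-value "cat" pattern, and global max pooling standing in for the pooling stages of a real network.

```python
def correlate(signal, kernel):
    """Slide the kernel over the signal (valid positions only)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def global_max_pool(feature_map):
    """Collapse the feature map to its strongest response, dropping position."""
    return max(feature_map)

PATTERN = [1, 3, 1]   # hypothetical "cat" signature
KERNEL = [1, 3, 1]    # filter tuned to that signature

def place(position, length=12):
    """Return an all-zero signal with PATTERN written at the given position."""
    signal = [0] * length
    signal[position:position + len(PATTERN)] = PATTERN
    return signal

features_a = correlate(place(2), KERNEL)
features_b = correlate(place(7), KERNEL)

# The feature maps differ: the peak moves with the input.
print(features_a.index(max(features_a)), features_b.index(max(features_b)))
# The pooled representation does not.
print(global_max_pool(features_a), global_max_pool(features_b))
```

The convolution output moves when the input moves, and only the pooling step makes the final value independent of position. Real networks pool over small windows, which is why the invariance is approximate.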
Word embeddings carry semantic invariance: "purchase" and "buy" land at nearby points despite zero overlap in spelling. Robot pose representations expressed in body coordinates bake in invariance to robot location, so a policy learned in one room transfers to another. Audio systems use mel-spectrograms instead of raw waveforms partly because mel filtering smooths over fine harmonic detail, which gives approximate pitch invariance for speech: the same word spoken at different pitches looks similar in the representation but different in the raw signal.
A good representation lives in a latent space where geometry has meaning. Nearby points should represent similar things. Directions should encode interpretable axes of variation. Clusters should group inputs that share operational structure.
The famous word2vec property (king − man + woman ≈ queen) is geometric. The embedding space had a direction encoding gender, another encoding royalty, and both came out as side effects of predicting nearby words. No one programmed those axes; they emerged from the structure of language and the prediction objective.
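The arithmetic behind the analogy is short enough to write out. A minimal sketch with hand-built vectors: the three axes (roughly "royalty", "gender", "fruit") and every number below are invented for illustration, whereas in word2vec the directions are learned and far less tidy.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings: [royalty, gender, fruit].
embeddings = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the query words from the candidates follows common practice for this kind of evaluation. Without that step the nearest vector is often one of the inputs.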
Image embeddings show the same shape at scale. Train a contrastive vision-language model on hundreds of millions of image-text pairs and the resulting embeddings cluster by content: photos of dogs near other photos of dogs, photos of mountains near other photos of mountains. Distance in the embedding space becomes a usable proxy for semantic similarity. Nearest-neighbour search in a good embedding space is a usable retrieval mechanism. Linear classifiers on top of good embeddings often beat sophisticated classifiers on top of raw inputs. The geometry of the representation does most of the work.
L2 framed prediction and compression as two views of the same operation. Representation is the third view, looked at from inside the system.
A good representation is a compressed description of the input that preserves what matters for the task. The compression ratio is the gap between raw input size and representation size. The information content of the representation is whatever survives the compression. Both quantities matter: too much compression and relevant structure is lost; too little and the optimiser is still drowning in irrelevant variation.
This is why representation learning became central to deep learning around 2012. The CNNs trained end-to-end on ImageNet did not win because any single layer beat the handcrafted features that preceded them. They won because every layer of the representation was learned jointly with the classifier, which left a final classification step that was close to trivial.
Representations are tensors. Tensors are what GPUs accelerate. The connection between representation engineering and hardware engineering is closer than it looks.
Embedding tables are the simplest case. A vocabulary of 50K tokens times a 4096-dimensional embedding is 200M parameters, sitting in VRAM, hit per-token on every forward pass. The memory bandwidth of fetching those rows shapes the inference budget. For very large embedding tables (recommender systems with billions of item embeddings), this becomes a primary cost driver.
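The parameter count and the per-token fetch are easy to work out. A rough sketch, assuming 2 bytes per parameter (fp16 or bf16 storage), which the lesson does not specify:

```python
# Embedding table size for the figures in the text.
vocab_size = 50_000
embed_dim = 4096
bytes_per_param = 2   # assumption: fp16/bf16 weights

params = vocab_size * embed_dim              # total parameters in the table
table_bytes = params * bytes_per_param       # memory the table occupies
row_bytes = embed_dim * bytes_per_param      # bytes fetched per token lookup

print(f"parameters: {params:,}")
print(f"table size: {table_bytes / 1e6:.1f} MB")
print(f"per-token fetch: {row_bytes:,} bytes")
```

The exact count is 204.8M, which the text rounds to 200M. At fp32 the byte figures double.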
Dense representations map cleanly to matmul. Sparse representations do not. The reason modern systems use dense embeddings rather than one-hot or sparse feature vectors is mostly hardware: dense vectors are what tensor cores accelerate. The representational choice and the silicon constraint move together. Cache locality matters too. Sequential access through a representation hits cache predictably; random lookup through an unsorted table does not. Systems that scale well at inference arrange their representations so the access pattern is friendly to the memory hierarchy.
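One way to see why one-hot inputs are a poor fit for matmul hardware: multiplying a one-hot vector by a weight matrix is just a row lookup, with almost every multiplication spent on a zero. A minimal sketch with a hypothetical 4×2 table:

```python
def matvec_one_hot(one_hot, table):
    """Multiply a length-V one-hot vector by a V x D table, the slow way."""
    dim = len(table[0])
    return [sum(one_hot[v] * table[v][d] for v in range(len(table)))
            for d in range(dim)]

# Hypothetical embedding table: 4 tokens, 2 dimensions.
table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

token_id = 2
one_hot = [1.0 if v == token_id else 0.0 for v in range(len(table))]

via_matmul = matvec_one_hot(one_hot, table)   # V * D multiplications
via_lookup = table[token_id]                  # one row fetch, no arithmetic

multiplications = len(table) * len(table[0])
print(via_matmul, via_lookup, multiplications)
```

At a 50K vocabulary the ratio of wasted to useful work is the same, only larger. This is why the lookup is implemented as a gather and the dense vector it returns is what the matmul units go on to consume.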
A hardware echo: branch predictor history tables from L2 and L4 are themselves a representation. The CPU encodes "what happened recently at this branch site" as a small bit-pattern, hashes it into a table, and uses the result as a learned representation of program behaviour. Compilers do something similar with intermediate representations: source code is compressed into IR before optimisation, and the choice of IR shapes which optimisations the compiler can express.
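The branch-history idea can be sketched in a few lines. This is a simplified, gshare-style illustration under stated assumptions, not a model of any particular CPU: one hypothetical branch address, a 16-entry table of 2-bit saturating counters, and a history register XORed with the address to form the index.

```python
def simulate(outcomes, history_bits, pc=0x40, table_bits=4):
    """Return prediction accuracy over a sequence of taken/not-taken outcomes."""
    table = [1] * (1 << table_bits)       # 2-bit counters, start weakly not-taken
    history = 0                           # recent outcomes packed into bits
    history_mask = (1 << history_bits) - 1
    index_mask = (1 << table_bits) - 1
    correct = 0
    for taken in outcomes:
        index = (pc ^ history) & index_mask       # hash site + history into table
        predicted_taken = table[index] >= 2
        correct += predicted_taken == taken
        # Train the counter towards the observed outcome, saturating at 0 and 3.
        table[index] = min(3, table[index] + 1) if taken else max(0, table[index] - 1)
        history = ((history << 1) | int(taken)) & history_mask
    return correct / len(outcomes)

# A loop-like branch: taken three times, then not taken, repeated.
pattern = [True, True, True, False] * 100

without_history = simulate(pattern, history_bits=0)
with_history = simulate(pattern, history_bits=4)
print(f"no history: {without_history:.3f}, 4-bit history: {with_history:.3f}")
```

With no history the predictor sees one undifferentiated branch site and keeps missing the loop exit. With four bits of history the same table can tell the fourth iteration apart from the first three, because the representation now carries the information the decision depends on.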
Bad representations break systems in characteristic ways.
Missing state. The representation omits information the right decision depends on. A robot whose state vector omits gripper contact force cannot learn a grasping policy that adjusts for slip. No amount of training recovers what the representation never carried.
Collapsed embeddings. Training produces an embedding where most inputs map to nearly the same point, so the space has lost its discriminative structure. This is common in contrastive learning when the loss lets the optimiser cheat by producing the same output whatever the input.
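Collapse is easy to miss in a loss curve and easy to see in the geometry. One simple diagnostic, offered as a heuristic sketch and not a standard metric, is the mean pairwise cosine similarity over a batch of embeddings. The two toy batches below are invented for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs in the batch."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical batches of 2-D embeddings.
healthy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 1.0], [1.01, 0.99], [0.99, 1.01], [1.0, 1.02]]

healthy_score = mean_pairwise_cosine(healthy)
collapsed_score = mean_pairwise_cosine(collapsed)
print(f"healthy: {healthy_score:.3f}, collapsed: {collapsed_score:.3f}")
```

A score near 1 means almost every input points the same way. What counts as "too high" depends on the model and data, so any threshold would be a judgment call.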
Spurious correlations. The representation learns features that correlate with the label in training but don't reflect underlying causes. Examples include CLIP latching onto text that appears in the image instead of the object depicted, and a pneumonia detector for chest X-rays that had in effect learned which hospital took the scan, because the share of positive cases differed between hospitals. The representation worked in training and failed in deployment because the feature it depended on was a correlation, not a cause.
Brittle feature spaces. The representation works on the training distribution but small input perturbations destroy it. Adversarial examples in vision are the cleanest demonstration: a small, carefully chosen perturbation, imperceptible to humans, pushes the representation into a region where the classifier reads "ostrich" instead of "cat".
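The mechanism can be shown on a deliberately tiny model. The sketch below uses a hypothetical linear classifier over 100 "pixels" and a sign-of-gradient step in the style of the fast gradient sign method. Real attacks target deep networks, and the weights, input, and step size here are made up for illustration.

```python
def score(weights, pixels):
    """Linear classifier score: positive means "cat", otherwise "ostrich"."""
    return sum(w * p for w, p in zip(weights, pixels))

def label(weights, pixels):
    return "cat" if score(weights, pixels) > 0 else "ostrich"

def sign(value):
    return (value > 0) - (value < 0)

# Hypothetical classifier weights and an input it labels "cat".
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
image = [0.5 + 0.02 * w for w in weights]      # pixel values on a 0..1 scale

# For a linear score the gradient with respect to the input is the weight
# vector, so stepping each pixel against sign(weight) lowers the score fastest.
epsilon = 0.03
adversarial = [p - epsilon * sign(w) for p, w in zip(image, weights)]

largest_change = max(abs(a - p) for a, p in zip(adversarial, image))
print(label(weights, image), "->", label(weights, adversarial),
      f"(max pixel change {largest_change:.2f})")
```

No pixel moves by more than 3% of its range, yet the label flips, because a hundred tiny changes all push the score the same way. Higher-dimensional inputs give the attacker more directions to add up.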
Figure 7.1 follows the transformation. The top panel shows raw input shrinking into progressively more useful internal forms as it moves through layers. The middle panel shows what a good embedding space looks like once it has been built: clusters and directions with operational meaning. The bottom panel shows the invariance test: different raw inputs that share underlying structure map to the same representation, so downstream processing sees a uniform thing where the raw signal varied.
In L1's loop, the representation is the internal state the system actually operates on. In L2's terms, it is the compressed encoding that prediction runs over. In L3's terms, the abstractions that survive distribution shift live inside representations. In L4's terms, emergent capabilities at scale are emergent representations. In L5's terms, the learning signal determines what the representation ends up encoding. In L6's terms, state representation quality determines policy quality.
The system never sees reality. It sees a transformed signal, and everything downstream operates on that signal. The transformation is doing a huge amount of the work.
Good representations preserve what matters, discard what doesn't, place similar things near each other geometrically, and map cleanly to the hardware that has to compute on them. Bad representations leave the optimiser fighting noise it cannot see past.
The mirror on the bench shows the system what it is allowed to see. Choose what to put in the mirror and you choose what the system can learn.
Lesson 8 sits at the spool of solder on the bench (station 8) and looks at the most consequential representation choice in modern AI: tokenisation, the quantising of messy continuous-feeling input into the discrete units everything downstream computes on.