PHASE 1 · FOUNDATIONS OF INTELLIGENCE

07 / 78

Representation and internal structure

Lesson 7. Phase 1: Foundations of intelligence. ~24 min read + cards + retrieval. Durability tier 1 (bedrock).

🪞

Memory palace · Bench · station 7

The mirror. The system never sees the world directly; it sees a transformed image of the world. The mirror is what the system is allowed to compute on.

Core idea. Optimisation never operates on reality directly; it operates on internal representations of reality, and representation quality determines what structure becomes easy or impossible for the system to learn.

Why this lesson exists

Most of this course has talked about what AI systems do. This lesson is about the form information takes inside them while they do it.

Nothing the system computes touches the world directly. Pixels become tensors. Text becomes integers, then float vectors. Joint angles become coordinate-frame embeddings. A board position becomes a 64-element tensor. What the system actually operates on is a transformed signal, never reality.

That transformation is the encoding. Everything downstream (the optimisation, the loss, the policy, the prediction) runs on the encoded representation, not on the world that produced it. Choose a bad representation and the system can burn unbounded compute without learning what you wanted. Choose a good one and the same problem becomes near-trivial for the same algorithm.

Raw inputs are usually poor

A megapixel image is 3M numbers. A 30-second audio clip at 48 kHz is 1.4M samples. A robot's continuous sensor stream is hundreds of values per millisecond. None of these are usable directly. They contain enormous amounts of variation irrelevant to whatever question the system is trying to answer.
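The arithmetic behind the first two figures, as a quick check. The three colour channels and the single audio channel are assumptions; the text gives only the totals.

```python
# Raw input sizes quoted above, recomputed.
# Assumptions: 3 colour channels (RGB) per pixel, one (mono) audio channel.
pixels = 1_000_000            # one megapixel
channels = 3                  # RGB
image_values = pixels * channels

sample_rate_hz = 48_000
duration_s = 30
audio_samples = sample_rate_hz * duration_s

print(f"image: {image_values:,} numbers")    # 3,000,000
print(f"audio: {audio_samples:,} samples")   # 1,440,000
```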

Take "is there a cat in this image". The raw pixels carry the answer, but they also carry lighting, pose, camera angle, focal length, background clutter, and JPEG compression artefacts. An optimiser fed raw pixels has to discover, through the data, that all of those vary independently of cat-presence. With enough labelled data and a flexible enough model, it can. Without them, it can't.

Now imagine the same problem against intermediate representations: lines, edges, textures, parts, animals. The "is there a cat" question becomes near-trivial against the right intermediate features. This is the idea behind layered representation in deep vision systems: each layer transforms the previous into something closer to what the question cares about.

The job of a representation

A good representation preserves the structure relevant to the task and discards everything else. The "everything else" is usually most of the raw signal.

JPEG is the clean analogy. A discrete cosine transform (DCT), a close relative of the Fourier transform, converts each 8×8 block of pixel values into frequency components. Most of an image's perceptually relevant information lives in the low-frequency components; high-frequency components carry detail the eye barely sees. JPEG keeps the low frequencies at fine precision and quantises the high frequencies harshly, so much of what doesn't matter for human perception rounds to zero and is discarded. The representation is far smaller and, for a human viewer, nearly as good.
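A minimal sketch of the idea on one 8×8 block, using a discrete cosine transform (the Fourier-family transform JPEG applies to blocks). The block contents and the "keep the 4×4 low-frequency corner" rule are illustrative assumptions; real JPEG uses quantisation tables and entropy coding rather than a hard cut-off.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: row k is the k-th cosine basis vector."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)        # the DC row has a different scale
    return m

n = 8
D = dct_matrix(n)

# Hypothetical 8x8 block: a smooth brightness gradient.
x, y = np.meshgrid(np.arange(n), np.arange(n))
block = 100.0 + 10.0 * x + 5.0 * y

coeffs = D @ block @ D.T              # 2-D DCT of the block

# Keep only the 4x4 lowest-frequency coefficients (16 of 64), zero the rest.
kept = np.zeros_like(coeffs)
kept[:4, :4] = coeffs[:4, :4]

restored = D.T @ kept @ D             # inverse 2-D DCT
max_error = np.abs(restored - block).max()
energy_kept = (kept ** 2).sum() / (coeffs ** 2).sum()
print(f"kept 16/64 coefficients, energy kept = {energy_kept:.5f}, "
      f"max pixel error = {max_error:.2f}")
```

On a smooth block like this one, a quarter of the coefficients carries almost all of the signal energy, which is the sense in which the frequency representation is "smaller and nearly as good".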

Same principle in language. A word as a one-hot vector (49,999 zeros and a single 1, for a 50,000-word vocabulary) carries no similarity information: every pair of distinct words is equally far apart, so nothing about the geometry tells you that "dog" and "cat" are closer than "dog" and "rectangle". A word embedding (say, a 300-dimensional dense vector) places semantically similar words at nearby points in space. Same word, different representation, completely different operational properties.
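The contrast can be made concrete with cosine similarity. The dense vectors below are hand-picked for illustration, not trained embeddings, and the three axes are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["dog", "cat", "rectangle"]

# One-hot: each word gets its own axis, so all distinct words are orthogonal.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical dense embeddings (assumption: hand-built, not learned).
# Axes, loosely: [animal-ness, furriness, geometric-ness].
dense = {
    "dog":       np.array([0.9, 0.8, 0.0]),
    "cat":       np.array([0.8, 0.9, 0.1]),
    "rectangle": np.array([0.0, 0.0, 1.0]),
}

for name, table in [("one-hot", one_hot), ("dense", dense)]:
    print(name,
          "dog~cat:", round(cosine(table["dog"], table["cat"]), 3),
          "dog~rectangle:", round(cosine(table["dog"], table["rectangle"]), 3))
```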

Invariance

Invariance is the technical term for the discarding part. A representation is invariant to some transformation when applying that transformation to the input doesn't change the representation.

CNN architectures bake in approximate translation invariance: shift a cat in the image and the network's later layers see almost the same activation pattern. Cats remain cats when they move. The invariance reduces the amount of training data needed; the network doesn't have to see every cat in every position.
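A one-dimensional sketch of where that invariance comes from: the sliding filter moves its response along with the input, and a global max pool then throws the position away. Wrap-around (circular) sliding is an assumption made here so the invariance is exact; real CNNs use padding and strides, which is why theirs is only approximate.

```python
import numpy as np

def circular_correlate(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the signal with wrap-around (a 1-D 'conv layer')."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

def feature(signal: np.ndarray, kernel: np.ndarray) -> float:
    """Global max pooling over the feature map: 'is the pattern anywhere?'"""
    return float(circular_correlate(signal, kernel).max())

kernel = np.array([1.0, 2.0, 1.0])           # hypothetical pattern detector
signal = np.zeros(16)
signal[3:6] = [1.0, 2.0, 1.0]                # the 'cat' at position 3
shifted = np.roll(signal, 7)                 # the same 'cat' at position 10

fmap = circular_correlate(signal, kernel)
fmap_shifted = circular_correlate(shifted, kernel)

# The feature map moves with the input; the pooled feature does not change.
print("feature maps equal?   ", np.array_equal(fmap, fmap_shifted))
print("pooled features equal?", feature(signal, kernel) == feature(shifted, kernel))
```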

Word embeddings carry semantic invariance: "purchase" and "buy" land at nearby points despite zero overlap in spelling. Robot pose representations expressed in body coordinates bake in invariance to robot location, so a policy learned in one room transfers to another. Audio systems use mel-spectrograms instead of raw waveforms partly because mel filtering bakes in approximate pitch invariance for speech: the same word spoken at different pitches looks similar in the representation and different in the raw signal.
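The body-coordinate case is the easiest of these to check by hand. A minimal 2-D sketch, with a hypothetical scene and a hypothetical helper `to_body_frame`: moving the whole scene to another place leaves the body-frame coordinates unchanged.

```python
import numpy as np

def to_body_frame(point_world, robot_xy, robot_heading):
    """Express a world-frame point relative to the robot's position and heading."""
    c, s = np.cos(robot_heading), np.sin(robot_heading)
    world_to_body = np.array([[c, s], [-s, c]])      # rotation by -heading
    return world_to_body @ (np.asarray(point_world) - np.asarray(robot_xy))

# Hypothetical scene: a cup 1 m ahead of a robot that faces along +y.
robot_xy, heading = np.array([2.0, 3.0]), np.pi / 2
cup_world = np.array([2.0, 4.0])

# The same scene in another room: everything offset by (10, -5).
offset = np.array([10.0, -5.0])

a = to_body_frame(cup_world, robot_xy, heading)
b = to_body_frame(cup_world + offset, robot_xy + offset, heading)
print(a, b, np.allclose(a, b))
```

In world coordinates the two scenes look different; in body coordinates the cup is "one metre ahead" in both, so a policy reading body-frame inputs cannot tell the rooms apart.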

Representation geometry

A good representation lives in a latent space where geometry has meaning. Nearby points should represent similar things. Directions should encode interpretable axes of variation. Clusters should group inputs that share operational structure.

The famous word2vec property (king − man + woman ≈ queen) is geometric. The embedding space had a direction encoding gender, another encoding royalty, and both came out as side effects of predicting nearby words. No one programmed those axes; they emerged from the structure of language and the prediction objective.
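The arithmetic is easy to see in a toy space. The vectors below are hand-built so that one axis is "gender" and the other "royalty"; this is an assumption for illustration, since learned word2vec vectors are high-dimensional and only approximately this tidy. Excluding the query words from the candidates follows the usual convention for analogy tests.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 is 'gender', axis 1 is 'royalty'.
emb = {
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "apple": np.array([ 0.0, -1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec: np.ndarray, exclude: set) -> str:
    """Word whose embedding has the highest cosine similarity to vec."""
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, emb[w]))

target = emb["king"] - emb["man"] + emb["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)   # queen
```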

Image embeddings show the same shape at scale. Train a contrastive vision-language model on hundreds of millions of image-text pairs and the resulting embeddings cluster by content: photos of dogs near other photos of dogs, photos of mountains near other photos of mountains. Distance in the embedding space becomes a usable proxy for semantic similarity. Nearest-neighbour search in a good embedding space is a usable retrieval mechanism. Linear classifiers on top of good embeddings often beat sophisticated classifiers on top of raw inputs. The geometry of the representation does most of the work.
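Nearest-neighbour retrieval over an embedding table reduces to one normalised matrix-vector product and a sort. A minimal sketch, with hand-written 2-D vectors standing in for real image embeddings (an assumption; real ones have hundreds of dimensions and come from a trained model):

```python
import numpy as np

def top_k(query: np.ndarray, items: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k items with highest cosine similarity to the query."""
    items_n = items / np.linalg.norm(items, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = items_n @ query_n            # one matmul scores the whole catalogue
    return np.argsort(-scores)[:k]

# Hypothetical embeddings: rows 0-2 stand for dog photos, rows 3-5 for mountains.
items = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # 'dog' cluster
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # 'mountain' cluster
])
query = np.array([0.95, 0.05])            # embedding of a new dog photo

hits = top_k(query, items, k=3)
print(hits)                               # indices of the three dog rows
```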

Compression and representation

L2 framed prediction and compression as two views of the same operation. Representation is the third view, looked at from inside the system.

A good representation is a compressed description of the input that preserves what matters for the task. The compression ratio is the ratio of raw input size to representation size. The information content of the representation is whatever survives the compression. Both quantities matter: too much compression and relevant structure is lost; too little and the optimiser is still drowning in irrelevant variation.

This is why representation learning became central to deep learning around 2012. CNNs trained end-to-end on ImageNet did not win because any single layer outperformed the handcrafted features that preceded them. They won because every layer's representation was learned jointly with the classifier, all the way up, so that by the last layer the final classification was nearly trivial.

Hardware interaction

Representations are tensors. Tensors are what GPUs accelerate. The connection between representation engineering and hardware engineering is closer than it looks.

Embedding tables are the simplest case. A vocabulary of 50K tokens times a 4096-dimensional embedding is roughly 200M parameters sitting in VRAM, with one row fetched for every input token on each forward pass. The memory bandwidth of fetching those rows shapes the inference budget. For very large embedding tables (recommender systems with billions of item embeddings), this becomes a primary cost driver.
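The arithmetic for the 50K × 4096 case, written out. The 2 bytes per parameter (fp16 or bf16) is an assumption; the text does not state a storage precision.

```python
# Embedding-table arithmetic for the figures above.
# Assumption: 2 bytes per parameter (fp16/bf16).
vocab_size = 50_000
embed_dim = 4_096
bytes_per_param = 2

params = vocab_size * embed_dim
table_bytes = params * bytes_per_param
row_bytes = embed_dim * bytes_per_param      # one row fetched per input token

print(f"parameters : {params:,}")                   # 204,800,000 (~200M)
print(f"table size : {table_bytes / 1e6:.1f} MB")   # 409.6 MB
print(f"per token  : {row_bytes:,} bytes")          # 8,192 bytes
```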

Dense representations map cleanly to matmul. Sparse representations do not. The reason modern systems use dense embeddings rather than one-hot or sparse feature vectors is mostly hardware: dense vectors are what tensor cores accelerate. The representational choice and the silicon constraint move together. Cache locality matters too. Sequential access through a representation hits cache predictably; random lookup through an unsorted table does not. Systems that scale well at inference arrange their representations so the access pattern is friendly to the memory hierarchy.
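One way to see the mismatch: multiplying one-hot rows by an embedding table gives exactly the same numbers as indexing the table, but almost every multiply is by zero. A small sketch with a random, hypothetical table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1_000, 16
table = rng.standard_normal((vocab_size, embed_dim))   # hypothetical embedding table

token_ids = np.array([3, 17, 3, 999])

# Sparse view: one-hot rows times the table.
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
via_matmul = one_hot @ table

# Dense view: index the rows directly. Same numbers, no wasted arithmetic.
via_lookup = table[token_ids]

useful = len(token_ids) * embed_dim                 # multiplies by the single 1
total = len(token_ids) * vocab_size * embed_dim     # multiplies actually performed
print(np.allclose(via_matmul, via_lookup))
print(f"useful multiplies in the one-hot matmul: {useful} of {total}")
```

After the lookup, everything downstream works on small dense vectors, which is the shape of work that matrix-multiply hardware is built for.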

A hardware echo: branch predictor history tables from L2 and L4 are themselves a representation. The CPU encodes "what happened recently at this branch site" as a small bit-pattern, hashes it into a table, and uses the result as a learned representation of program behaviour. Compilers do something similar with intermediate representations: source code is compressed into IR before optimisation, and the choice of IR shapes which optimisations the compiler can express.

Representation failure

Bad representations break systems in characteristic ways.

Missing state. The representation omits information the right decision depends on. A robot whose state vector omits gripper contact force cannot learn a grasping policy that adjusts for slip. No amount of training recovers what the representation never carried.

Collapsed embeddings. Training produces an embedding where most inputs map to similar points. The space lost its discriminative structure. Common in contrastive learning when the loss lets the optimiser cheat by ignoring inputs.

Spurious correlations. The representation learns features that correlate with the label in training but don't reflect underlying causes. CLIP latching onto text-in-image cues is one example. Another is pneumonia detection from chest X-rays that had quietly learned which hospital took the scan, because the rate of positive cases differed between hospitals. The representation worked in training and failed in deployment because the feature it depended on was a correlation, not a cause.

Brittle feature spaces. The representation works on the training distribution but small input perturbations destroy it. Adversarial examples in vision are the cleanest demonstration: a perturbation too small for humans to notice pushes the representation into a region where the classifier reads "ostrich" instead of "cat".
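The mechanism is easiest to see on a linear classifier, where a tiny per-pixel change adds up across many dimensions. The sketch below uses a sign-of-the-weights perturbation (the idea behind fast-gradient-sign attacks) on a hypothetical random classifier; it is an illustration of the effect, not a model of any particular vision network.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000                                   # number of input 'pixels'
w = rng.choice([-1.0, 1.0], size=dim)          # hypothetical linear classifier weights

def score(x: np.ndarray) -> float:
    """Positive score reads as 'cat', negative as 'not cat'."""
    return float(w @ x)

# An input the classifier labels 'cat' with a comfortable margin.
x = rng.uniform(0.0, 1.0, size=dim)
x += (50.0 - score(x)) * w / dim               # shift so that score(x) is 50

# Move every pixel by at most eps in the direction that lowers the score.
eps = 0.01
x_adv = x - eps * np.sign(w)

print(f"clean score      : {score(x):8.2f}")
print(f"adversarial score: {score(x_adv):8.2f}")
print(f"largest per-pixel change: {np.abs(x_adv - x).max():.3f}")
```

Each pixel moves by 0.01, yet the score moves by 0.01 × 10,000 = 100, enough to flip the label. The higher the input dimension, the smaller the per-pixel change needed.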

The raw signal becomes the mirror

Figure 7.1 follows the transformation. The top panel shows raw input shrinking into progressively more useful internal forms as it moves through layers. The middle panel shows what a good embedding space looks like once it has been built: clusters and directions with operational meaning. The bottom panel shows the invariance test: different raw inputs that share underlying structure map to the same representation, so downstream processing sees a uniform thing where the raw signal varied.

FIG 7.1. Three views of representation. Top: raw input shrinks through stages of feature extraction; dimensionality drops by orders of magnitude as irrelevant variation is discarded. Middle: a well-shaped embedding space has clusters with operational meaning and directions that encode interpretable axes. Bottom: invariance turns multiple raw inputs that share underlying structure into the same internal representation, so downstream processing sees a uniform thing.

The L1 to L6 view

In L1's loop, the representation is the internal state the system actually operates on. In L2's terms, it is the compressed encoding that prediction runs over. In L3's terms, the abstractions that survive distribution shift live inside representations. In L4's terms, emergent capabilities at scale are emergent representations. In L5's terms, the learning signal determines what the representation ends up encoding. In L6's terms, state representation quality determines policy quality.

The takeaway

The system never sees reality. It sees a transformed signal, and everything downstream operates on that signal. The transformation is doing a huge amount of the work.

Good representations preserve what matters, discard what doesn't, place similar things near each other geometrically, and map cleanly to the hardware that has to compute on them. Bad representations leave the optimiser fighting noise it cannot see past.

The mirror on the bench shows the system what it is allowed to see. Choose what to put in the mirror and you choose what the system can learn.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L7 A team wants to classify whether a photo contains a person. They try 3 representations: (a) raw RGB pixels fed to a logistic-regression classifier, (b) handcrafted features (HOG, edge histograms, colour statistics), (c) deep features from a pretrained vision model's penultimate layer. For each, describe the optimisation difficulty, the amount of training data needed, and one failure mode you'd expect.

(a) Raw RGB pixels. Optimisation difficulty: extreme. The classifier has to discover, through the data, that translation, lighting, scale, pose, and background clutter vary independently of person-presence. A linear classifier on raw pixels cannot represent the non-linear relationship between pixel values and the concept "person", so no setting of its weights solves the task well. Data needed: millions of labelled examples, and the result is still brittle. Failure mode: any input outside the training distribution (different lighting, unusual pose) fails because the representation never abstracted away the nuisance variables.

(b) Handcrafted features. Optimisation difficulty: moderate. HOG and similar features already discard a lot of pixel-level variation and bake in some translation and lighting invariance. A linear classifier can do reasonably well. Data needed: tens of thousands, not millions. Failure mode: the features were designed for canonical pose and scale; unusual poses, partial occlusion, or low-resolution inputs break them because the engineers picked the wrong invariances for the long tail.

(c) Deep features. Optimisation difficulty: trivial. The pretrained model already learned hierarchical features that abstract pose, scale, lighting, and clutter; a linear classifier on top only has to find the "person" direction in feature space. Data needed: hundreds to a few thousand. Failure mode: distribution shift from the pretraining corpus (medical images, satellite imagery, niche industrial settings) where the learned features don't carry the right structure.

The lesson: representation quality dominates classifier quality. Optimisation difficulty is mostly representation quality in disguise.

L7 An embedding system for similar-product recommendation is trained and the team notices that all products are mapping to a narrow region of the embedding space (cosine similarity is near-uniform 0.9 across pairs of arbitrary products). What 3 mechanisms could have caused this, what does it mean operationally, and what would you check first?

Three plausible causes.

(1) Loss-function cheat. A contrastive setup where the loss can be minimised by mapping everything to one direction. If the positive-pair distance is penalised but the negative-pair term is weak or absent, the optimiser finds a degenerate solution where all embeddings are similar. The optimiser is doing exactly what the loss asked; the loss is mis-specified.

(2) Insufficient representational capacity. The embedding dimension is too small for the number of distinct products. With only a few free dimensions and many products, the optimiser cannot place them far apart and the embeddings collapse toward their mean.

(3) Strong central tendency in the training data. Most products share a few features (same category, same colour palette, same brand voice) and those features dominate the embedding because they explain most of the variance. The model learned a representation of "average product" with small deviations.

Operationally, this means nearest-neighbour retrieval will be useless: any query returns essentially the whole catalogue. Downstream recommender metrics collapse to baseline.

First checks: (i) inspect the loss definition and its logged terms, and confirm whether negative-pair distances are actually being penalised; (ii) histogram pairwise cosine similarities on a held-out set and look for a sharp peak near 1; (iii) plot variance per embedding dimension and look for collapsed dimensions. The fix follows from the cause.
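Checks (ii) and (iii) take only a few lines. A minimal sketch on synthetic data: the "healthy" and "collapsed" matrices are generated here for illustration, and `collapse_report` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def collapse_report(emb: np.ndarray) -> dict:
    """Summarise pairwise cosine similarity and per-dimension variance."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diag = sims[~np.eye(len(emb), dtype=bool)]    # drop self-similarity
    return {
        "mean_cosine": float(off_diag.mean()),
        "min_cosine": float(off_diag.min()),
        "variance_per_dim": emb.var(axis=0),
    }

rng = np.random.default_rng(0)
n_items, dim = 500, 32

# Hypothetical healthy embeddings: directions spread over the space.
healthy = rng.standard_normal((n_items, dim))

# Hypothetical collapsed embeddings: one shared direction plus small deviations.
shared = rng.standard_normal(dim)
collapsed = shared + 0.1 * rng.standard_normal((n_items, dim))

reports = {"healthy": collapse_report(healthy), "collapsed": collapse_report(collapsed)}
for name, r in reports.items():
    print(f"{name:9s} mean cosine = {r['mean_cosine']:.3f}, "
          f"min cosine = {r['min_cosine']:.3f}, "
          f"mean variance per dim = {r['variance_per_dim'].mean():.3f}")
```

A mean off-diagonal cosine close to 1, together with tiny per-dimension variance, is the signature described in the question; which of the three causes produced it still has to be settled by looking at the loss and the data.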

↳ L8 (Forward interleave to L8, tokens.) Tokens are the first representation an LLM sees of text. Why does the choice of tokeniser matter, in terms of what the model can or cannot represent? What would change in a model's learning behaviour if you tokenised words as full units versus characters?

Tokenisation is the first lossy compression step in the LLM pipeline. The choice of tokens determines what the model can express in one step of internal computation, how long sequences become, and what kinds of patterns the model finds easy or hard to learn.

Full-word tokenisation: the vocabulary is large (hundreds of thousands of distinct items needed to cover a real corpus), each token carries a lot of meaning, sequences are short. The model gets a lot of semantic information per position, but rare words and morphological variants (plurals, conjugations, novel compounds) either need their own vocabulary entry or fall out of the vocabulary entirely and collapse into a single unknown token. Generalisation to unseen words is poor.

Character tokenisation: vocabulary is tiny (a few hundred at most for most scripts), each token carries little meaning, sequences are 5-10× longer than word-tokenised equivalents. The model has to assemble higher-level structure out of character sequences inside its layers, which costs depth and compute. Generalisation to novel words is good, but the model spends capacity on spelling regularities.

Modern systems pick byte-pair encoding or a similar middle ground: common words remain whole, rare words decompose into smaller units, vocabulary is bounded (typically 32K-200K). This is a representation choice that shapes everything downstream: what the model can express in one attention pass, how long the context window has to be in tokens to cover a given document, where the model excels (common-language fluency) and where it stumbles (multilingual coverage, code, rare names). L8 takes this up directly.
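The two extremes can be compared on a toy example. The corpus and test sentence are hypothetical, and whitespace splitting stands in for a real word tokeniser.

```python
# Hypothetical tiny corpus; real vocabularies are built from far larger text.
train_text = "the cat sat on the mat the dog sat on the log"
test_text = "the cats sat on the logs"

word_vocab = set(train_text.split())
char_vocab = set(train_text)

word_tokens = test_text.split()
char_tokens = list(test_text)

# Word-level: unseen word forms fall out of the vocabulary.
word_unknown = [t for t in word_tokens if t not in word_vocab]
# Character-level: every character was seen, but the sequence is much longer.
char_unknown = [c for c in char_tokens if c not in char_vocab]

print(f"word-level: {len(word_tokens)} tokens, unknown = {word_unknown}")
print(f"char-level: {len(char_tokens)} tokens, unknown = {char_unknown}")
```

The word-level view is short but loses "cats" and "logs"; the character-level view loses nothing but is four times as long here, and the model must rebuild the words itself.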

Next station

Lesson 8 sits at the spool of solder on the bench (station 8) and looks at the most consequential representation choice in modern AI: tokenisation, the quantising of messy continuous-feeling input into the discrete units everything downstream computes on.
