PHASE 2 · THE WHITEBOARD WALL

Dimensions, feature spaces, and representation capacity

Lesson 13. Third station on the whiteboard wall. ~26 min read + cards + retrieval. Durability tier 1 (bedrock; the lesson where representation gains capacity).

Memory palace · Whiteboard wall · station 13

The layered grids. The map from L12 gets a transparent overlay, then another, then another. Each overlay adds a new axis of distinction. The wall is gaining depth. You can almost see through it.

Core idea. A dimension is a representational degree of freedom: one independent way the system can distinguish things. More dimensions means more distinctions it can hold at once. The most important move modern AI made was learning to build its own dimensions from data, instead of relying on the ones engineers hand-designed.

Why this lesson exists

L11 gave you the arrow. L12 gave you the geometry between arrows. Both used 2D and 3D pictures because they had to. Real AI representation spaces are 768, 1024, 4096, sometimes 12288 dimensions wide. That choice isn't arbitrary, and it isn't decoration. Each dimension is a slot the model can use to encode a distinction. Spaces are large because the work the model is doing requires that many slots.

This lesson is where the gap between a picture you can draw and a space the model actually uses stops being mysterious. Once you see dimensions as degrees of freedom, the embedding widths in modern papers start reading as engineering decisions, not magic numbers.

Start with one number

You can describe a resistor with one number: its value. 10 kΩ. That places it on a 1D line. Useful, but not enough for design work.

Add tolerance: 10 kΩ ± 1%. Now each resistor lives at a point in a 2D plane (value, tolerance). A 1% part and a 5% part at the same nominal value sit in different places.

Add package size: 0402, 0603, 0805. Now it's 3D: (value, tolerance, package). Three numbers, three independent axes, one point per part.

Keep going. Power rating, temperature coefficient, voltage rating, end-of-life drift, MTBF, source country, MSL rating, cost per piece in a reel. Each new property is a new axis the part lives along. Each axis lets you make a distinction you couldn't make before.

That's the whole picture. A dimension is a thing about an item that can vary independently of the other things you already track. The space your items live in is the cross product of all those axes. Engineers call this a feature space. It's the same idea as the coordinate space from L11, with one new emphasis: the axes mean something.

FIG 13.1. Adding a dimension means adding a property that can vary independently. The same nominal-value resistors that collapse to one point in 1D pull apart once tolerance is admitted. They pull apart further once package is admitted. Each dimension is a representational degree of freedom.
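
The resistor walk-through can be sketched in a few lines. This is a minimal illustration, assuming four made-up parts (the `parts` list and the `distinct_points` helper are hypothetical, not from any catalogue): it counts how many distinct points the same parts occupy as each axis is admitted.

```python
# Hypothetical parts: (value in ohms, tolerance in percent, package code).
parts = [
    (10_000, 1.0, "0402"),
    (10_000, 5.0, "0402"),
    (10_000, 1.0, "0603"),
    (10_000, 5.0, "0805"),
]


def distinct_points(items, n_axes):
    """Count the distinct points the items occupy using only the first n_axes."""
    return len({item[:n_axes] for item in items})


for n_axes, label in [
    (1, "value"),
    (2, "value, tolerance"),
    (3, "value, tolerance, package"),
]:
    count = distinct_points(parts, n_axes)
    print(f"{n_axes}D ({label}): {count} distinct points")
```

With only the value axis, all four parts collapse to one point. Each added axis pulls more of them apart.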

The word that matters: independent

Not every new axis adds capacity. If you add a "resistance in ohms" axis and a "resistance in kilohms" axis, you haven't gained anything. The second axis is a constant multiple of the first; every part lies on the same diagonal, and the space is really still 1D pretending to be 2D.

Dimensions count in full only when they're independent: when knowing the value along one axis tells you essentially nothing about the value along another. Two perfectly correlated axes collapse to one. Two perfectly orthogonal axes give you a genuine plane to spread out in.

Real features sit between those extremes. "Package size" and "power rating" are correlated (bigger packages dissipate more power), but they're not identical. The independent information in package-size, given that you already know power-rating, is still meaningful, just less than a fully orthogonal axis would offer. Engineers often call this the effective dimensionality of a representation: how many independent directions of variation actually exist, regardless of how many nominal axes you wrote down.

FIG 13.2. The left plot has two axes but only one direction of real variation. The right plot has two axes and uses both. Representation capacity counts independent directions, not nominal axes. This distinction matters a lot once spaces get high-dimensional.
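
One way to put a number on effective dimensionality for two axes is the participation ratio of the correlation matrix's eigenvalues. That measure is an assumption chosen for this sketch, one of several in use, and the data below is made up. For two standardised axes with Pearson correlation r the eigenvalues are 1 + r and 1 − r, so the ratio simplifies to 2 / (1 + r²).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def effective_dims_2d(xs, ys):
    """Participation ratio for two standardised axes: 2 / (1 + r^2)."""
    r = pearson_r(xs, ys)
    return 2.0 / (1.0 + r * r)


# Hypothetical data. Ohms vs kilohms is the same information twice.
ohms = [1_000.0, 4_700.0, 10_000.0, 47_000.0]
kilohms = [v / 1_000.0 for v in ohms]

# Value vs tolerance, constructed here to be exactly uncorrelated.
values = [1_000.0, 1_000.0, 10_000.0, 10_000.0]
tolerances = [1.0, 5.0, 1.0, 5.0]

redundant = effective_dims_2d(ohms, kilohms)
independent = effective_dims_2d(values, tolerances)
print(f"ohms vs kilohms:    {redundant:.3f} effective dimensions")
print(f"value vs tolerance: {independent:.3f} effective dimensions")
```

Partly correlated pairs such as package size and power rating would land somewhere between 1 and 2.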

Feature spaces, hand-designed

For a long time, the dimensions in a representation came from human engineering. You sat down, decided what properties mattered for the task, and computed each of them as a feature.

Image recognition used HOG (gradient histograms in patches), SIFT (keypoint descriptors), GIST (rough scene shape). Speech recognition used MFCCs (cepstral coefficients from short windows of audio). Text classification used TF-IDF (term frequencies weighted by rarity). Each was a deliberately chosen list of numbers per input. The dimensions had names, and the names came from somebody's idea about what mattered.
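
To make "a deliberately chosen list of numbers per input" concrete, here is a minimal TF-IDF sketch. It assumes one common variant (term count divided by document length, times log of N over document frequency); real libraries differ in smoothing and normalisation, and the three documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the resistor value is ten kilohm",
    "the capacitor value is ten microfarad",
    "the resistor tolerance is one percent",
]


def tf_idf(documents):
    """Return (vocab, vectors): one named dimension per vocabulary term."""
    tokenised = [d.split() for d in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenised for term in set(toks))
    vocab = sorted(df)
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append(
            [(counts[t] / len(toks)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


vocab, vectors = tf_idf(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:12s} {weight:.3f}")
```

Every axis has a name (a word), and a word that appears in every document, like "the", gets weight zero because it distinguishes nothing.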

This approach scales as far as human intuition does. It works fine for well-understood, narrow problems. It breaks down on hard, open-ended problems, because the features that actually matter are often things humans can't name. A face is recognisable from features no list of measurements captures cleanly. A sentence's tone lives in interactions between words that resist enumeration.

That's the bottleneck the field hit by the early 2010s. Hand-designed features were a ceiling, not a floor.

Learned features, the move that changed everything

Modern AI replaced "engineer designs the features" with "the model learns them". Gradient descent, given enough capacity and data, finds dimensions that minimise the loss. The dimensions don't have to correspond to anything a human would name. They just have to carry the information the loss rewards.

A trained image model has internal dimensions that activate for "rounded edge here", "long horizontal line near top", "skin-tone region", "high-frequency repeating texture". Nobody told the model to track those things. Gradient descent invented them because they make the loss go down. Each is a learned representational axis.

The same move shows up everywhere. A language model's hidden states encode dimensions for "currently inside a quotation", "this sentence is a question", "the subject is plural", "we're in a formal register". A protein-folding model has dimensions for structural motifs. A recommender's user embeddings have dimensions that loosely correspond to taste clusters, none of which were specified up front.

This is the move worth holding onto. The model isn't filling in a coordinate system you handed it. It's building its own. The width of the embedding sets how many independent axes it has room to build.

mechanism · learned features in one sentence

Each hidden dimension is a slot the optimiser can use to track some feature of the input. Which feature it tracks is whatever turns out to lower the loss most. Across millions of training steps, the slots get assigned to the features that pay off most. The model is, mechanically, a feature-discovery system as much as it is a prediction system.
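
A toy version of "the optimiser assigns weight to what pays off" can be sketched with plain gradient descent on a two-input linear model. Everything here is an assumption for illustration: the four inputs, the target rule (only input 0 matters), the learning rate, and the step count. It is far simpler than a real network, but the mechanism is the same: weight flows to the input that lowers the loss.

```python
# Four inputs with two candidate features each. Only feature 0 drives the target.
inputs = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
targets = [3.0 * x0 for x0, _ in inputs]


def train(xs, ys, lr=0.1, steps=200):
    """Gradient descent on mean squared error for y ~ w0*x0 + w1*x1."""
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in zip(xs, ys):
            err = w[0] * x[0] + w[1] * x[1] - y
            grad[0] += 2.0 * err * x[0] / n
            grad[1] += 2.0 * err * x[1] / n
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w


weights = train(inputs, targets)
print(f"weight on useful feature:     {weights[0]:.4f}")
print(f"weight on irrelevant feature: {weights[1]:.4f}")
```

Nobody told the model which input mattered. The loss did.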

One honest wrinkle, worth holding alongside the slot picture. Real models often pack more features than they have nominal dimensions, by encoding several at once into overlapping linear combinations of the same axes. Interpretability research calls this superposition; it's why a single hidden unit can look like it tracks several unrelated things, and why "one dimension, one clean feature" is an idealisation more than a literal description. The slot intuition still holds at the level of representation capacity. It just gets sharper once you know the slots can share occupants.
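
The geometric premise behind superposition can be checked with a rough sketch: a space can hold more directions than it has axes, as long as you tolerate a little overlap between them. The code below uses random directions as a stand-in (an assumption; real models learn their feature directions, and this is not a model of how they do it) and measures the average overlap between 128 directions packed into 64 dimensions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniformly random direction: normalise a vector of Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def mean_abs_cosine(vectors):
    """Average |cosine| over all pairs: a crude interference measure."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n_features = 64, 128  # twice as many directions as axes
feature_directions = [random_unit_vector(dim, rng) for _ in range(n_features)]
interference = mean_abs_cosine(feature_directions)
print(f"{n_features} directions in {dim} dims, mean |cos| = {interference:.3f}")
```

The overlap is small but not zero, which is the trade superposition makes: more features than axes, paid for in mild interference between them.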

FIG 13.6. Each hidden layer is a learned coordinate system. Early layers' dimensions tend to track local features (edges, gradients); middle layers' dimensions encode parts and textures; deep layers' dimensions hold abstract, task-relevant structure that the classifier can separate cleanly. Nobody specified what each dimension should mean; gradient descent assigned each slot to whatever helped the loss go down.

Separability: why more dimensions help

The most concrete reason high-dimensional spaces matter is separability. Classes that look hopelessly tangled in low dimensions often become cleanly separable when you add the right extra dimensions.

Take the classic exclusive-or problem. Four points: (0,0) and (1,1) are class A; (0,1) and (1,0) are class B. In 2D, no straight line separates them. You can curve through them, but any flat boundary fails. The classes are not linearly separable in 2D.

Now add a third dimension: z = x · y. Class A becomes (0,0,0) and (1,1,1); class B becomes (0,1,0) and (1,0,0). Now the plane x + y − 2z = 0.5 separates them cleanly: both class A points give x + y − 2z = 0, and both class B points give 1. A 2D problem you couldn't solve linearly became a 3D problem you could solve trivially.

The trick generalises. Lifting data into a richer space changes what counts as a "simple" boundary. The whole reason a neural network can carve very complex decision regions in input space is that it implicitly lifts the input into a high-dimensional intermediate space where simple linear separations do the work, then projects back.

FIG 13.3. The XOR problem has no linear solution in 2D. Lift it to 3D using the right extra feature (here z = x·y) and a flat plane separates the classes cleanly. Neural networks do this for a living: they lift inputs into rich learned spaces where the hard separation becomes easy.
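
The XOR lift is small enough to verify directly. The sketch below does two things: it searches a coarse grid of lines a·x + b·y + c = 0 and finds none that separates the classes in 2D, then lifts each point with z = x·y and checks one plane that does separate them, x + y − 2z = 0.5. The grid range and helper names are choices made for this example.

```python
points = {(0, 0): "A", (1, 1): "A", (0, 1): "B", (1, 0): "B"}


def separates_in_2d(a, b, c):
    """True if the line a*x + b*y + c = 0 puts A and B on opposite sides."""
    sides = {"A": set(), "B": set()}
    for (x, y), label in points.items():
        sides[label].add(a * x + b * y + c > 0)
    one_side_each = len(sides["A"]) == 1 and len(sides["B"]) == 1
    return one_side_each and sides["A"] != sides["B"]


# Coarse search over line coefficients in [-4, 4] in steps of 0.5.
grid = [k / 2 for k in range(-8, 9)]
any_line_works = any(
    separates_in_2d(a, b, c) for a in grid for b in grid for c in grid
)


def lift(x, y):
    """Add the product feature as a third coordinate."""
    return (x, y, x * y)


def classify_lifted(p):
    """Side of the plane x + y - 2z = 0.5 in the lifted space."""
    x, y, z = p
    return "B" if x + y - 2 * z > 0.5 else "A"


predicted = {xy: classify_lifted(lift(*xy)) for xy in points}
print("some 2D line separates the classes:", any_line_works)
print("plane separates the lifted classes:", predicted == points)
```

The grid search is only a spot check, not a proof, but the result matches the known fact that XOR has no linear separator in 2D.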

checkpoint · spot the principle

The "extra dimension" in FIG 13.3 was hand-designed (we picked z = x·y because we knew the structure of XOR). What does a neural network do that's analogous, when it solves a non-linearly-separable problem in input space?

It learns the lift. A deep network's hidden layers transform the input into a sequence of intermediate representations. Each layer is, roughly, "apply a linear map, then a nonlinear function". The composition of all these layers is the lift into a high-dimensional space where the task becomes linearly separable at the final classifier. Nobody designed that representation; gradient descent built it by minimising the loss. The hidden dimensions of every layer are slots the optimiser uses to construct the lift.
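
Here is what "apply a linear map, then a nonlinear function" looks like on XOR. The weights below are hand-set for illustration, not learned; a trained network would have to find some equivalent configuration through gradient descent. The hidden layer is the lift, and the output layer is a plain linear read-out on top of it.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, c) for c in v]


def linear(weights, bias, v):
    """One linear layer: a matrix-vector product plus a bias vector."""
    return [sum(w * c for w, c in zip(row, v)) + b for row, b in zip(weights, bias)]


# Hand-set (not learned) weights for a 2 -> 2 -> 1 network.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
b2 = [0.0]


def forward(x, y):
    hidden = relu(linear(W1, b1, [x, y]))  # the lift into hidden space
    return linear(W2, b2, hidden)[0]       # linear read-out


for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"({x}, {y}) -> {forward(x, y)}")
```

In hidden space the four inputs land at (0,0), (1,0), (1,0) and (2,1), where a single linear read-out is enough to tell the classes apart.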

Capacity: how much a space can hold

How many distinct "things" can a space of dimension d represent? Roughly, an enormous amount. Even with just 32 binary dimensions, you've got 2³² ≈ 4.3 billion possible vectors. Real embeddings use 768 or more continuous dimensions, and the number of meaningfully different points in that space is vast beyond any number that's useful to write down.

But raw capacity isn't the operationally interesting number. The interesting number is how many usefully different distinctions the trained representation can carry. That depends on how many independent directions the training actually shaped. A 768-dim embedding where 50 dimensions do most of the work has effective capacity around 50, not 768.

This is why model designers care about embedding width as a hyperparameter. Too narrow and the model can't hold enough features to do the task well; subtle distinctions collapse and representation capacity is capped. Too wide and most of the dimensions are underused, parameters are wasted, and compute and memory bills climb without payoff. The sweet spot is whatever width is just enough for the task at the given training scale.

This connects directly back to L4's emergence story. Some capabilities arrive sharply at scale because they need a minimum number of dimensions to encode, and below that width the model literally can't represent them. Once the width is enough, gradient descent finds the configuration and the capability switches on.

The catch: high-dimensional spaces behave strangely

You should know about three counter-intuitions before you start trusting your 3D intuition in 1000 dimensions. None require formal derivation; each is worth recognising.

Random vectors are nearly orthogonal. In 2D, two random arrows have a 25% chance of being within 45° of each other, because the angle between them is spread evenly from 0° to 180°. In 1000D, the angle between any two random unit vectors is almost always close to 90°. There's so much "room to be different" that random things spread out into nearly-perpendicular directions. This is part of why high-D spaces have so much capacity, and part of why nearest-neighbour search gets harder.

Distances concentrate. In high dimensions, the distance from any one random point to most other random points tends to be close to a fixed value. The "nearest" and "farthest" points are barely distinguishable by distance alone. Learned embeddings beat this by being far from random: training pulls related items closer than random would predict. But it's a real effect, and it's part of why ANN indexes have to work harder than brute force suggests.

Volume sits near the surface. Most of the volume of a high-dimensional ball is concentrated near its surface, not its centre. Volume scales as radius to the power d, so the inner ball of radius 1 − ε holds only a fraction (1 − ε)^d of a unit ball's volume, and that fraction shrinks towards zero as d grows. Practical consequence: a ball of "good" representations in a high-D space holds essentially all its volume in a thin outer shell, which is why sampling and uniform priors have to be designed carefully.

These are sometimes packaged as "the curse of dimensionality". It isn't really a curse; it's a property. The same expansiveness that gives high-dim spaces their capacity also makes them statistically and computationally awkward. Most of modern representation learning is, in part, about taming this awkwardness with structure.
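
The volume claim is one line of arithmetic. Since a ball's volume scales as radius to the power d, the fraction of a unit ball lying within a thin outer shell is 1 − (1 − thickness)^d. The sketch below evaluates that for a shell 1% of the radius thick; the thickness and the list of dimensions are arbitrary choices for illustration.

```python
def shell_fraction(dim, shell_thickness):
    """Fraction of a unit ball's volume within shell_thickness of its surface."""
    return 1.0 - (1.0 - shell_thickness) ** dim


for dim in (2, 3, 100, 1000):
    share = shell_fraction(dim, 0.01)
    print(f"d = {dim:4d}: outer 1% shell holds {share:.4%} of the volume")
```

In 2D the outer 1% shell holds about 2% of the area. In 1000D the same shell holds nearly everything.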

FIG 13.4. In low dimensions, random vectors point in many different directions and pairwise distances vary widely. As dimension grows, both distributions concentrate: almost every pair of random vectors is nearly perpendicular, and almost every pair of random points is at nearly the same distance. Learned representations push against this by adding structure, but the effect shapes how every high-D system has to be designed.
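
Both concentration effects are easy to see in a small simulation. This is a rough sketch with assumed settings (Gaussian random vectors, 100 samples, a fixed seed), not a benchmark: it compares the spread of pairwise angles, and the relative gap between nearest and farthest neighbour, in 2 dimensions versus 1000.

```python
import math
import random


def random_gaussian_vector(dim, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def angle_degrees(u, v):
    """Angle between two vectors, clamped against rounding error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def angle_stats(dim, n_pairs, rng):
    """Mean and standard deviation of the angle between random vector pairs."""
    angles = [
        angle_degrees(random_gaussian_vector(dim, rng), random_gaussian_vector(dim, rng))
        for _ in range(n_pairs)
    ]
    mean = sum(angles) / n_pairs
    std = math.sqrt(sum((a - mean) ** 2 for a in angles) / n_pairs)
    return mean, std


def distance_contrast(dim, n_points, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = random_gaussian_vector(dim, rng)
    distances = [
        math.dist(query, random_gaussian_vector(dim, rng)) for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(13)  # fixed seed so the run is repeatable
results = {}
for dim in (2, 1000):
    mean_angle, std_angle = angle_stats(dim, 100, rng)
    contrast = distance_contrast(dim, 100, rng)
    results[dim] = (mean_angle, std_angle, contrast)
    print(
        f"d = {dim:4d}: angle {mean_angle:6.1f}° ± {std_angle:5.1f}°, "
        f"distance contrast {contrast:.2f}"
    )
```

In 2D the angles are spread widely and the farthest point is many times farther than the nearest. In 1000D the angles hug 90° and the contrast shrinks to a small fraction.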

Capacity costs something

Each new dimension costs memory, compute, and bandwidth. The trade is mechanical and shows up everywhere in production AI.

Storage of an embedding vector scales linearly with width. A 768-dim fp16 vector is about 1.5 kB (1,536 bytes); a 4096-dim fp16 vector is about 8 kB (8,192 bytes). A vector database of 100 million 4096-dim vectors needs roughly 820 GB in fp16, before any index overhead.
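
The storage arithmetic, spelled out as a sketch. The function names are hypothetical; the only inputs are the widths and counts from the paragraph above and the fact that fp16 uses 2 bytes per value.

```python
BYTES_PER_FP16 = 2


def vector_bytes(dim, bytes_per_value=BYTES_PER_FP16):
    """Raw size of one embedding vector in bytes."""
    return dim * bytes_per_value


def collection_bytes(n_vectors, dim, bytes_per_value=BYTES_PER_FP16):
    """Raw size of a whole collection, before any index overhead."""
    return n_vectors * vector_bytes(dim, bytes_per_value)


print(f"768-dim fp16 vector:  {vector_bytes(768):,} bytes")
print(f"4096-dim fp16 vector: {vector_bytes(4096):,} bytes")
total = collection_bytes(100_000_000, 4096)
print(f"100M x 4096-dim fp16: {total / 1e9:.1f} GB")
```

Halving the width, or dropping from fp16 to int8, halves every one of these numbers.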

The matrix multiplications inside the model scale with the square of the hidden width. A linear layer mapping a width-d vector to a width-d vector takes O(d²) multiply-adds per token. Double the width and that piece of the model uses 4× the compute. Across many layers, those scaling factors stack.
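
The quadratic scaling falls straight out of counting. A dense layer from width d_in to width d_out does one multiply-add per weight, so a sketch of the cost is just the product of the two widths (the widths below are example values, and biases are ignored).

```python
def linear_layer_macs(d_in, d_out):
    """Multiply-adds per token for one dense layer (bias ignored)."""
    return d_in * d_out


base = linear_layer_macs(1024, 1024)
doubled = linear_layer_macs(2048, 2048)
print(f"width 1024: {base:,} multiply-adds per token")
print(f"width 2048: {doubled:,} multiply-adds per token")
print(f"ratio: {doubled / base:.0f}x")
```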

The KV cache in a transformer scales with hidden width × context length × number of layers. At long context and large width, the cache alone can dwarf the model weights, which is one of the structural reasons modern attention variants (grouped-query, sliding-window, latent attention) exist. The architecture is being bent by the constraint.
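
A back-of-envelope KV cache estimate, as a sketch. It assumes standard multi-head attention where keys and values are each cached at full hidden width for every layer and every token, for a single sequence. The variants named above exist to shrink exactly this number, so they would come in lower. The model shape below is hypothetical.

```python
def kv_cache_bytes(n_layers, context_len, hidden_width, bytes_per_value=2):
    """KV cache size for one sequence under plain multi-head attention.

    Two tensors (keys and values) per layer, each context_len x hidden_width.
    """
    return 2 * n_layers * context_len * hidden_width * bytes_per_value


# Hypothetical model shape: 32 layers, width 4096, 32k context, fp16.
size = kv_cache_bytes(n_layers=32, context_len=32_768, hidden_width=4096)
print(f"KV cache: {size / 2**30:.0f} GiB per sequence")

# Doubling the width doubles the cache: linear, not quadratic.
wider = kv_cache_bytes(n_layers=32, context_len=32_768, hidden_width=8192)
print(f"at width 8192: {wider / 2**30:.0f} GiB per sequence")
```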

So the engineering question is rarely "how high-dimensional should we go" in isolation. It's "what's the smallest dimension that still gives us the representational capacity the task needs, given the compute and memory budget we have to live inside". That trade is the whole reason the compute spectrum looks the way it does.

FIG 13.5. The three cost axes don't grow at the same rate. Storage and KV cache grow linearly with hidden width. Matmul compute grows as the square. The sweet spot is the smallest width that gives the task the representational capacity it needs, because going wider is paid for in steeply rising compute.

Latent space: the model's working coordinate system

"Latent space" is the term you'll hear for the high-dimensional space the model's intermediate activations live in. The input gets transformed through layers of computation; at each layer, the data is now represented in a different learned coordinate system. The deepest layer's representation, just before the output head, is usually called the latent space, but every layer has its own.

This is where the model does its actual reasoning. The input layer holds raw tokens or pixels. The output layer holds task-specific decisions. Everything in between is the model navigating a learned high-dim space where the task gets solved.

The reason it works is the central one this lesson has been pointing at. Gradient descent shapes those latent dimensions so that the structure of the task lines up with the geometry of the representation. Classes become separable. Similar inputs become geometrically close. Useful directions emerge for the model to use. The latent space is the workspace the optimiser built for itself.

Compute spectrum: dimensions hit different walls

Embedding width is one of the first things that gets cut as you move down the compute spectrum.

Microcontroller. Latent dim 32–128. Quantised int8. Capacity bought by careful task scoping and pretraining, then distilled down.

Mobile / edge. Latent dim 256–768. fp16/int4 weights. On-device embedding models target this width because it fits RAM and NPU throughput.

Workstation. Latent dim 1024–4096. fp16/bf16. Many open-weight LLMs up to roughly the 7B–8B range sit here; larger models run wider, because width tracks parameter count.

Hyperscale. Latent dim 8192–18432. fp16/fp8 with mixed precision. Wall: VRAM, interconnect, and the d² scaling of compute per layer.

The same task, deployed at different tiers, gets a different representational budget. Same maths; different ceiling on capacity. The capability differences between tiers track the dimensional budget more than any other single factor.

Capacity ↔ generalisation, the deepest thread

Phase 1 said capability is downstream of representation. L11 made representation a vector. L12 made the geometry between vectors meaningful. This lesson finishes that thread: representational capacity, measured by independent learned dimensions, is what lets a model encode the structure of a task richly enough to generalise inside it.

A model with too few dimensions has to compromise. It can fit common cases by collapsing distinctions that don't help on average. The brittleness shows up at the edges, where the discarded distinctions actually mattered. A model with enough dimensions can keep the distinctions and generalises cleanly.

That's why scaling laws look the way they do. Capability increases with parameter count partly because parameter count buys hidden width, and hidden width buys representational capacity, and representational capacity buys the ability to encode the structure of harder tasks. When people ask "why are big models better", the dimensional story is one of the load-bearing answers.

compression · what to carry forward

A dimension is a representational degree of freedom: an independent way the system distinguishes things.
Only independent axes count. Correlated axes collapse to fewer effective dimensions.
Modern AI largely replaced hand-crafted features with learned ones. Gradient descent builds the dimensions the loss rewards.
More dimensions make hard separations easy by lifting data into spaces where simple boundaries do the work.
Capacity costs compute and memory: storage scales with d, matmul scales with d², KV cache scales with d × context × layers.
High-dim spaces behave non-intuitively (near-orthogonality, distance concentration, volume near the surface), and learned structure is what makes them usable.
Capability scales with representational capacity. That's why bigger models, given enough data, can generalise on harder tasks.

What you should be able to do now

Explain why "an embedding has dimension 768" is an engineering decision, not a default.
Distinguish between nominal dimensionality and effective dimensionality with a worked example.
Describe a non-linearly-separable problem and explain how lifting into a higher-dimensional space makes it separable.
Name three high-dimensional behaviours that contradict 3D intuition, and what each means operationally.
Trace how doubling the hidden width changes storage, compute, and KV cache costs.
Explain why a model can't generalise on a task that requires more representational capacity than its width allows.
Connect "representation capacity" to one of the five core laws from Phase 1.

Retrieval practice

Write before you reveal. Trace mechanism; don't summarise.

L13 A junior engineer asks: "Why don't we just use 50-dim embeddings? They're cheaper, and we can still cluster things." Explain the answer using the vocabulary of representational capacity, independent dimensions, and separability.

You can get usable clustering at 50 dim for narrow domains, but the question is what fine-grained distinctions you need the embedding to carry. Each dimension is a representational degree of freedom; the number of independent axes the embedding has bounds the number of independent semantic axes it can encode. At 50 dim, you've got maybe 30-40 effectively independent directions after correlations. That's enough to separate broad topical clusters (animals vs vehicles vs fruit), but not enough to keep the within-cluster structure that matters for retrieval: distinguishing "Labrador retriever" from "golden retriever" needs subtler directions than "dog" from "car". At 768 dim, the model has hundreds of effective directions, which is what lets it encode hierarchy, register, sentiment, syntactic role, and the long tail of niche distinctions all in one space. The cheaper embedding works until the task starts asking for distinctions it doesn't have the dimensions to hold; then accuracy degrades unevenly, often at exactly the edge cases that matter. The trade is dimensional capacity vs serving cost; the right answer is the smallest width that keeps the distinctions the task needs.

L13 A colleague claims a 2048-dim embedding model is "twice as expressive" as a 1024-dim one. Push back, using effective dimensionality and the curse of dimensionality.

The 2048-dim model is at most twice as expressive in nominal dimension. Effective dimensionality is usually quite a bit less than nominal because training rarely produces fully independent directions across the entire width. The actual gain depends on whether the additional dimensions got shaped by the training data into useful axes or are mostly redundant. There's also the high-dim weirdness to weigh against the extra capacity. As you push past about 1k dimensions, random-vector behaviour starts to dominate any region the training didn't well-cover: angles concentrate near 90°, distances concentrate near a fixed value, and nearest-neighbour queries become less discriminating in those regions. A well-trained 2048-dim model can suppress this by structuring its space carefully, but a poorly trained one can be measurably worse than the 1024-dim model on retrieval tasks because the geometry is more "random-vector-shaped" in the under-trained directions. Operationally: doubling width buys real capacity only when the training data and objective fill the new dimensions with structure. Otherwise it just buys extra compute cost (matmul scales with d²) and a worse-behaved geometry. The honest answer is "it depends on how the bigger model was trained", not "it's twice as expressive".

L13 Explain why "deep neural networks" are deep and what that has to do with dimensions and lifting.

Each layer of a network applies a linear map followed by a nonlinearity, which together act as a small lift into a slightly different representation space. A shallow network can only lift a little before it has to produce an output; complex tasks need representations the input space can't reach in one or two steps. Stacking layers compounds the lifts: by layer 20, the representation is in a space many transformations removed from the input. The deeper composition lets the network construct representations rich enough that the final classifier's decision boundary becomes simple in the latent space, even when the task's boundary is hopelessly tangled in input space. That's the XOR story (FIG 13.3) iterated many times: each layer adds the representational freedom needed for the next layer to do its piece of the work. The width of each layer's hidden state sets how many independent features that layer's space can carry; the depth sets how many compositional lifts the network can perform. Capability is a function of both, with diminishing returns when either alone scales beyond what the other can use.

L13 A vector database team is deciding between 384-dim and 1024-dim embeddings for an internal docs system with 50M chunks. They have a serving budget of 200 GB RAM. Walk through the trade using the costs and capacity ideas from this lesson.

Storage scales linearly with width. 50M × 384 dim × 2 bytes (fp16) = 38.4 GB for the raw vectors at 384 dim; the same at 1024 dim is 102.4 GB. Add ANN index overhead (HNSW commonly doubles or triples raw size), and 384-dim lands at roughly 77–115 GB, comfortably inside 200 GB, while 1024-dim lands at roughly 205–307 GB, at or over budget. Compute at query time: brute-force cosine cost is linear in dim, ANN cost is sub-linear in collection size but still scales with dim per comparison. 1024-dim queries are roughly 2.7× more expensive than 384-dim on the same hardware, ignoring index structure differences. Capacity: 384-dim can probably hold the broad topical structure of internal docs but may struggle with fine-grained technical distinctions (which sub-component, which version, which date) that show up in long-tail queries. 1024-dim has more room for those distinctions but costs more to serve. Concrete recommendation framework: run both on a representative query set; if 384-dim hits target recall on the queries that matter, take the cheaper option; if it misses on edge queries that have business value, the extra width is the price of getting them right. Often the right answer is 768-dim as a compromise, or 1024-dim with int8 quantisation to roughly halve storage and stay in budget. The trade is mechanical: capacity vs serving cost, decided by which capability matters.
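
The sizing in this answer can be laid out as a short script. It is a sketch under the answer's own figures (50M chunks, a 200 GB budget, an index that multiplies raw size by 2 to 3) plus one simplifying assumption of its own: that the index overhead scales with raw size for int8 as well. The option names are labels invented for this example.

```python
N_CHUNKS = 50_000_000
BUDGET_GB = 200.0
INDEX_OVERHEAD = (2.0, 3.0)  # multiplier range on raw size, from the answer above


def raw_gb(n_vectors, dim, bytes_per_value):
    """Raw vector storage in decimal gigabytes."""
    return n_vectors * dim * bytes_per_value / 1e9


def footprint_range_gb(n_vectors, dim, bytes_per_value):
    """(low, high) footprint once index overhead is applied."""
    raw = raw_gb(n_vectors, dim, bytes_per_value)
    return raw * INDEX_OVERHEAD[0], raw * INDEX_OVERHEAD[1]


def verdict(low, high, budget):
    if high <= budget:
        return "fits"
    if low <= budget:
        return "borderline"
    return "over budget"


options = {
    "384-dim fp16": (384, 2),
    "1024-dim fp16": (1024, 2),
    "1024-dim int8": (1024, 1),
}
verdicts = {}
for name, (dim, bytes_per_value) in options.items():
    low, high = footprint_range_gb(N_CHUNKS, dim, bytes_per_value)
    verdicts[name] = verdict(low, high, BUDGET_GB)
    print(f"{name:14s} {low:6.1f}-{high:6.1f} GB  {verdicts[name]}")
```

Whether the cheaper option is good enough is still a recall question, and only a representative query set can answer it.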

↳ Phase 2 Look ahead. Dimensions are degrees of freedom for representing things. What single mathematical object, once you learn it, will let you describe how those representations get transformed from one space to another? Predict the answer and what it'll let you do.

Matrices. A matrix is a rectangular grid of numbers that, when multiplied by a vector, produces another vector in (possibly) a different space. Every linear layer in a neural network is a matrix multiplication: take a width-d_in vector, multiply by a (d_out × d_in) matrix, get a width-d_out vector. That's the lift between layers' representation spaces, made algebraic. Once you know matrices, you can read attention as a sequence of matrix multiplications, the embedding table as a single matrix indexed by token id, and every projection in a transformer as one matrix acting on one vector. The whole forward pass becomes "apply matrix, add bias, apply nonlinearity, repeat", and the maths gets compact enough to actually read. Beyond that, the gradient updates that train the network are themselves matrices being updated. The dimensions you learned here become the row and column counts of those matrices; the operations on dimensions you learned here are what those matrices do. L14 is where the apparatus arrives.
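
As a small preview, here is the (d_out × d_in) matrix acting on a width-d_in vector in plain Python. The matrix and vector values are made up; the point is only the shape bookkeeping: three numbers in, two numbers out.

```python
def matvec(matrix, vector):
    """Multiply a (d_out x d_in) matrix by a width-d_in vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each row must have one entry per input dimension")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical 2 x 3 matrix: maps a 3-dim representation to a 2-dim one.
W = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, -1.0],
]
x = [1.0, 2.0, 3.0]

y = matvec(W, x)
print(f"input width {len(x)} -> output width {len(y)}: {y}")
```

Each row of the matrix produces one output dimension as a weighted mix of all the input dimensions.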

Next station

You now have vectors, geometry between them, and capacity behind them. The next move on the wall is the object that transforms one representation space into another: the matrix. L14 puts it on the board next to everything you've already drawn.
