L11 gave you the arrow. L12 gave you the geometry between arrows. Both used 2D and 3D pictures because they had to. Real AI representation spaces are 768, 1024, 4096, sometimes 12288 dimensions wide. That choice isn't arbitrary, and it isn't decoration. Each dimension is a slot the model can use to encode a distinction. Spaces are large because the work the model is doing requires that many slots.
This lesson is where the gap between a picture you can draw and a space the model actually uses stops being mysterious. Once you see dimensions as degrees of freedom, the embedding widths in modern papers start reading as engineering decisions, not magic numbers.
You can describe a resistor with one number: its value. 10 kΩ. That places it on a 1D line. Useful, but not enough for design work.
Add tolerance: 10 kΩ ± 1%. Now each resistor lives at a point in a 2D plane (value, tolerance). A 1% part and a 5% part at the same nominal value sit in different places.
Add package size: 0402, 0603, 0805. Now it's 3D: (value, tolerance, package). Three numbers, three independent axes, one point per part.
Keep going. Power rating, temperature coefficient, voltage rating, end-of-life drift, MTBF, source country, MSL rating, cost per piece in a reel. Each new property is a new axis the part lives along. Each axis lets you make a distinction you couldn't make before.
That's the whole picture. A dimension is a thing about an item that can vary independently of the other things you already track. The space your items live in is the Cartesian product of all those axes: every combination of one value per axis is a possible point. Engineers call this a feature space. It's the same idea as the coordinate space from L11, with one new emphasis: the axes mean something.
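As a rough sketch of the resistor example, here is one way to hold a part as a point in a three-axis feature space. The class name, field names, and the choice to encode the package as a plain number are all assumptions made for illustration; a real parts database would treat package as a category.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Resistor:
    """One point in a small, hand-named feature space (hypothetical fields)."""
    value_ohms: float      # nominal resistance
    tolerance_pct: float   # e.g. 1.0 for a +/-1% part
    package_code: float    # package written as a number, e.g. 402.0 for 0402

    def as_vector(self) -> tuple:
        # Each field is one coordinate; the field order fixes the axis order.
        return astuple(self)


one_percent = Resistor(10_000.0, 1.0, 402.0)
five_percent = Resistor(10_000.0, 5.0, 402.0)

# Same nominal value and package, different tolerance: two different points.
print(one_percent.as_vector())
print(five_percent.as_vector())
print(one_percent.as_vector() != five_percent.as_vector())
```

Adding a property such as power rating would mean adding one field, which is the same thing as adding one axis.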
Not every new axis adds capacity. If you add a "resistance in ohms" axis and a "resistance in kilohms" axis, you haven't gained anything. The second axis is a constant multiple of the first; every part lies on the same straight line through the origin, and the space is really still 1D pretending to be 2D.
Dimensions only count when they're independent: when knowing the value along one axis tells you essentially nothing about the value along another. Two perfectly correlated axes collapse to one. Two perfectly orthogonal axes give you a genuine plane to spread out in.
Real features sit between those extremes. "Package size" and "power rating" are correlated (bigger packages dissipate more power), but they're not identical. The independent information in package-size, given that you already know power-rating, is still meaningful, just less than a fully orthogonal axis would offer. Engineers often call this the effective dimensionality of a representation: how many independent directions of variation actually exist, regardless of how many nominal axes you wrote down.
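A quick way to see the difference between a redundant axis and a merely correlated one is to measure the correlation. The sketch below uses Pearson correlation as a stand-in for "how much does one axis tell you about the other", which is a simplification. The package areas and power ratings are made-up illustrative numbers, not datasheet values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Redundant pair: kilohms is just ohms divided by 1000.
ohms = [100.0, 1_000.0, 4_700.0, 10_000.0, 47_000.0]
kilohms = [r / 1000.0 for r in ohms]

# Correlated but not identical pair (illustrative, made-up values).
package_area_mm2 = [0.5, 1.28, 1.28, 2.5, 2.5, 5.12]
power_rating_w = [0.0625, 0.1, 0.125, 0.125, 0.25, 0.25]

r_redundant = pearson(ohms, kilohms)
r_correlated = pearson(package_area_mm2, power_rating_w)

print(f"ohms vs kilohms:        {r_redundant:.3f}")   # effectively 1: no new axis
print(f"package vs power rating: {r_correlated:.3f}")  # high, but below 1
```

Correlation only captures linear dependence, so treat it as a first check rather than a full measure of effective dimensionality.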
For a long time, the dimensions in a representation came from human engineering. You sat down, decided what properties mattered for the task, and computed each of them as a feature.
Image recognition used HOG (gradient histograms in patches), SIFT (keypoint descriptors), GIST (rough scene shape). Speech recognition used MFCCs (cepstral coefficients from short windows of audio). Text classification used TF-IDF (term frequencies weighted by rarity). Each was a deliberately chosen list of numbers per input. The dimensions had names, and the names came from somebody's idea about what mattered.
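To make "a deliberately chosen list of numbers per input" concrete, here is a minimal TF-IDF sketch. It assumes the plainest variant (term frequency times log of inverse document frequency, with whitespace tokenisation); production libraries use smoothed and normalised versions. The three documents are invented for the example.

```python
import math
from collections import Counter

docs = [
    "the resistor is rated for one watt",
    "the capacitor is rated for fifty volts",
    "the resistor drifts with temperature",
]

tokenised = [d.split() for d in docs]
vocab = sorted({word for doc in tokenised for word in doc})
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenised for word in set(doc))


def tfidf(doc):
    """One number per vocabulary word: the hand-designed feature vector."""
    counts = Counter(doc)
    return [
        (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]


vectors = [tfidf(doc) for doc in tokenised]

# Every axis has a name: the word it counts.
for word, weight in zip(vocab, vectors[2]):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
```

A word that appears in every document, such as "the", gets weight zero, because a human decided in advance that rarity is what matters.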
This approach scales as far as human intuition does. It works fine for well-understood, narrow problems. It breaks down on hard, open-ended problems, because the features that actually matter are often things humans can't name. A face is recognisable from features no list of measurements captures cleanly. A sentence's tone lives in interactions between words that resist enumeration.
That's the bottleneck the field hit by the early 2010s. Hand-designed features were a ceiling, not a floor.
Modern AI replaced "engineer designs the features" with "the model learns them". Gradient descent, given enough capacity and data, finds dimensions that minimise the loss. The dimensions don't have to correspond to anything a human would name. They just have to carry the information the loss rewards.
A trained image model has internal dimensions that activate for "rounded edge here", "long horizontal line near top", "skin-tone region", "high-frequency repeating texture". Nobody told the model to track those things. Gradient descent invented them because they make the loss go down. Each is a learned representational axis.
The same move shows up everywhere. A language model's hidden states encode dimensions for "currently inside a quotation", "this sentence is a question", "the subject is plural", "we're in a formal register". A protein-folding model has dimensions for structural motifs. A recommender's user embeddings have dimensions that loosely correspond to taste clusters, none of which were specified up front.
This is the move worth holding onto. The model isn't filling in a coordinate system you handed it. It's building its own. The width of the embedding sets how many independent axes it has room to build.
One honest wrinkle, worth holding alongside the slot picture. Real models often pack more features than they have nominal dimensions, by encoding several at once into overlapping linear combinations of the same axes. Interpretability research calls this superposition; it's why a single hidden unit can look like it tracks several unrelated things, and why "one dimension, one clean feature" is an idealisation more than a literal description. The slot intuition still holds at the level of representation capacity. It just gets sharper once you know the slots can share occupants.
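A toy sketch of the idea, not a model of any real network: give 1024 hypothetical features a random direction each in a 256-dimensional space, switch on two of them, and read the features back with dot products. The sizes, the random directions, the 0.5 threshold, and the assumption that only a few features are active at once are all choices made for this illustration.

```python
import math
import random

rng = random.Random(0)
dim, n_features = 256, 1024   # four times more features than dimensions


def random_unit(d):
    """A random direction: Gaussian components, scaled to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Each feature gets its own direction; directions overlap slightly.
directions = [random_unit(dim) for _ in range(n_features)]

# Encode: the hidden vector is the sum of the active features' directions.
active = {3, 700}
hidden = [sum(directions[i][k] for i in active) for k in range(dim)]

# Decode: a feature counts as "on" if the hidden vector points along it.
recovered = {i for i, f in enumerate(directions) if dot(hidden, f) > 0.5}

print(sorted(recovered))
print(recovered == active)
```

The readout works here because few features are active at once and random directions in 256 dimensions interfere only weakly. Turn on many features together and the interference grows, which is the cost of sharing slots.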
The most concrete reason high-dimensional spaces matter is separability. Classes that look hopelessly tangled in low dimensions often become cleanly separable when you add the right extra dimensions.
Take the classic exclusive-or problem. Four points: (0,0) and (1,1) are class A; (0,1) and (1,0) are class B. In 2D, no straight line separates them. You can curve through them, but any flat boundary fails. The classes are not linearly separable in 2D.
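Now add a third dimension: z = x · y. Class A becomes (0,0,0) and (1,1,1); class B becomes (0,1,0) and (1,0,0). Now a flat plane does the job: x + y − 2z = 0.5 puts both class A points on one side (where x + y − 2z = 0) and both class B points on the other (where x + y − 2z = 1). A 2D problem you couldn't solve linearly became a 3D problem you could solve trivially.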
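The lift is small enough to check by hand or in a few lines. The sketch below searches a coarse grid of 2D lines (a sanity check, not a proof) and then tests one candidate plane in the lifted space. The plane x + y − 2z = 0.5 is one choice among many that work; the grid range and step are arbitrary.

```python
from itertools import product

points = {(0, 0): "A", (1, 1): "A", (0, 1): "B", (1, 0): "B"}


def separates(weights, bias, labelled_points):
    """True if w.p + bias has one sign for class A and the other for class B."""
    sides = {}
    for p, label in labelled_points.items():
        score = sum(w * x for w, x in zip(weights, p)) + bias
        if score == 0:
            return False  # a point sitting on the boundary is not separated
        sides.setdefault(label, set()).add(score > 0)
    return all(len(s) == 1 for s in sides.values()) and sides["A"] != sides["B"]


# 2D: try every line a*x + b*y + c = 0 on a coarse grid of coefficients.
grid = [i / 4 for i in range(-12, 13)]   # -3.0 to 3.0 in steps of 0.25
found_2d = any(separates((a, b), c, points) for a, b, c in product(grid, repeat=3))

# 3D: lift each point with z = x * y, then test the plane x + y - 2z = 0.5.
lifted = {(x, y, x * y): label for (x, y), label in points.items()}
found_3d = separates((1, 1, -2), -0.5, lifted)

print("separable by a grid line in 2D:", found_2d)
print("separable by the plane in 3D:  ", found_3d)
```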
The trick generalises. Lifting data into a richer space changes what counts as a "simple" boundary. The whole reason a neural network can carve very complex decision regions in input space is that it implicitly lifts the input into a high-dimensional intermediate space where simple linear separations do the work, then projects back.
Lift the points into a third dimension (z = x·y) and a flat plane separates the classes cleanly. Neural networks do this for a living: they lift inputs into rich learned spaces where the hard separation becomes easy.
In the XOR example the lift was designed by hand (z = x·y, because we knew the structure of XOR). What does a neural network do that's analogous, when it solves a non-linearly-separable problem in input space?
It learns the lift. A deep network's hidden layers transform the input into a sequence of intermediate representations. Each layer is, roughly, "apply a linear map, then a nonlinear function". The composition of all these layers is the lift into a high-dimensional space where the task becomes linearly separable at the final classifier. Nobody designed that representation; gradient descent built it by minimising the loss. The hidden dimensions of every layer are slots the optimiser uses to construct the lift.
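To see the "linear map, then nonlinearity" recipe act as a lift, here is a two-layer network that solves XOR. One loud caveat: the weights below are picked by hand for illustration. A real network would arrive at some equivalent set through gradient descent, and the layer sizes and ReLU choice are assumptions of this sketch.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Apply one linear map: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


# Hand-picked weights (NOT learned): hidden = relu([x + y, x + y - 1]).
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
# Final linear classifier on the hidden representation: h1 - 2 * h2.
W2 = [[1.0, -2.0]]
b2 = [0.0]


def hidden(x):
    return relu(linear(W1, b1, x))


def output(x):
    return linear(W2, b2, hidden(x))[0]


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [output(x) for x in inputs]

for x, y in zip(inputs, outputs):
    # An output above 0.5 means class B; the last step is purely linear.
    print(x, "hidden:", hidden(x), "output:", y)
```

In the hidden space the two class A inputs land at (0, 0) and (2, 1) and both class B inputs land at (1, 0), so a single linear readout is enough.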
How many distinct "things" can a space of dimension d represent? Roughly, an enormous amount. Even with just 32 binary dimensions, you've got 2³² ≈ 4.3 billion possible vectors. Real embeddings use 768 or more continuous dimensions, and the number of meaningfully different points in that space is vast beyond any number that's useful to write down.
But raw capacity isn't the operationally interesting number. The interesting number is how many usefully different distinctions the trained representation can carry. That depends on how many independent directions the training actually shaped. A 768-dim embedding where 50 dimensions do most of the work has effective capacity around 50, not 768.
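One common way to put a number on "how many dimensions do most of the work" is the participation ratio: the squared sum of the per-direction variances divided by the sum of their squares. Using it here is this sketch's choice, not something the lesson prescribes, and the variances below are invented. It also assumes the directions are already decorrelated, so each variance stands for one independent direction.

```python
def participation_ratio(variances):
    """Effective number of directions carrying the variance.

    Equals len(variances) when all are equal, and approaches 1 when a
    single direction dominates.
    """
    total = sum(variances)
    return total * total / sum(v * v for v in variances)


# Hypothetical 768-dim embedding: 50 busy directions, 718 nearly silent ones.
variances = [1.0] * 50 + [0.001] * 718

effective = participation_ratio(variances)
uniform = participation_ratio([1.0] * 768)

print(f"nominal width:         {len(variances)}")
print(f"effective (skewed):    {effective:.1f}")
print(f"effective (all equal): {uniform:.1f}")
```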
This is why model designers care about embedding width as a hyperparameter. Too narrow and the model can't hold enough features to do the task well; subtle distinctions collapse and representation capacity is capped. Too wide and most of the dimensions are underused, parameters are wasted, and compute and memory bills climb without payoff. The sweet spot is whatever width is just enough for the task at the given training scale.
This connects directly back to L4's emergence story. Some capabilities arrive sharply at scale because they need a minimum number of dimensions to encode, and below that width the model literally can't represent them. Once the width is enough, gradient descent finds the configuration and the capability switches on.
You should know about three counter-intuitions before you start trusting your 3D intuition in 1000 dimensions. None require formal derivation; each is worth recognising.
Random vectors are nearly orthogonal. In 2D, two random arrows have a 25% chance of being within 45° of each other, because the angle between them is spread evenly from 0° to 180°. In 1000D, the angle between any two random unit vectors is almost always close to 90°. There's so much "room to be different" that random things spread out into nearly-perpendicular directions. This is part of why high-D spaces have so much capacity, and part of why nearest-neighbour search gets harder.
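You can watch this happen with a few hundred random vectors. The sketch below draws directions with independent Gaussian components, which gives a uniformly random direction; the sample sizes and the fixed seed are arbitrary choices.

```python
import math
import random

rng = random.Random(0)


def random_direction(dim):
    # Independent Gaussian components give a uniformly random direction.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def angle_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))


def sample_angles(dim, pairs):
    return [angle_deg(random_direction(dim), random_direction(dim))
            for _ in range(pairs)]


angles_2d = sample_angles(2, 2000)
angles_1000d = sample_angles(1000, 200)

frac_within_45 = sum(a <= 45.0 for a in angles_2d) / len(angles_2d)
worst_gap_from_90 = max(abs(a - 90.0) for a in angles_1000d)

print(f"2D: fraction of pairs within 45 degrees: {frac_within_45:.2f}")
print(f"1000D: largest deviation from 90 degrees: {worst_gap_from_90:.1f}")
```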
Distances concentrate. In high dimensions, the distance from any one random point to most other random points tends to be close to a fixed value. The "nearest" and "farthest" points are barely distinguishable by distance alone. Learned embeddings beat this by being far from random: training pulls related items closer than random would predict. But it's a real effect, and it's part of why approximate-nearest-neighbour (ANN) indexes struggle to prune candidates in high dimensions: when every distance looks alike, there is little to rule out early.
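A small experiment makes the effect visible. It assumes points drawn uniformly from the unit cube, which is the random baseline, not what learned embeddings look like. "Relative contrast" below is the gap between farthest and nearest distance, divided by the nearest; the name and the sample size are this sketch's choices.

```python
import math
import random

rng = random.Random(0)


def relative_contrast(dim, n_points=200):
    """(farthest - nearest) / nearest, from one random query point."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


contrast_2d = relative_contrast(2)
contrast_1000d = relative_contrast(1000)

# In 2D the farthest point is many times farther than the nearest.
print(f"relative contrast in 2D:    {contrast_2d:.2f}")
# In 1000D nearest and farthest differ by only a small fraction.
print(f"relative contrast in 1000D: {contrast_1000d:.2f}")
```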
Volume sits near the surface. Most of the volume of a high-dimensional ball is concentrated near its surface, not its centre. This sounds odd until you remember that the volume of a d-dimensional ball scales as its radius to the power d. Shrink the radius by 1% and you keep only 0.99^d of the volume, which is almost nothing when d is large; everything else was in that outer 1%. Practical consequence: a ball of "good" representations in a high-D space contains essentially all its volume in a thin shell, which is why sampling and uniform priors have to be designed carefully.
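The shell claim is one line of arithmetic, since the volume of a d-dimensional ball grows as radius to the power d. The 1% shell thickness below is an arbitrary choice for illustration.

```python
def outer_shell_fraction(dim, thickness=0.01):
    """Fraction of a unit ball's volume lying within `thickness` of its surface.

    Volume scales as radius**dim, so the inner ball of radius (1 - thickness)
    holds (1 - thickness)**dim of the total.
    """
    return 1.0 - (1.0 - thickness) ** dim


shell_3d = outer_shell_fraction(3)
shell_1000d = outer_shell_fraction(1000)

for dim in (3, 100, 1000):
    print(f"d = {dim:4d}: outer 1% shell holds "
          f"{outer_shell_fraction(dim):.4%} of the volume")
```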
These are sometimes packaged as "the curse of dimensionality". It isn't really a curse, it's a property. The same expansiveness that gives high-dim spaces their capacity also makes them statistically and computationally awkward. Most of modern representation learning is, in part, about taming this awkwardness with structure.
Each new dimension costs memory, compute, and bandwidth. The trade is mechanical and shows up everywhere in production AI.
Storage of an embedding vector scales linearly with width. A 768-dim fp16 vector is about 1.5 kB (1,536 bytes); a 4096-dim fp16 vector is about 8 kB (8,192 bytes). A vector database of 100 million 4096-dim vectors needs roughly 820 GB in fp16, before any index overhead.
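The storage arithmetic is worth having as a function you can rerun for your own widths. This sketch counts raw vector bytes only, at 2 bytes per fp16 value; the function names are made up for the example and index overhead is ignored.

```python
BYTES_PER_FP16 = 2


def vector_bytes(dim, bytes_per_value=BYTES_PER_FP16):
    """Raw size of one embedding vector."""
    return dim * bytes_per_value


def store_bytes(n_vectors, dim, bytes_per_value=BYTES_PER_FP16):
    """Raw size of a collection of vectors, before any index overhead."""
    return n_vectors * vector_bytes(dim, bytes_per_value)


small = vector_bytes(768)
large = vector_bytes(4096)
database = store_bytes(100_000_000, 4096)

print(f"768-dim fp16 vector:  {small} bytes")
print(f"4096-dim fp16 vector: {large} bytes")
print(f"100M x 4096-dim fp16: {database / 1e9:.1f} GB")
```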
The matrix multiplications inside the model scale with the square of the hidden width. A linear layer mapping a width-d vector to a width-d vector takes O(d²) multiply-adds per token. Double the width and that piece of the model uses 4× the compute. Across many layers, those scaling factors stack.
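If the d² claim feels abstract, count the operations. The sketch below runs a plain matrix-vector product for one token and tallies multiply-adds; the widths 64 and 128 are arbitrary small values chosen so it runs instantly.

```python
def matvec_with_count(weights, vector):
    """Plain matrix-vector product that also counts multiply-adds."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, x in zip(row, vector):
            acc += w * x
            macs += 1
        out.append(acc)
    return out, macs


def macs_for_width(d):
    """Multiply-adds for one width-d to width-d linear layer, one token."""
    weights = [[0.0] * d for _ in range(d)]
    _, macs = matvec_with_count(weights, [1.0] * d)
    return macs


narrow = macs_for_width(64)
wide = macs_for_width(128)

print(f"width  64: {narrow} multiply-adds per token")
print(f"width 128: {wide} multiply-adds per token")
print(f"ratio: {wide / narrow:.0f}x for 2x the width")
```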
The KV cache in a transformer with standard multi-head attention scales with hidden width × context length × number of layers, for every sequence being served. At long context, large width, and large batch sizes, the cache can rival or exceed the model weights in memory, which is one of the structural reasons modern attention variants (grouped-query, sliding-window, latent attention) exist. The architecture is being bent by the constraint.
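A back-of-envelope estimate shows how quickly the cache grows. The sketch assumes standard multi-head attention, where each layer stores one key and one value vector of full hidden width per token. The `kv_fraction` parameter is this sketch's own shorthand for grouped-query attention (key/value heads divided by query heads), and the 32-layer, 4096-wide, 32k-context configuration is hypothetical.

```python
def kv_cache_bytes(n_layers, context_len, width,
                   bytes_per_value=2, kv_fraction=1.0):
    """Approximate KV cache size for ONE sequence.

    The leading 2 is one key plus one value per token per layer.
    kv_fraction < 1 models grouped-query attention, where fewer
    key/value heads are stored than there are query heads.
    """
    return int(2 * n_layers * context_len * width * kv_fraction * bytes_per_value)


GIB = 2 ** 30

# Hypothetical model: 32 layers, hidden width 4096, 32k-token context, fp16.
full = kv_cache_bytes(n_layers=32, context_len=32_768, width=4096)
grouped = kv_cache_bytes(n_layers=32, context_len=32_768, width=4096,
                         kv_fraction=0.25)

print(f"multi-head attention:         {full / GIB:.0f} GiB per sequence")
print(f"grouped-query (1/4 KV heads): {grouped / GIB:.0f} GiB per sequence")
```

Multiply by the number of concurrent sequences to get the serving footprint, which is where the pressure on architecture comes from.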
So the engineering question is rarely "how high-dimensional should we go" in isolation. It's "what's the smallest dimension that still gives us the representational capacity the task needs, given the compute and memory budget we have to live inside". That trade is the whole reason the compute spectrum looks the way it does.
"Latent space" is the term you'll hear for the high-dimensional space the model's intermediate activations live in. The input gets transformed through layers of computation; at each layer, the data is now represented in a different learned coordinate system. The deepest layer's representation, just before the output head, is usually called the latent space, but every layer has its own.
This is where the model does its actual reasoning. The input layer holds raw tokens or pixels. The output layer holds task-specific decisions. Everything in between is the model navigating a learned high-dim space where the task gets solved.
The reason it works is the central one this lesson has been pointing at. Gradient descent shapes those latent dimensions so that the structure of the task lines up with the geometry of the representation. Classes become separable. Similar inputs become geometrically close. Useful directions emerge for the model to use. The latent space is the workspace the optimiser built for itself.
Embedding width is one of the first things that gets cut as you move down the compute spectrum.
The same task, deployed at different tiers, gets a different representational budget. Same maths; different ceiling on capacity. The capability differences between tiers track the dimensional budget more than any other single factor.
Phase 1 said capability is downstream of representation. L11 made representation a vector. L12 made the geometry between vectors meaningful. This lesson finishes that thread: representational capacity, measured by independent learned dimensions, is what lets a model encode the structure of a task richly enough to generalise inside it.
A model with too few dimensions has to compromise. It can fit common cases by collapsing distinctions that don't help on average. The brittleness shows up at the edges, where the discarded distinctions actually mattered. A model with enough dimensions can keep the distinctions and generalises cleanly.
That's why scaling laws look the way they do. Capability increases with parameter count partly because parameter count buys hidden width, and hidden width buys representational capacity, and representational capacity buys the ability to encode the structure of harder tasks. When people ask "why are big models better", the dimensional story is one of the load-bearing answers.
You now have vectors, geometry between them, and capacity behind them. The next move on the wall is the object that transforms one representation space into another: the matrix. L14 puts it on the board next to everything you've already drawn.