L11 gave you the arrow on the board. This lesson is where the arrows organise themselves into structure with meaning. By the end, you should look at a vector database, a recommender, a RAG pipeline, and a multimodal model and see one underlying mechanism in all of them: a geometric search problem inside a learned space.
The shift you're making is the one most people miss when they first meet embeddings. A model doesn't "know" that dog is similar to cat the way a human knows it. The model has assigned a vector to each, and during training those vectors got pulled into nearby positions because that's what the loss function rewarded. The similarity is geometric. Reading the system means reading the geometry.
Lay your tools out on the workbench. Drivers in one zone, pliers in another, scopes off to the right, hand tools on the front edge, expensive measuring gear at the back. You did that without thinking. Related tools sit near each other; unrelated tools sit further apart. If a colleague asks for "something like the calipers", they're going to look in the same region of the bench.
That's everything this lesson is about, applied to vectors. The bench is the embedding space. The position of each tool is the embedding of that tool. Closeness on the bench is similarity. A "search for related items" is a geometric query: look near here.
The maths in this lesson formalises three questions the bench already answers without you noticing: how far apart are two items, do they point the same way, and what else sits nearby?
None of this is new physics. It's bookkeeping for an intuition you already trust.
L11 introduced the magnitude of a vector. Distance between two vectors is the magnitude of their difference. That's the whole definition.
Take two points a = (a₁, a₂) and b = (b₁, b₂) in 2D. The arrow from a to b is the vector b − a = (b₁ − a₁, b₂ − a₂). Its length is Pythagoras on the components:
distance(a, b) = ‖b − a‖ = √( (b₁ − a₁)² + (b₂ − a₂)² )
In 3D, you add a third squared term. In n dimensions, you sum n squared differences:
distance(a, b) = √( Σ (bᵢ − aᵢ)² )
This is Euclidean distance, the workhorse measure for "how far apart". It doesn't care about absolute position, only about the gap. It treats every dimension equally. It's symmetric: distance(a, b) = distance(b, a). And it scales naturally with dimension: a 768-dim embedding has the same formula as a 2D vector, with 768 squared differences inside the square root instead of two.
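The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the function name `euclidean_distance` and the sample points are made up for the example. It checks the two properties just named, symmetry and indifference to absolute position.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))


# Hypothetical 2D points chosen so the gap is a 3-4-5 triangle.
a = [1.0, 2.0]
b = [4.0, 6.0]
print(euclidean_distance(a, b))  # 5.0
print(euclidean_distance(b, a))  # 5.0, symmetric

# Shifting both points by the same offset leaves the distance unchanged.
shift = [10.0, -3.0]
a_shifted = [x + s for x, s in zip(a, shift)]
b_shifted = [x + s for x, s in zip(b, shift)]
print(euclidean_distance(a_shifted, b_shifted))  # 5.0
```

The same function works unchanged on 768-dim lists. In Python 3.8 and later, the standard library's `math.dist` computes the same quantity.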
Consider two 768-dim embeddings that agree on 68 dimensions and differ on the other 700. They are far apart. Euclidean distance treats every dimension equally and sums all 768 squared differences. Even though 68 of them contribute nothing, the other 700 push the total far above zero. This is one of the practical reasons high-dimensional embeddings need careful handling: mismatches accumulate across dimensions until they dominate the distance, and not every dimension carries equal semantic weight. The fix for that, when you need it, is to switch from Euclidean to something direction-based. Coming up.
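To put a number on that accumulation, here is a small sketch under invented assumptions: two 768-dim vectors that match on 68 dimensions and differ by a modest 0.1 on each of the other 700. The gap size and the vectors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))


DIM = 768
MATCHING = 68   # dimensions where the two vectors agree exactly
GAP = 0.1       # hypothetical per-dimension mismatch everywhere else

a = [0.0] * DIM
b = [0.0] * MATCHING + [GAP] * (DIM - MATCHING)

d = euclidean_distance(a, b)
print(round(d, 3))  # 2.646, i.e. sqrt(700 * 0.1**2) = sqrt(7)
```

Each individual gap is only 0.1, but 700 of them add up to a distance more than 26 times larger than any single one.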
Walk through the bench example again. A small precision screwdriver and a big mechanic's screwdriver are obviously related. They're both screwdrivers. But they're very different sizes. If you embed them as vectors whose magnitude tracks something like "physical size" or "frequency in the catalogue", their tip positions in space might end up far apart, while their directions from the origin point the same way.
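That's the moment to introduce the second flavour of similarity. There are two ways two vectors can be alike: their tips can sit close together (small Euclidean distance), or they can point in the same direction from the origin (small angle between them), whatever their lengths.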
Both are useful. Which one you want depends on what the magnitudes mean in your space. In a lot of learned embedding spaces, magnitude encodes things like "how often does this token appear" or "how strong is this signal", and meaning lives in direction. That makes direction-based similarity the right tool for most semantic retrieval work.
Cosine similarity asks one question: how aligned are the directions of these two vectors?
The answer is the cosine of the angle between them.
That's the whole intuition. The number runs from −1 (opposite directions) through 0 (perpendicular) to +1 (same direction), and it ignores magnitude entirely. A short arrow and a long arrow pointing the same way score 1 against each other.
The formula uses the dot product. For two vectors a and b:
a · b = a₁·b₁ + a₂·b₂ + … + aₙ·bₙ (componentwise multiply, then sum)
cosine(a, b) = (a · b) / (‖a‖ · ‖b‖)
You don't need to derive this. The intuition is the load-bearing part: a · b is big and positive when the arrows align, near zero when they're perpendicular, and negative when they point in opposite directions. Dividing by the two magnitudes strips out the size information, leaving a pure direction measure.
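A minimal sketch of the two formulas, using only the standard library. The helper names (`dot`, `norm`, `cosine_similarity`) and the sample arrows are illustrative choices, not part of any particular library.

```python
import math


def dot(a, b):
    """Componentwise multiply, then sum."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Magnitude of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes, leaving only direction."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom


# Hypothetical 2D arrows.
short_arrow = [1.0, 2.0]
long_arrow = [10.0, 20.0]     # same direction, ten times the magnitude
perpendicular = [-2.0, 1.0]
opposite = [-1.0, -2.0]

print(cosine_similarity(short_arrow, long_arrow))     # approximately 1
print(cosine_similarity(short_arrow, perpendicular))  # 0.0
print(cosine_similarity(short_arrow, opposite))       # approximately -1
```

The aligned and opposite cases can land a hair away from exactly ±1 because of floating-point rounding in the square roots, which is why comparisons on cosine values are usually done with a tolerance.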
This is the operation that runs underneath most modern semantic retrieval. When you query a vector database with "show me documents like this one", what the database is doing, conceptually, is computing cosine similarity (or some close cousin) between your query vector and every document vector, then returning the top few. The maths is exactly this formula, executed many millions of times.
Once you have a distance (or a similarity), you have neighbourhoods. The k nearest neighbours of a point are the k other points with the smallest distance to it. The ε-neighbourhood of a point is everything within distance ε of it. Both are just thresholds on the same underlying measure.
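Both neighbourhood definitions reduce to a sort or a filter over distances. The sketch below uses invented 2D bench positions for a handful of tools; the names `k_nearest` and `epsilon_neighbourhood` are hypothetical helpers, and `math.dist` (Python 3.8+) supplies the Euclidean distance.

```python
import math


def k_nearest(query, points, k):
    """Return the k (distance, label) pairs with the smallest distance to query."""
    scored = sorted((math.dist(query, vec), label) for label, vec in points.items())
    return scored[:k]


def epsilon_neighbourhood(query, points, eps):
    """Return the labels of every point within distance eps of query."""
    return sorted(
        label for label, vec in points.items() if math.dist(query, vec) <= eps
    )


# Hypothetical bench coordinates: measuring gear at the back, hand tools up front.
bench = {
    "calipers": (9.0, 9.0),
    "micrometer": (9.5, 8.5),
    "dial gauge": (8.0, 9.5),
    "pliers": (2.0, 3.0),
    "hammer": (1.0, 1.0),
}

query = bench["calipers"]
others = {name: pos for name, pos in bench.items() if name != "calipers"}

print([label for _, label in k_nearest(query, others, 2)])
# ['micrometer', 'dial gauge']
print(epsilon_neighbourhood(query, others, 1.0))
# ['micrometer']
```

The k-nearest query always returns k items, however far away they are. The ε-neighbourhood can return none or many, depending on how dense the region is.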
A cluster is a region of space with high local density of points. You don't need to write a fancy definition. A cluster is a place on the map where many things landed near each other and few things landed in the surrounding area. K-means, DBSCAN, hierarchical clustering: they all formalise this in different ways, but they're all doing the same thing.
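As one concrete formalisation, here is a bare-bones sketch of k-means (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its members, repeat. The points and starting centroids are invented, the iteration count is fixed rather than convergence-tested, and real libraries add smarter initialisation on top of this.

```python
import math


def mean(points):
    """Componentwise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return centroids, assignment


# Hypothetical 2D points forming two dense regions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, centroids=(points[0], points[3]))

print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees a label. The two groups fall out of the distances alone, which is the sense in which a cluster is just a dense place on the map.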
The interesting question is why clusters appear in learned embedding spaces at all. The answer comes back to the training objective. When the loss rewards "similar items have similar representations", gradient descent pulls similar items toward each other in the geometry. Over millions of updates, that pulling produces dense regions where related items pile up. The clusters aren't designed in; they're a side effect of the gradient signal acting on a continuous space.
This is why clustering on a well-trained embedding space often picks up semantically meaningful groups. The categories show up in the geometry because the training process was, indirectly, a clustering pressure all along.
Take "dog", "cat", "lion", and "tiger". They end up near each other because the contexts they appear in overlap. They all show up in sentences about feeding, sleeping, hunting, fur, four legs, and mammals. The next-token objective rewards representations that handle those contexts well, and the cheapest way to handle overlapping contexts is to have overlapping representations. So all four end up in a broader "animal" neighbourhood, with pets closer to each other inside that neighbourhood and wild animals closer to each other in their own sub-region. The hierarchy emerged from the training distribution. Nobody designed it in.
Now the industry vocabulary stops being mysterious.
A vector database stores a large collection of embedding vectors (documents, images, products, users), along with the original items those vectors came from. Its job is to answer one question quickly: "given a query vector q, which stored vectors are nearest to it?"
That's it. The whole product category is built around making one geometric operation fast at scale. Distance (Euclidean or cosine-based) is the relevance signal. "Nearest" is the relevance ranking. There's no symbolic understanding inside the database, no rules engine, no ontology. The database is doing geometry.
Brute force works for small corpora: compute the distance from q to every stored vector, sort, return the top k. Past a few hundred thousand vectors at high dimension, brute force gets too slow for interactive queries. So real systems use approximate nearest-neighbour (ANN) index structures such as HNSW and IVF, provided by libraries such as FAISS and ScaNN, that trade a small amount of recall for orders-of-magnitude lower query latency. The trade is mechanical: the index pre-organises the geometry so that most of the space can be ruled out cheaply.
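The brute-force path can be sketched as a toy in-memory store. Everything here is hypothetical: `BruteForceIndex` is an invented class, not the API of any real vector database, and the 3-dim vectors stand in for real embeddings. Vectors are normalised on insert, so the dot product at query time equals cosine similarity.

```python
import heapq
import math


def normalise(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / length for x in v]


class BruteForceIndex:
    """Toy stand-in for a vector database (hypothetical, for illustration only)."""

    def __init__(self):
        self._items = []  # list of (unit_vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((normalise(vector), payload))

    def search(self, query, k):
        """Score every stored vector against the query and keep the top k."""
        q = normalise(query)
        scored = (
            (sum(qi * vi for qi, vi in zip(q, vec)), payload)
            for vec, payload in self._items
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


index = BruteForceIndex()
index.add([0.9, 0.1, 0.0], "doc about dogs")
index.add([0.8, 0.3, 0.1], "doc about cats")
index.add([0.0, 0.2, 0.9], "doc about engines")

results = index.search([1.0, 0.2, 0.0], k=2)
print([payload for _, payload in results])
# ['doc about dogs', 'doc about cats']
```

Every query touches every stored vector, so cost grows linearly with corpus size. An ANN index exists to avoid exactly that full scan.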
The same geometric machinery extends past text. Modern multimodal models (CLIP, SigLIP, and their relatives) train text encoders and image encoders together so that an image of a dog and the caption "a photograph of a dog" land near each other in a shared embedding space. Same space, two encoders feeding it.
Once the geometry is shared, the operations you already know start crossing modalities. Cosine similarity between an image vector and a text vector measures how well the caption describes the image. Nearest-neighbour search over a corpus of image vectors, using a text query, becomes text-to-image retrieval. Clustering over the joint space finds visual-and-textual themes simultaneously.
The mechanism is the same one this lesson opened with: training shapes a space so that geometric relationships mean what you want them to mean. The only new ingredient is that the training loss now ties two encoders to the same space at once.
L11 ended with a half-formed version of "geometry enables generalisation". This lesson sharpens it.
A learned model is, roughly, a continuous function from input space through a representation space to an output. Continuity means that small changes in the representation produce small changes in the output. The training process builds the representation space so that task-relevant similarity shows up as small distance. Put those two together and you get the operational claim: if the model works on input x, and x' is close to x in the representation, the model probably also works on x'.
That's why dense, well-organised regions of an embedding space are where a model is reliable, and sparse or oddly-shaped regions are where it's brittle. Interpolation works because the geometry is dense and smooth. Extrapolation fails because the geometry isn't telling you anything about the new region.
This is the moment to look back at the calibration scenario where a semantic-search system worked on common queries and failed on rare ones (C1.12). The mechanism for that failure now reads as one sentence: the geometry was well-formed in the dense region of the training distribution and unconstrained in the sparse region, so distances in the sparse region stopped meaning anything.
In a well-trained embedding space, ‖v(dog) − v(cat)‖ < ‖v(dog) − v(car)‖. Every downstream operation that depends on similarity (retrieval, ranking, clustering, generalisation, attention) is a geometric operation on this space. The training objective shaped the space. The model navigates it.
The same geometric operations show up at every scale. Only the implementation moves.
What changes across tiers is dimensionality (sometimes), precision (fp32 → int8), index strategy (brute force → HNSW → IVF-PQ), and which constraint dominates (memory on edge, tail latency at scale). The operation is the same: compute a similarity, find the nearest, return the top.
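The precision step (fp32 → int8) can be illustrated with symmetric scalar quantisation, one simple scheme among several. This is a sketch under that assumption, not a description of how any particular index stores vectors. The function names and the sample vector are invented.

```python
def quantise_int8(vector):
    """Symmetric scalar quantisation: map floats to integers in [-127, 127]."""
    peak = max(abs(x) for x in vector)
    if peak == 0.0:
        return [0] * len(vector), 1.0
    scale = peak / 127.0  # one float per vector, kept alongside the codes
    return [round(x / scale) for x in vector], scale


def dequantise(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


# Hypothetical 4-dim embedding.
vector = [0.12, -0.98, 0.45, 0.0]
codes, scale = quantise_int8(vector)
restored = dequantise(codes, scale)

worst_error = max(abs(x - r) for x, r in zip(vector, restored))
print(codes)                              # [16, -127, 58, 0]
print(worst_error <= scale / 2 + 1e-12)   # True
```

Each component is off by at most half a quantisation step, so directions shift only slightly while storage drops to one byte per dimension plus a single scale factor.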
The geometric story is powerful enough that it's worth naming the places it doesn't hold.
First, sparse regions of training. The embedding space is well-organised where training data was dense. In sparse regions, the geometry is essentially unconstrained, so distances don't track meaning. This is the rare-query failure from C1.12.
Second, distribution shift. If your query distribution looks different from your training distribution, query vectors may land in regions of the space the model doesn't handle reliably, even if those regions are dense for the corpus.
Third, embedding collapse. A poorly trained contrastive model can collapse most inputs to a single region, making distances near-uniform and similarity meaningless. If cosine similarity is uniformly high (close to 1) for almost every pair, including pairs that should be unrelated, that is a strong signal of collapse.
Fourth, high-dimensional weirdness. In very high-dimensional spaces, random vectors are nearly orthogonal, distance distributions concentrate, and the difference between "near" and "far" can compress. Practical embedding systems work because the learned geometry is far from random; it has structure. But anytime that structure is weak, the high-dim weirdness shows up.
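The near-orthogonality of random vectors is easy to see empirically. The sketch below draws pairs of random Gaussian vectors at a few dimensions and averages the absolute cosine similarity. The dimensions, pair count, and seed are arbitrary choices for the demonstration.

```python
import math
import random


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine| over random Gaussian vector pairs of the given dimension."""
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine_similarity(a, b))
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
for dim in (2, 32, 768):
    print(dim, round(mean_abs_cosine(dim, 200, rng), 3))
```

The average shrinks toward zero as the dimension grows. In 2D, random arrows are often well aligned by chance; at 768 dimensions they almost never are. A learned space that still shows strong similarities at that dimension is doing so because of structure, not luck.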
None of these undermine the geometric worldview. They're the operating envelope: the geometry is meaningful inside the regions the training process actually shaped, and unreliable outside them. Reading a system means knowing which region you're in.
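One cheap way to check which region you're in, at least for the collapse failure mode, is to average the cosine similarity over pairs of embeddings that ought to be unrelated. The sketch below is a rough diagnostic: the vectors are invented, and the 0.9 threshold is an arbitrary assumption you would tune for your own model.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


COLLAPSE_THRESHOLD = 0.9  # hypothetical cut-off, not a standard value

# Hypothetical embeddings: one spread-out set, one squeezed into a single direction.
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
collapsed = [[1.0, 0.01, 0.0], [1.0, 0.0, 0.01], [1.0, 0.02, 0.0], [1.0, 0.0, 0.02]]

for name, vectors in (("healthy", healthy), ("collapsed", collapsed)):
    score = mean_pairwise_cosine(vectors)
    print(name, round(score, 3), "collapse suspected" if score > COLLAPSE_THRESHOLD else "ok")
```

On a real system you would sample a few thousand embeddings of deliberately unrelated items instead of four toy vectors. The reading is the same: an average near 1 means distances have stopped discriminating.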
You now have the apparatus for representing things as vectors and measuring their relationships geometrically. The next move on the wall is to handle the transformations that act on those vectors: rotations, scalings, projections, the matrix as a single algebraic object that does work on an input vector. That's L13.