PHASE 2 · THE WHITEBOARD WALL

Distance, similarity, and semantic geometry

Lesson 12. Second station on the whiteboard wall. ~26 min read + cards + retrieval. Durability tier 1 (bedrock; this is where meaning becomes geometry).

🗺️

Memory palace · Whiteboard wall · station 12

The pinned map. A coordinate grid covered with clustered dots, with red threads connecting nearby points, small angle arcs between selected arrows, and dashed distance markers. The wall is starting to look like a chart room.

Core idea. Modern AI systems don't store meaning as definitions. They store it as positions in a learned vector space. Distance and direction in that space carry semantic content, because the training objective put it there. Retrieval, recommendation, clustering, and semantic search are all geometric operations on top of that one fact.

Why this lesson exists

L11 gave you the arrow on the board. This lesson is where the arrows organise themselves into structure with meaning. By the end, you should look at a vector database, a recommender, a RAG pipeline, and a multimodal model and see one underlying mechanism in all of them: a geometric search problem inside a learned space.

The shift you're making is the one most people miss when they first meet embeddings. A model doesn't "know" that dog is similar to cat the way a human knows it. The model has assigned a vector to each, and during training those vectors got pulled into nearby positions because that's what the loss function rewarded. The similarity is geometric. Reading the system means reading the geometry.

The intuition: things you already know how to do

Lay your tools out on the workbench. Drivers in one zone, pliers in another, scopes off to the right, hand tools on the front edge, expensive measuring gear at the back. You did that without thinking. Related tools sit near each other; unrelated tools sit further apart. If a colleague asks for "something like the calipers", they're going to look in the same region of the bench.

That's everything this lesson is about, applied to vectors. The bench is the embedding space. The position of each tool is the embedding of that tool. Closeness on the bench is similarity. A "search for related items" is a geometric query: look near here.

The maths in this lesson formalises three questions the bench already answers without you noticing:

How far apart are two items? (Distance.)
Are these two items "alike" in some directional sense even if they live at different scales? (Cosine similarity.)
What are the items near this one? (Nearest-neighbour search.)

None of this is new physics. It's bookkeeping for an intuition you already trust.

FIG 12.1. The bench already encodes relatedness as position. Driver and pliers sit close because they're both hand tools. Driver and scope sit far because they aren't. Embedding spaces do exactly this, automatically, in many more dimensions.

Distance, written down

L11 introduced the magnitude of a vector. Distance between two vectors is the magnitude of their difference. That's the whole definition.

Take two points a = (a₁, a₂) and b = (b₁, b₂) in 2D. The arrow from a to b is the vector b − a = (b₁ − a₁, b₂ − a₂). Its length is Pythagoras on the components:

distance(a, b) = ‖b − a‖ = √( (b₁ − a₁)² + (b₂ − a₂)² )

In 3D, you add a third squared term. In n dimensions, you sum n squared differences:

distance(a, b) = √( Σ (bᵢ − aᵢ)² )

This is Euclidean distance, the workhorse measure for "how far apart". It doesn't care about absolute position, only about the gap. It treats every dimension equally. It's symmetric: distance(a, b) = distance(b, a). And it scales naturally with dimension: a 768-dim embedding has the same formula as a 2D vector, with 768 squared differences inside the square root instead of two.
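
The formula translates almost word for word into code. The following is a minimal sketch in plain Python; the function name euclidean_distance and the sample vectors are illustrative choices, not part of the lesson's material.

```python
import math

def euclidean_distance(a, b):
    """Length of the arrow from a to b: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

# 2D: a 3-4-5 right triangle.
d_2d = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# Same function, 768 dimensions: every component differs by 0.5.
a_768 = [0.0] * 768
b_768 = [0.5] * 768
d_768 = euclidean_distance(a_768, b_768)

print(d_2d)   # 5.0
print(d_768)  # sqrt(768 * 0.25) = sqrt(192), roughly 13.86
```

Nothing in the function knows how many dimensions it is handling. That is the "same formula, more terms" point in executable form.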

FIG 12.2. Distance between two points is the length of the arrow from one to the other. In 2D it's Pythagoras. In 768 dimensions it's still Pythagoras, with 768 squared differences inside the square root instead of two.

checkpoint · pause and answer in your head Two embedding vectors live in 768-dim space. Their components are wildly different in 700 of those dimensions but identical in the other 68. Are they close or far apart in Euclidean distance? Why?

Far apart. Euclidean distance treats every dimension equally and sums all 768 squared differences. Even if 68 of them contribute nothing, the other 700 push the total far above zero. This is one of the practical reasons high-dimensional embeddings need careful handling: mismatched dimensions dominate the distance whether or not they carry semantic weight, and even a handful of large gaps can swamp agreement everywhere else. The fix for that, when you need it, is to switch from Euclidean to something direction-based. Coming up.
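
You can check the checkpoint numerically. The sketch below uses made-up vectors (all values are hypothetical) that agree in 68 dimensions and differ by 2.0 in the other 700.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

DIM = 768
a = [1.0] * DIM
# Hypothetical vector b: identical to a in the first 68 dimensions,
# off by 2.0 in each of the remaining 700.
b = [1.0] * 68 + [3.0] * 700

shared = euclidean_distance(a[:68], b[:68])  # contribution of the matching dimensions
overall = euclidean_distance(a, b)           # sqrt(700 * 4) = sqrt(2800)

print(shared)   # 0.0
print(overall)  # roughly 52.9
```

The 68 matching dimensions add nothing to the sum. They cannot pull the total back down either.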

Two ways things can be "close"

Walk through the bench example again. A small precision screwdriver and a big mechanic's screwdriver are obviously related. They're both screwdrivers. But they're very different sizes. If you embed them as vectors whose magnitude tracks something like "physical size" or "frequency in the catalogue", their tip positions in space might end up far apart, while their directions from the origin point the same way.

That's the moment to introduce the second flavour of similarity. There are two ways two vectors can be alike:

Close tip positions. The arrows end near each other. That's small Euclidean distance.
Similar directions. The arrows point the same way, regardless of length. That's small angle between them.

Both are useful. Which one you want depends on what the magnitudes mean in your space. In a lot of learned embedding spaces, magnitude encodes things like "how often does this token appear" or "how strong is this signal", and meaning lives in direction. That makes direction-based similarity the right tool for most semantic retrieval work.
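
Here is a small sketch of the two flavours disagreeing. The 2D vectors are invented for illustration: big_driver points the same way as small_driver but is five times longer, while scope has the same length as small_driver and sits at a right angle to it.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def angle_degrees(a, b):
    """Angle between two arrows, from the dot product and the two magnitudes."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp against rounding error
    return math.degrees(math.acos(cos))

small_driver = [1.0, 2.0]
big_driver = [5.0, 10.0]  # same direction, five times longer
scope = [2.0, -1.0]       # same length as small_driver, perpendicular to it

print(euclidean_distance(small_driver, big_driver))  # roughly 8.94
print(angle_degrees(small_driver, big_driver))       # approximately 0
print(euclidean_distance(small_driver, scope))       # roughly 3.16
print(angle_degrees(small_driver, scope))            # 90 degrees
```

By distance, the scope is the closer neighbour. By direction, the big screwdriver is. Which answer is right depends on what magnitude means in your space.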

FIG 12.3. Two vectors with the same direction can be far apart in distance if their lengths differ. Two vectors with the same length can be 90° apart in direction. Distance and direction measure different things; the right tool depends on what the magnitudes in your space mean.

Cosine similarity: comparing directions

Cosine similarity asks one question: how aligned are the directions of these two vectors?

The answer is the cosine of the angle between them.

Cos(0°) = 1: arrows point the same way. Maximally similar.
Cos(90°) = 0: arrows are perpendicular. No directional similarity.
Cos(180°) = −1: arrows point opposite. Maximally dissimilar.

That's the whole intuition. The number runs from −1 (opposite) through 0 (perpendicular) to +1 (aligned), and it ignores magnitude entirely. A short arrow and a long arrow pointing the same way score 1 against each other.

The formula uses the dot product. For two vectors a and b:

a · b = a₁·b₁ + a₂·b₂ + … + aₙ·bₙ        (componentwise multiply, then sum)

cosine(a, b) = (a · b) / (‖a‖ · ‖b‖)

You don't need to derive this. The intuition is the load-bearing part: a · b is big and positive when the arrows align, near zero when they're perpendicular, and negative when they point opposite. Dividing by the two magnitudes strips out the size information, leaving a pure direction measure.
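
The two formulas above fit in a few lines of plain Python. This is a minimal sketch; the helper names dot, norm, and cosine_similarity are illustrative, and the zero-vector check is an added safeguard, since the formula divides by the magnitudes.

```python
import math

def dot(a, b):
    """Componentwise multiply, then sum."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Magnitude of a vector: square root of its dot product with itself."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom

aligned = cosine_similarity([1.0, 0.0], [3.0, 0.0])    # same direction, different length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 2.0])
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])

print(aligned, perpendicular, opposite)  # 1.0 0.0 -1.0
```

The lengths 3.0, 2.0, and 4.0 never show up in the results. Dividing by the magnitudes removed them.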

This is the operation that runs underneath most modern semantic retrieval. When you query a vector database with "show me documents like this one", what the database is doing, at the bottom of the stack, is computing cosine similarity (or some close cousin) between your query vector and the stored document vectors, then returning the top few. Brute force scores every stored vector; indexed systems score only a pruned set of candidates. Either way, the maths is exactly this formula, executed many millions of times.

FIG 12.4. Cosine similarity is a single number between −1 and +1 that captures only directional alignment. Aligned arrows score +1, perpendicular arrows score 0, anti-aligned arrows score −1. The arrows' lengths play no part. This is the metric most semantic retrieval systems use, because magnitude in learned embedding spaces often carries information you'd rather ignore.

Neighbourhoods and clusters

Once you have a distance (or a similarity), you have neighbourhoods. The k nearest neighbours of a point are the k other points with the smallest distance to it. The ε-neighbourhood of a point is everything within distance ε of it. Both are cut-offs on the same underlying measure: one by rank, one by radius.
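
Both kinds of neighbourhood can be sketched in a few lines. The 2D bench positions below are invented for illustration, and the function names k_nearest and epsilon_neighbourhood are hypothetical.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def k_nearest(query, points, k):
    """Names of the k points with the smallest distance to the query."""
    ranked = sorted(points, key=lambda name: euclidean_distance(query, points[name]))
    return ranked[:k]

def epsilon_neighbourhood(query, points, eps):
    """Names of every point within distance eps of the query."""
    return [name for name, vec in points.items()
            if euclidean_distance(query, vec) <= eps]

# Hypothetical positions of tools on a 2D bench.
bench = {
    "driver": [1.0, 1.0],
    "pliers": [1.5, 1.2],
    "calipers": [4.0, 4.0],
    "scope": [8.0, 1.0],
}
query = [1.2, 1.0]

print(k_nearest(query, bench, 2))                # ['driver', 'pliers']
print(epsilon_neighbourhood(query, bench, 0.5))  # ['driver', 'pliers']
print(epsilon_neighbourhood(query, bench, 5.0))  # ['driver', 'pliers', 'calipers']
```

k fixes how many neighbours you get back. ε fixes how far you are willing to look, and the count falls out of the data.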

A cluster is a region of space with high local density of points. You don't need to write a fancy definition. A cluster is a place on the map where many things landed near each other and few things landed in the surrounding area. K-means, DBSCAN, hierarchical clustering: they all formalise this in different ways, but they're all looking for the same thing, which is groups of points that sit closer to each other than to everything else.

The interesting question is why clusters appear in learned embedding spaces at all. The answer comes back to the training objective. When the loss rewards "similar items have similar representations", gradient descent pulls similar items toward each other in the geometry. Over millions of updates, that pulling produces dense regions where related items pile up. The clusters aren't designed in; they're a side effect of the gradient signal acting on a continuous space.

This is why clustering on a well-trained embedding space often picks up semantically meaningful groups. The categories show up in the geometry because the training process was, indirectly, a clustering pressure all along.
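
To make "formalise this" concrete, here is a minimal sketch of k-means in plain Python. It assumes the caller supplies the starting centroids, which keeps the example deterministic; real implementations usually pick them more carefully. The points are invented for illustration.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def mean(vectors):
    """Componentwise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def k_means(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda c: euclidean_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: assignments match the current centroids
        centroids = new_centroids
    return assignment, centroids

# Two hypothetical dense regions on a 2D map.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
assignment, centroids = k_means(points, centroids=[[0.0, 0.0], [10.0, 10.0]])

print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees a label. It recovers the two groups purely from which points sit near which.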

FIG 12.5. A learned embedding space organises related items into dense regions. Within-cluster distance is small; cross-cluster distance is large; and even between clusters, semantic relatedness shows up in geometric closeness (pets and wild animals sit closer to each other than either does to vehicles). The structure is a fingerprint of the training objective.

checkpoint · spot the pattern In FIG 12.5, "pets" and "wild animals" sit closer to each other than either does to "vehicles". The model wasn't told "animals are a category that contains both". Why does that structure appear anyway?

Because the contexts that "dog", "cat", "lion", and "tiger" appear in overlap. They all show up in sentences about feeding, sleeping, hunting, fur, four-legged, mammal. The next-token objective rewards representations that handle those contexts well, and the cheapest way to handle overlapping contexts is to have overlapping representations. So all four end up in a broader "animal" neighbourhood, with pets closer to each other inside that neighbourhood and wild animals closer to each other in their own sub-region. The hierarchy emerged from the training distribution. Nobody designed it in.

Vector databases and semantic retrieval

Now the industry vocabulary stops being mysterious.

A vector database stores a large collection of embedding vectors (documents, images, products, users), along with the original items those vectors came from. Its job is to answer one question quickly: "given a query vector q, which stored vectors are nearest to it?"

That's it. The whole product category is built around making one geometric operation fast at scale. Distance (Euclidean or cosine-based) is the relevance signal. "Nearest" is the relevance ranking. There's no symbolic understanding inside the database, no rules engine, no ontology. The database is doing geometry.

Brute force works for small corpora: compute the distance from q to every stored vector, sort, return the top K. Its cost grows linearly with both corpus size and dimension, so past a few hundred thousand high-dimensional vectors it is often too slow for interactive queries. Real systems therefore use approximate nearest-neighbour indexes (HNSW, IVF) and ANN libraries such as FAISS and ScaNN, which trade a small amount of recall for query latency that can be orders of magnitude lower. The trade is mechanical: the index pre-organises the geometry so that most of the space can be ruled out cheaply.
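
The brute-force path is short enough to write out. This is a sketch only: the document ids and 3-dim vectors are hypothetical stand-ins for real embeddings, and top_k is an illustrative name, not any database's API.

```python
import math

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k):
    """Brute force: score every stored vector, sort by similarity, keep the best k."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical stored vectors. A real system would get these from an embedding model.
corpus = {
    "mcu-sleep-modes": [0.9, 0.1, 0.0],
    "low-power-idle": [0.8, 0.3, 0.1],
    "pasta-recipes": [0.0, 0.2, 0.9],
    "bike-repair": [0.1, 0.9, 0.2],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, corpus, 2)
print([doc_id for doc_id, _ in results])  # ['mcu-sleep-modes', 'low-power-idle']
```

Every query touches every stored vector. An approximate index exists to avoid exactly that loop.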

FIG 12.6. Vector-database retrieval. Left: a query vector and the top-K nearest documents inside its neighbourhood, surrounded by a sea of less relevant items. Right: the standard pipeline. The "magic" lives entirely inside the embedding model that produced the vectors in step 1. Everything downstream is geometry.

One more move: cross-modal embeddings

The same geometric machinery extends past text. Modern multimodal models (CLIP, SigLIP, and their relatives) train text encoders and image encoders together so that an image of a dog and the caption "a photograph of a dog" land near each other in a shared embedding space. Same space, two encoders feeding it.

Once the geometry is shared, the operations you already know start crossing modalities. Cosine similarity between an image vector and a text vector measures how well the caption describes the image. Nearest-neighbour search over a corpus of image vectors, using a text query, becomes text-to-image retrieval. Clustering over the joint space finds visual-and-textual themes simultaneously.

The mechanism is the same one this lesson opened with: training shapes a space so that geometric relationships mean what you want them to mean. The only new ingredient is that the training loss now ties two encoders to the same space at once.
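
Here is a sketch of what crossing modalities looks like once the vectors exist. It does not call any real model: the 3-dim vectors are hypothetical stand-ins for what an image encoder and a text encoder would output into the shared space, and zero_shot_classify is an illustrative name.

```python
import math

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Return the text label whose vector is most aligned with the image vector."""
    return max(label_vecs,
               key=lambda label: cosine_similarity(image_vec, label_vecs[label]))

# Hypothetical text-encoder outputs for three captions.
label_vecs = {
    "a photo of a dog": [0.8, 0.5, 0.0],
    "a photo of a cat": [0.5, 0.8, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical image-encoder output for one picture.
image_vec = [0.7, 0.6, 0.1]

print(zero_shot_classify(image_vec, label_vecs))  # a photo of a dog
```

The function has no idea one input came from pixels and the others from text. It only sees points in the same space.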

Geometry as the substrate of generalisation

L11 ended with a half-formed version of "geometry enables generalisation". This lesson sharpens it.

A learned model is, roughly, a continuous function from input space through a representation space to an output. Continuity means that small changes in the representation produce small changes in the output. The training process builds the representation space so that task-relevant similarity shows up as small distance. Put those two together and you get the operational claim: if the model works on input x, and x' is close to x in the representation, the model probably also works on x'.

That's why dense, well-organised regions of an embedding space are where a model is reliable, and sparse or oddly-shaped regions are where it's brittle. Interpolation works because the geometry is dense and smooth. Extrapolation fails because the geometry isn't telling you anything about the new region.

This is the moment to look back at the calibration scenario where a semantic-search system worked on common queries and failed on rare ones (C1.12). The mechanism for that failure now reads as one sentence: the geometry was well-formed in the dense region of the training distribution and unconstrained in the sparse region, so distances in the sparse region stopped meaning anything.

mechanism · the core insight, sharpened Meaning, in a modern AI system, is geometry. The model doesn't represent "dog is similar to cat" with a rule or a symbol. It represents it as ‖v(dog) − v(cat)‖ < ‖v(dog) − v(car)‖. Every downstream operation that depends on similarity (retrieval, ranking, clustering, generalisation, attention) is a geometric operation on this space. The training objective shaped the space. The model navigates it.

Compute spectrum: geometry runs everywhere

The same geometric operations show up at every scale. Only the implementation moves.

Microcontroller. A wake-word detector compares a 128-dim audio embedding to a handful of prototypes via int8 dot products. Nearest neighbour, no index.

Mobile / edge. On-device semantic search over 10k notes. Cosine similarity in fp16, brute force or small HNSW index.

Workstation. Internal docs RAG over 1M chunks. fp16 cosine, HNSW or IVF index, sub-100ms latency per query.

Hyperscale. Billions of vectors, sharded ANN, GPU-accelerated indexes, multi-tenant. Same maths; the wall is index memory and tail latency.

What changes across tiers is dimensionality (sometimes), precision (fp32 → int8), index strategy (brute force → HNSW → IVF-PQ), and which constraint dominates (memory on edge, tail latency at scale). The operation is the same: compute a similarity, find the nearest, return the top.

When geometry breaks: the small caveats

The geometric story is powerful enough that it's worth naming the places it doesn't hold.

First, sparse regions of training. The embedding space is well-organised where training data was dense. In sparse regions, the geometry is essentially unconstrained, so distances don't track meaning. This is the rare-query failure from C1.12.

Second, distribution shift. If your query distribution looks different from your training distribution, query vectors may land in regions of the space the model doesn't handle reliably, even if those regions are dense for the corpus.

Third, embedding collapse. A poorly trained contrastive model can collapse most inputs to a single region, making distances near-uniform and similarity meaningless. If almost every pair of inputs, related or not, scores a cosine similarity close to 1, that is a strong signal of collapse.
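
One rough diagnostic is the average cosine similarity over all pairs in a sample. The sketch below uses invented vectors, and mean_pairwise_cosine is a hypothetical helper. What counts as "too high" depends on the model and the data, so treat any threshold as an assumption to calibrate.

```python
import itertools
import math

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical sample from a space that spreads its inputs out.
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# Hypothetical sample from a collapsed space: everything points the same way.
collapsed = [[1.0, 1.0, 1.0], [1.01, 0.99, 1.0], [0.98, 1.0, 1.02], [1.0, 1.01, 0.99]]

print(mean_pairwise_cosine(healthy))    # roughly 0.24
print(mean_pairwise_cosine(collapsed))  # very close to 1
```

In the collapsed sample, ranking by similarity is ranking by rounding noise.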

Fourth, high-dimensional weirdness. In very high-dimensional spaces, random vectors are nearly orthogonal, distance distributions concentrate, and the difference between "near" and "far" can compress. Practical embedding systems work because the learned geometry is far from random; it has structure. But anytime that structure is weak, the high-dim weirdness shows up.
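
The near-orthogonality of random vectors is easy to see for yourself. This sketch draws random Gaussian vectors with a fixed seed and compares the average absolute cosine similarity in 2 dimensions against 1000; the sample size of 200 pairs is an arbitrary choice.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine| between independently drawn random vectors."""
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine_similarity(a, b))
    return total / pairs

rng = random.Random(0)  # fixed seed so the run is repeatable
low_dim = mean_abs_cosine(2, 200, rng)
high_dim = mean_abs_cosine(1000, 200, rng)

print(low_dim)   # typically around 0.6
print(high_dim)  # typically around 0.03
```

In 2D, random arrows are often substantially aligned. In 1000 dimensions they almost never are, which is why a cosine of 0.8 between two learned embeddings is evidence of structure and not chance.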

None of these undermine the geometric worldview. They're the operating envelope: the geometry is meaningful inside the regions the training process actually shaped, and unreliable outside them. Reading a system means knowing which region you're in.

Compression of the lesson

compression · what to carry forward

Distance between vectors is the magnitude of their difference. Same Pythagoras, more dimensions.
Two flavours of "close": small distance (tip-to-tip) and small angle (direction-only).
Cosine similarity captures the angle, ignores magnitude, runs from −1 to +1.
Neighbourhoods and clusters are emergent properties of the geometry the training objective shaped.
Vector databases are geometric search engines. Semantic search and RAG are nearest-neighbour queries with extra steps.
Meaning, in a modern AI system, is geometry. Retrieval, ranking, generalisation, clustering, and attention are all geometric operations on a learned space.

What you should be able to do now

Compute the Euclidean distance between two small vectors by hand, and explain what it measures.
Explain why two vectors can have very different magnitudes but a cosine similarity of 1.
Read a 2D scatter of an embedding space and identify clusters, neighbourhoods, and likely hierarchical structure.
Describe, in mechanism, why a vector database returns "related" items rather than exact matches.
Explain why RAG is, at bottom, a geometric search followed by an LLM call.
Name three places the geometric story breaks down, and what each one looks like operationally.
Restate "geometry enables generalisation" as a precise claim about distances and continuity in a learned representation space.

Retrieval practice

Write before you reveal. Trace the geometric mechanism; don't summarise definitions.

L12 Explain in your own words why semantic search returns "related" documents instead of exact-keyword matches. Trace the mechanism from query through to result.

The query gets converted into a vector by the same embedding model that produced the document vectors. Because the embedding model was trained so that semantically related text lands near related text in the embedding space, the query vector falls into a region of the space populated by documents about similar topics, regardless of whether they share keywords. A nearest-neighbour search (brute force or ANN-indexed) returns the documents whose vectors are closest to the query vector. Distance here is the relevance signal. "Related" emerges automatically because the geometry was shaped by the training loss to make semantic similarity correspond to geometric closeness. The query "how does an MCU sleep" matches a document about "low-power microcontroller idle modes" even with zero keyword overlap, because both phrases live in the same neighbourhood of the embedding space.

L12 Why is cosine similarity preferred over Euclidean distance for most text-embedding retrieval, despite Euclidean being simpler? Reference the practical behaviour of magnitudes in learned embedding spaces.

In many learned text-embedding spaces, the magnitude of a vector tracks things you'd rather ignore: how frequently the term appears, how confident the model is, the length of the document, the norm bias accumulated during training. Meaning lives in direction, not in length. Euclidean distance is sensitive to magnitude, so two arrows pointing the same way but with different lengths would be flagged as different. Cosine similarity strips the magnitude out entirely (the formula divides by both norms) and measures only directional alignment, which is what you usually want. There's also a practical numerical reason: cosine similarity is bounded in [−1, +1], which makes thresholding and ranking cleaner than unbounded distance values. Some systems normalise all vectors to unit length at index time. For unit vectors, ‖a − b‖² = 2 − 2·cosine(a, b), so Euclidean distance and cosine similarity produce the same ranking, and the system can use the simpler distance metric. The choice is mechanical: pick the metric whose invariances match what you want to ignore.
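
The unit-length point can be verified directly. For unit vectors, ‖a − b‖² = 2 − 2·cosine(a, b), so sorting by distance and sorting by cosine give the same order. The sketch below checks both claims on invented vectors; all names are illustrative.

```python
import math

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def normalise(a):
    """Rescale a vector to length 1 without changing its direction."""
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    return sum(ai * bi for ai, bi in zip(a, b)) / (norm(a) * norm(b))

def squared_distance(a, b):
    return sum((bi - ai) ** 2 for ai, bi in zip(a, b))

# Hypothetical query and documents, normalised at "index time".
query = normalise([3.0, 1.0, 0.5])
docs = {
    "a": normalise([6.0, 2.0, 1.0]),
    "b": normalise([1.0, 3.0, 0.0]),
    "c": normalise([-1.0, 0.5, 2.0]),
}

by_cosine = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
by_distance = sorted(docs, key=lambda d: squared_distance(query, docs[d]))

print(by_cosine)    # ['a', 'b', 'c']
print(by_distance)  # ['a', 'b', 'c']
```

Highest cosine and smallest distance pick the same winner once length has been taken out of the picture.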

L12 A team builds a clustering system on top of a pretrained sentence-embedding model. The clusters look semantically meaningful on common topics but messy on specialised technical jargon. Explain mechanistically, using only L11 + L12 vocabulary.

Clusters in an embedding space are dense regions where similar items piled up during training. Their shape and meaningfulness depend on how much training data covered that part of the space. For common topics (general English, ordinary domains), the training corpus was dense; the embedding model spent many gradient steps shaping that region, so distances there carry semantic meaning and clusters look clean. For specialised technical jargon, the training corpus was sparse; the embedding model has few examples to anchor those vectors, so they end up in under-constrained regions of the space. Distances between them are essentially noise (they're closer to "random vectors are nearly orthogonal in high dimensions" than to "semantically related items are close"). Clustering algorithms then group these vectors based on noise rather than meaning, producing the messy output the team observed. The fix is domain-adaptation: fine-tune the embedding model on in-domain text so the relevant region of the space gets shaped by gradient descent, or use a domain-specific embedding model that was pretrained on technical material.

L12 A multimodal model embeds images and text into the same 1024-dim space. Describe three concrete things you can now do with a single operation (nearest-neighbour search) that you couldn't do before. Name the operation in each case.

(1) Text-to-image retrieval. Embed the text query into the shared space. Run nearest-neighbour search against the image embeddings. Return the images whose vectors are closest to the query vector. The operation is exactly the same as text-to-text retrieval; the only change is that the items at the other end of the index happen to be images.
(2) Image-to-text retrieval / captioning by retrieval. Embed an image. Run nearest-neighbour search against a corpus of caption embeddings. Return the captions closest to the image vector. Useful for tagging, labelling, or alt-text generation.
(3) Zero-shot classification. Embed the image. Embed each class name as text (e.g. "a photo of a cat", "a photo of a dog", "a photo of a car"). Run nearest-neighbour over the text embeddings, return the closest class. No classifier was trained for this task; the geometry of the shared space alone makes the operation work.
All three reduce to the same primitive: cosine similarity between the query vector and a set of candidate vectors. The model isn't doing different things across modalities; the geometry is doing the work, and the modalities are just two ways of producing points in the same space.

L12 The sentence "meaning has become geometry" is doing a lot of work in this lesson. Restate it in three different ways: (a) as a claim about how the model stores semantic content, (b) as a claim about how downstream operations work, (c) as a claim about why generalisation happens.

(a) Storage. The model does not store the fact that "dog is similar to cat" in a symbolic table, rule, or definition. It stores it as a relative position: the embedding vectors for "dog" and "cat" are placed close together in the learned space, while the embedding for "car" is placed further away. The semantic content is the geometry. Change the geometry and you change what the model "knows".
(b) Operations. Every downstream operation that depends on semantic relationships becomes a geometric one. Retrieval is nearest-neighbour search. Recommendation is finding items near a user's representation. Clustering is finding dense regions. Classification is finding the prototype closest to the input. Attention is a weighted average of value vectors based on geometric similarity to a query. The "semantic" operation and the "geometric" operation are the same operation.
(c) Generalisation. If two inputs land near each other in the representation space, and the model is continuous (small input changes produce small output changes), then the model's behaviour on one transfers approximately to the other. Generalisation is the claim that this nearby transfer holds across the regions of the space that training shaped well. The mechanism for why a model handles inputs it never saw is geometric: the new input lands near training inputs, and the model behaves the same way on nearby points.

Next station

You now have the apparatus for representing things as vectors and measuring their relationships geometrically. The next move on the wall is to handle the transformations that act on those vectors: rotations, scalings, projections, the matrix as a single algebraic object that does work on an input vector. That's L13.
