PHASE 1 · FOUNDATIONS OF INTELLIGENCE

09 / 78

Embeddings and learned geometry

Lesson 9. Phase 1: Foundations of intelligence. ~25 min read + cards + retrieval. Durability tier 1 (bedrock).

📐

Memory palace · Bench · station 9

The compass. Direction in vector space; geometric relationships; neighbourhood structure; navigation through learned representation spaces.

Core idea. Embeddings transform discrete symbols into continuous geometric representations where operational similarity becomes spatial proximity, allowing optimisation and generalisation to work efficiently.

Why this lesson exists

L8 left us with token IDs. Integer labels drawn from a fixed vocabulary, handed in sequence to whatever computes next.

The problem is that the model cannot compute usefully on integers. Token 1273 ("strawberry") and token 1274 ("hippopotamus") might sit next to each other in a tokeniser's vocabulary purely because of where they fell in BPE training; their IDs encode nothing about meaning. Read as numbers, the IDs imply an ordering and adjacency that mean nothing. Read as categorical labels, they carry no relationships at all, so every pair of tokens is equally unrelated. Either way, no gradient can flow that exploits operational similarity between tokens.

Embeddings close this gap. Each token ID is mapped, through a learned table, to a high-dimensional continuous vector in a learned vector space. The vector is what the rest of the model actually computes on. The geometry of the resulting space is one of the most important inventions modern AI rests on.

Token IDs are unusable

One-hot encoding is the textbook cleanup of the ID problem: a vocabulary-sized vector with a single 1 marking the active token. It removes the false adjacency of integer IDs by making every token equidistant from every other. The cost is twofold: the input dimension explodes to the vocabulary size (50K or more), and the representation still has no useful geometry, because no pair of tokens is closer than any other.
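A minimal sketch of the equidistance claim, using a toy vocabulary of 8 tokens (the size and the helper names are assumptions for illustration). It computes the Euclidean distance between every pair of one-hot vectors and collects the distinct values.

```python
import math

def one_hot(token_id, vocab_size):
    """Return a vocab-sized list with a single 1.0 at position token_id."""
    vec = [0.0] * vocab_size
    vec[token_id] = 1.0
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

VOCAB_SIZE = 8  # toy vocabulary; real ones are 50K or more
vectors = [one_hot(i, VOCAB_SIZE) for i in range(VOCAB_SIZE)]

# Collect the distance between every distinct pair of tokens.
distances = {
    round(euclidean(vectors[i], vectors[j]), 12)
    for i in range(VOCAB_SIZE)
    for j in range(i + 1, VOCAB_SIZE)
}
print(distances)  # one value only, sqrt(2): no pair is closer than any other
```

Every pair differs in exactly two coordinates, so the distance is always the square root of 2. The false adjacency is gone, and so is any structure the optimiser could use.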

Embeddings replace this with a compact continuous representation. The embedding table is vocabulary-size by embedding-dimension (50K × 4096 ≈ 200M parameters in modern systems). Each token ID is replaced by its row from the table, a dense vector of a few thousand floats. The vectors are learned jointly with the rest of the model. The optimiser shapes the geometry to make downstream tasks easy.

The embedding table

Mechanically simple: a 2D lookup. Token ID in, vector out. The table is a learned parameter of the model, updated by gradient descent along with everything else.
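A minimal sketch of the lookup, with toy sizes and a randomly initialised table (both assumptions; a real table is learned). It also shows that indexing a row gives the same result as multiplying a one-hot vector by the table, which is why the lookup can be read as a cheap shortcut for that matmul.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 6, 4  # toy sizes; the text's example is 50K x 4096

# The table: one row of EMBED_DIM floats per token ID. Randomly initialised
# here; in a real model gradient descent updates these values.
table = [[random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Embedding lookup: replace each token ID by its row of the table."""
    return [table[t] for t in token_ids]

def one_hot_matmul(token_id):
    """The same row, computed as one_hot(token_id) @ table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]
    return [sum(one_hot[i] * table[i][d] for i in range(VOCAB_SIZE))
            for d in range(EMBED_DIM)]

sequence = [3, 0, 3, 5]
vectors = embed(sequence)

# Parameter count for the sizes quoted in the text.
params = 50_000 * 4096
print(len(vectors), len(vectors[0]), params)  # 4 4 204800000
```

The repeated token 3 maps to the same row both times: the table knows nothing about position or context, only identity.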

Operationally critical: the table is hit on every forward pass, once per input token. The lookup does almost no arithmetic, so memory bandwidth from VRAM to the compute units is the dominant cost of the lookup step. Vendor compilers and inference engines spend real engineering on making embedding gathers fast.

Vector space as substrate for computation

Once tokens become vectors, the model is operating in a continuous space. Vectors can be added, scaled, compared with similarity functions, projected, transformed by matrices. Each operation has a meaningful gradient. The optimiser can move smoothly through the space, nudging vectors and weights in small directions that improve the loss.

Integer IDs allow no such motion. The continuous substrate is what makes learning tractable at scale. The same shape recurs across vision (image embeddings), audio (speech embeddings), tabular ML (categorical feature embeddings), and recommender systems (user and item embeddings).

Nearby equals similar

Train a system to predict context (next token, masked token, image-caption alignment), and the optimiser pushes vectors of tokens that appear in similar contexts toward similar points in the space. The geometry isn't programmed; it falls out of training against any objective that exploits token similarity for prediction.

The word2vec demonstration is the textbook case: train a shallow network to predict surrounding words from a target word; the resulting word embeddings cluster by usage. "Dog" and "puppy" end up near each other. "Purchase" and "buy" end up near each other. "Dog" and "asphalt" end up far apart. The clusters are a side effect of getting good at prediction in a space with the right inductive bias.

Recommendation systems do the same with user and item vectors. Train an embedding where users who liked similar items have similar vectors and items liked by similar users have similar vectors. The trained space supports fast retrieval ("show items near this user") and similarity queries ("show items near this item").

Directions encode relationships

The famous king − man + woman ≈ queen result is geometric. After training, the embedding space had a direction encoding gender and another encoding royalty; both came out as side effects of predicting nearby words. Adding and subtracting those vectors traverses the space along interpretable axes.

This works because the prediction objective rewards consistent encoding of regularities. If gender is predictively useful, the optimiser learns to encode it as a direction. If country-capital relationships are predictively useful (Paris is to France as Tokyo is to Japan), the optimiser encodes those as another direction. The space ends up with as many of these emergent axes as the training signal demanded. The directions are not exact and not always cleanly separable, but the operational consequences are real: linear classifiers trained on top of good embeddings often beat sophisticated classifiers on raw inputs.
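A sketch of the vector arithmetic behind the analogy. The vectors are hand-built, not trained: axis 0 stands for royalty, axis 1 for gender, axis 2 for an unrelated feature. Trained embeddings only approximate this structure, and the axes are not labelled.

```python
import math

# Hand-built 3-d vectors, NOT trained embeddings (an assumption for illustration).
# Axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ an unrelated feature.
vectors = {
    "king":    [0.9,  0.8, 0.1],
    "queen":   [0.9, -0.8, 0.1],
    "man":     [0.1,  0.9, 0.0],
    "woman":   [0.1, -0.9, 0.0],
    "asphalt": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" and adding "woman" moves the point along the gender axis while leaving the royalty coordinate almost untouched. Excluding the three input words from the candidates is the usual convention, since the nearest point to the target is often one of the inputs.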

Nearest neighbours and retrieval

Once embeddings exist, retrieval becomes a geometry problem. Given a query embedding, find the points in the space that are closest to it by cosine similarity, dot product, or Euclidean distance. The nearest neighbours are operationally similar to the query.
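A brute-force sketch of nearest-neighbour retrieval under the three measures named above. The 2-d document vectors are hypothetical and chosen by hand so that the measures disagree: one document has a much larger norm than the rest.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, corpus, k, score, higher_is_closer=True):
    """Exact search: score every item, return the names of the k closest."""
    ranked = sorted(corpus, key=lambda name: score(query, corpus[name]),
                    reverse=higher_is_closer)
    return ranked[:k]

# Hypothetical 2-d document embeddings, chosen by hand for illustration.
corpus = {
    "cardiology":  [0.9, 0.1],
    "cholesterol": [0.8, 0.3],
    "gardening":   [0.1, 0.9],
    "long_report": [4.0, 4.0],  # large norm, direction between the two topics
}
query = [1.0, 0.2]

print(top_k(query, corpus, 2, cosine))     # ['cardiology', 'cholesterol']
print(top_k(query, corpus, 2, dot))        # ['long_report', 'cardiology']
print(top_k(query, corpus, 2, euclidean, higher_is_closer=False))
# ['cardiology', 'cholesterol']
```

Dot product rewards vector length as well as direction, so the large-norm document jumps to the top. When all vectors are normalised to unit length, the three measures produce the same ranking, which is one reason many retrieval pipelines normalise before indexing.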

This is the core mechanism behind semantic search. Embed the documents in a corpus. Embed the query. Return the documents whose embeddings are closest. The results carry semantic similarity, not just lexical overlap. "Find me information about heart disease" returns documents about cardiovascular conditions, myocardial infarction, and cholesterol, even if those documents never used the word "heart" directly.

Exact nearest-neighbour search across a corpus of billions of vectors is computationally expensive. Approximate nearest-neighbour (ANN) algorithms (HNSW, IVF, ScaNN) trade exact accuracy for orders-of-magnitude speedup. Vector search libraries and databases (FAISS, Pinecone, Weaviate, Milvus, pgvector) productise this pattern at scale. The trade-off is recall (how often the true nearest neighbours are found) against latency and cost. RAG (retrieval-augmented generation) builds directly on this primitive; it is treated fully in L59.
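A simplified IVF-style sketch of the recall-versus-work trade-off, on synthetic Gaussian data. It is not how any particular library implements IVF: centroids are sampled from the data instead of trained with k-means, and the names (N_LISTS, nprobe, recall_at_k) are assumptions for illustration.

```python
import random

random.seed(0)
DIM, N, N_LISTS, K = 8, 400, 16, 5

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

data = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]

# "Training": sample centroids from the data (a real IVF index would run
# k-means here), then file every vector under its nearest centroid.
centroids = random.sample(data, N_LISTS)
lists = [[] for _ in range(N_LISTS)]
for idx, vec in enumerate(data):
    nearest = min(range(N_LISTS), key=lambda c: sq_dist(vec, centroids[c]))
    lists[nearest].append(idx)

def exact_search(query, k):
    return sorted(range(N), key=lambda i: sq_dist(query, data[i]))[:k]

def ivf_search(query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probed = sorted(range(N_LISTS),
                    key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, data[i]))[:k]

def recall_at_k(nprobe, queries):
    """Fraction of the true top-K neighbours that the approximate search returns."""
    hits = sum(len(set(exact_search(q, K)) & set(ivf_search(q, K, nprobe)))
               for q in queries)
    return hits / (K * len(queries))

queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]
for nprobe in (1, 4, N_LISTS):
    print(nprobe, recall_at_k(nprobe, queries))
```

Probing one list scans roughly a sixteenth of the data and misses some true neighbours; probing every list is exact search again, with recall 1.0. Serving systems pick a point between those extremes.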

Multimodal and multilingual embeddings

A shared embedding space can hold inputs from different modalities. CLIP-style training jointly embeds images and captions: pairs of (image, caption) that go together are pushed close in the shared space; pairs that don't are pushed apart. The result is a space where an image of a dog and the text "a photograph of a dog" land near each other. This enables cross-modal retrieval (search images with text queries, search text with image queries) and zero-shot classification (an unseen image is classified by which text embedding it lands near, with no class-specific training).
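A sketch of one common formulation of the matching objective: a symmetric cross-entropy over the image-text similarity matrix, where the correct caption for image i is text i. The 3-d vectors and the fixed temperature of 0.1 are assumptions for illustration; in a real system the vectors come from trained image and text encoders.

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    imgs = [normalise(v) for v in image_vecs]
    txts = [normalise(v) for v in text_vecs]
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(i_vec, t_vec)) / temperature
               for t_vec in txts] for i_vec in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Hypothetical encoder outputs for three (image, caption) pairs.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every caption mismatched

print(contrastive_loss(images, aligned) < contrastive_loss(images, shuffled))  # True
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled. Minimising it is what pushes matching pairs together and mismatched pairs apart.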

Multilingual embeddings work the same way along a different axis. Train sentence embeddings on parallel multilingual corpora and equivalent sentences across languages land near each other. "The cat sat on the mat" in English and its Spanish, Chinese, and Arabic translations cluster in one neighbourhood. This makes cross-lingual retrieval possible without translating first. The common shape: pick a prediction or matching task that rewards aligning two distributions in a shared geometry, train, and the optimiser produces aligned embeddings.

Geometry as computation

The deeper claim: most of what a modern AI system does is geometry. Distances become similarity scores. Directions become operations. Projections become attention. Clustering becomes classification.

Hardware accelerates geometry directly. Dense matmul is what tensor cores do well. Computing similarity scores across millions of embeddings is a batched matmul. Projecting embeddings into a different space is a matmul. The geometry-as-computation framing and the matmul-is-the-instruction-set framing are the same idea looked at from two sides.

Hardware and memory

The embedding table sits in VRAM, and the lookup step gathers rows from effectively random addresses. This is an irregular memory access pattern and a known bottleneck. Bandwidth, not compute, dominates.

For very large vocabularies or item catalogues (recommender systems with billions of items, multilingual embeddings with 250K-vocab tokenisers), the embedding table can become the largest single tensor in the system. Inference engines split it across GPUs, cache hot rows in SRAM, and apply quantisation aggressively. 8-bit or 4-bit embeddings, while damaging for some operations, often preserve enough geometric structure for nearest-neighbour and similarity computations. The cost-benefit math is favourable.
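A sketch of why low-bit embeddings can still serve similarity queries, using symmetric per-vector int8 quantisation on synthetic vectors. The scheme, the dimension, and the 10-million-row table are assumptions for illustration; production engines use a range of quantisation schemes.

```python
import math
import random

random.seed(0)

def quantise_int8(vec):
    """Symmetric per-vector quantisation: integer codes in [-127, 127] plus a scale."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def dequantise(codes, scale):
    return [c * scale for c in codes]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

DIM = 256
a = [random.gauss(0, 1) for _ in range(DIM)]
b = [x + random.gauss(0, 0.5) for x in a]  # a noisy neighbour of a

a_q = dequantise(*quantise_int8(a))
b_q = dequantise(*quantise_int8(b))

# Similarity before and after quantisation: the two values are nearly identical.
print(round(cosine(a, b), 4), round(cosine(a_q, b_q), 4))

# Storage for a hypothetical 10-million-row table at 256 dims, in GB
# (fp32 vs int8; the per-row scales are ignored).
rows = 10_000_000
print(rows * DIM * 4 / 1e9, rows * DIM * 1 / 1e9)  # 10.24 2.56
```

Each coordinate moves by at most half a quantisation step, which barely changes the angle between vectors. The table shrinks by a factor of four, and so does the bandwidth each gather consumes.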

ANN indexes (HNSW graphs, IVF clusters) are themselves large data structures with irregular access patterns. Inference at scale often runs these on CPU because the access pattern suits CPUs well; the embeddings themselves may live on GPU.

Failure modes

Embedding collapse. Training produces embeddings that all live in a narrow region of the space. Cosine similarity across pairs is near-uniform; the geometry has no discriminative structure. Common in poorly-tuned contrastive learning. Diagnostic: histogram pairwise similarities on a held-out set.
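A sketch of the diagnostic, on synthetic stand-ins for a held-out set: one healthy set of random vectors and one collapsed set built as a shared vector plus small noise. The data, the bin count, and the function names are assumptions for illustration.

```python
import math
import random
from itertools import combinations

random.seed(0)
DIM, N = 32, 50

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pairwise_sims(vectors):
    return [cosine(a, b) for a, b in combinations(vectors, 2)]

def histogram(sims, bins=10):
    """Count similarities in equal-width bins covering [-1, 1]."""
    counts = [0] * bins
    for s in sims:
        counts[min(int((s + 1.0) / 2.0 * bins), bins - 1)] += 1
    return counts

# Synthetic stand-ins for embeddings of a held-out set.
healthy = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
shared = [random.gauss(0, 1) for _ in range(DIM)]
collapsed = [[s + random.gauss(0, 0.05) for s in shared] for _ in range(N)]

for name, vecs in (("healthy", healthy), ("collapsed", collapsed)):
    sims = pairwise_sims(vecs)
    print(name, round(min(sims), 3), round(max(sims), 3), histogram(sims))
```

The healthy set spreads its similarities across several bins. The collapsed set piles every pair into the top bin: all items look alike, so ranking by similarity carries almost no information.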

Hubness. In high-dimensional spaces, a small subset of points become disproportionately popular nearest neighbours; they are "close" to a huge fraction of the rest. This skews retrieval, makes some items over-recommended, and is a known artefact of high-dimensional geometry that doesn't fully go away even with well-trained embeddings.
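One way to measure hubness is the k-occurrence count: how many times each point appears in the top-k neighbour lists of the other points. The sketch below computes it on synthetic Gaussian data; the sizes and names are assumptions for illustration.

```python
import random
from collections import Counter

random.seed(0)
N, DIM, K = 200, 64, 5

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

points = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]

def knn(i, k):
    """Indices of the k nearest neighbours of point i, excluding i itself."""
    others = (j for j in range(N) if j != i)
    return sorted(others, key=lambda j: sq_dist(points[i], points[j]))[:k]

# k-occurrence: how often each point shows up in someone else's top-K list.
k_occurrence = Counter(j for i in range(N) for j in knn(i, K))
counts = [k_occurrence[i] for i in range(N)]

print("mean:", sum(counts) / N)  # always K, by construction
print("max:", max(counts))       # hubs sit far above the mean
print("never retrieved:", sum(1 for c in counts if c == 0))
```

The mean is fixed at K, so the signal is in the tails: a long right tail means a few hubs dominate results, and a large count of zeros means many items are never surfaced at all.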

Semantic drift. The embedding's geometry slowly shifts during continued training or domain adaptation, and downstream systems built on top (semantic search, vector indexes) break in ways that are hard to diagnose. The fix is versioning: treat embeddings as a versioned artefact and rebuild dependent indexes when the embedding changes.
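A minimal sketch of the versioning idea: the index records which embedding model built it and refuses vectors from any other version. The class, its methods, and the version strings are hypothetical, not the API of any vector database.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedIndex:
    """Hypothetical wrapper: an index remembers which embedding model built it."""
    model_version: str
    vectors: dict = field(default_factory=dict)

    def _check(self, model_version):
        if model_version != self.model_version:
            raise ValueError(
                f"index built with {self.model_version!r}, got {model_version!r}; "
                "rebuild the index before using it")

    def add(self, doc_id, vector, model_version):
        self._check(model_version)
        self.vectors[doc_id] = vector

    def search(self, query_vector, model_version):
        """Return the best-matching doc ID (dot-product scoring keeps this short)."""
        self._check(model_version)
        return max(self.vectors, key=lambda d: sum(
            q * v for q, v in zip(query_vector, self.vectors[d])))

index = VersionedIndex(model_version="embed-v1")
index.add("doc_a", [1.0, 0.0], "embed-v1")
index.add("doc_b", [0.0, 1.0], "embed-v1")

print(index.search([0.9, 0.1], "embed-v1"))  # doc_a

try:
    index.search([0.9, 0.1], "embed-v2")  # query embedded by a newer model
except ValueError as err:
    print("rejected:", err)
```

A loud failure at query time is the point. Mixing vectors from two model versions returns plausible-looking neighbours from mismatched geometries, which is the hard-to-diagnose breakage described above.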

Poor neighbourhood structure. The embedding correctly clusters at the macro level but local neighbourhoods are noisy; the top-10 nearest neighbours include several semantically unrelated items. Often a sign that the training signal was too coarse or the embedding dimension is too small.

Three views of embedding geometry

Figure 9.1 follows the substrate. Top: token IDs become dense vectors through the embedding-table lookup. Middle: those vectors organise into clusters and directions inside a learned space, with retrieval reducing to a geometry problem. Bottom: the hardware view, where the embedding table dominates the bandwidth picture and ANN indexes plus quantisation make the geometry tractable to serve.

FIG 9.1. Three views of embeddings. Top: token IDs flow through the embedding table and emerge as dense vectors; the table is a learned parameter of the model. Middle: vectors organise into clusters in the learned space with directions that encode relationships and neighbourhoods that support retrieval; multimodal alignment lets images and text share one geometry. Bottom: the hardware view; embedding table sits in VRAM, accessed by bandwidth-bound scatter-gather, with ANN indexes and aggressive quantisation making the geometry tractable to serve at scale.

The L1 to L8 view

In L1's loop, embeddings are the form of input the system computes on after tokenisation. In L2's terms, the embedding space is a compressed encoding of token identity into something predictively useful. In L3's terms, generalisation through embeddings happens because nearby vectors share predictive structure. In L4's terms, emergent geometric axes (king-queen, country-capital) appear at scale in embedding spaces. In L5's terms, the learning signal shapes which directions the embedding space encodes. In L6's terms, RL agents represent state in embedding-like vectors and the policy operates over their geometry. In L7's terms, embeddings are the substrate where all of representation theory lives. In L8's terms, embeddings are the next layer after tokenisation, where discrete IDs become operational structure.

The takeaway

The model computes through geometry. Token IDs become vectors. Vectors live in a learned space. Similarity becomes distance. Relationships become directions. Retrieval becomes nearest-neighbour search. Classification becomes neighbourhood membership.

Every downstream component (attention, MLP, output projection, retrieval, multimodal alignment) operates on this geometric substrate. Once the substrate is right, the rest is small. When it's wrong, nothing fixes it.

The compass on the bench points where the vectors point. Direction in this space is the operational currency of modern AI.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L9 Take a recommender system serving a catalogue of 10 million products. The team proposes two designs: (A) treat each product as a categorical ID; train a model to predict purchase probability given user features and product ID; (B) learn a 256-dimensional embedding per product and train the same purchase-prediction task with the embedding as input. Compare these designs on (i) optimisation efficiency, (ii) generalisation to new and rare products, and (iii) downstream retrieval ("show similar products"). What changes if the embedding dimension is too small? Too large?

Optimisation efficiency. (A) The model must learn an entirely separate set of parameters per product ID; nothing transfers between similar products, so the optimiser needs many observations per product to extract signal. New products start with effectively random behaviour. (B) The embedding table lets the optimiser share statistical strength across similar products: a product similar to a popular one inherits useful structure from its neighbours, so each observation yields more usable signal.

Generalisation. (A) New products are out-of-vocabulary in the worst sense; the model has no way to predict for them except through user-side features. Rare products are equally hopeless. (B) The embedding is a learned representation; a new product's vector can be initialised from metadata or observed user interactions, and the geometry of the existing space provides a strong prior. Cold-start improves substantially.

Downstream retrieval. (A) "Similar products" has no native mechanism; you'd need a separate similarity model. (B) Retrieval is geometry: cosine similarity in the embedding space. The same artefact serves prediction and retrieval.

Embedding-dim trade-off. Too small: the optimiser cannot fit enough distinct directions; products collapse together; retrieval is noisy. Too large: VRAM blows up (10M × 4096 × 4 bytes ≈ 164 GB), and the model can memorise per-product behaviour without generalising; the space is sparsely populated and develops hubness artefacts. The right dimension is empirical; typically 64-512 for this catalogue size.

L9 A team builds a semantic search system over a million internal documents using a publicly available embedding model. Recall is decent for general English queries but terrible for queries about company-specific concepts (product names, internal codenames, technical jargon). Sketch the mechanistic explanations rooted in embedding geometry, then name three fixes you would consider and the trade-off for each.

Mechanistic explanations. (1) The public embedding model was trained on internet-scale text where company-specific terms appear rarely or not at all. The embedding vectors for these terms (or their constituent subwords) live in a region of the space that wasn't well-shaped by training. They cluster poorly, get pulled toward common-English neighbours, and don't carry internal semantic structure.

(2) The tokeniser fragments company-specific names into subwords that the embedding model treats as ordinary English fragments. A query for "Atlas Build System" might tokenise as ["Atlas", "Build", "System"] and the query embedding is an average of three English-word meanings, not a representation of the specific internal product.

(3) Even where vocabulary covers the terms adequately, the model has no signal about internal usage patterns. Public geometry reflects public training distribution; internal neighbourhood structure differs.

Three fixes. (i) Fine-tune the public embedding on internal corpus with a contrastive objective. Trade-off: requires labelled or pseudo-labelled pairs, and the embedding becomes coupled to the company; you'll maintain it.

(ii) Add company-specific terms to the tokeniser vocabulary and train embeddings for them on the internal corpus. Trade-off: changing the tokeniser invalidates all embeddings built with the old one, so dependent systems rebuild.

(iii) Hybrid retrieval: combine embedding-based semantic search with classical lexical search (BM25) for terms lexical methods handle well; re-rank with the embedding model. Trade-off: more pipeline complexity, but works without retraining anything and is the standard production answer for this exact failure.

↳ L10 (Forward interleave to L10, what current AI can and can't do.) Embedding-based systems (semantic search, RAG, recommendation, multimodal retrieval) are some of the most reliable parts of the modern AI stack. Without doing the L10 deep-dive, sketch why this is true (what makes embedding systems durable), and where they are still fragile in ways downstream of the geometry they operate on.

Embedding systems are durable for a few mechanistic reasons. First, the operation is geometric and well-understood: nearest-neighbour search has clear semantics and known diagnostics (collapse, hubness, drift). Second, the artefact is decoupled from the use: one embedding model serves search, classification, recommendation, and retrieval, so engineering investment amortises across many downstream systems. Third, hardware support is excellent; matmul and approximate nearest-neighbour are mature operations with industrial implementations.

Where they are fragile. (1) The geometry reflects the training distribution. Out-of-distribution queries land in poorly-shaped regions and produce unreliable similarity. (2) Embeddings have no introspectable concept of certainty. A query that lands far from any reasonable cluster will still return its "nearest" neighbours; downstream systems built on raw similarity scores can present nonsense confidently. (3) The geometry is only as good as the contrastive signal that built it. If the training objective was lexical similarity but deployment wants semantic similarity, the system fails predictably. (4) Embedding-based retrieval lacks compositional reasoning. The query "documents about cardiovascular conditions in patients under 30" might match cardiovascular documents broadly, but the embedding has no clean way to enforce the patient-age constraint.

Production semantic search systems usually combine embedding retrieval with structured filters and re-rankers. L10 takes this further: the honest map of where current AI is solid, brittle, and confidently wrong.

Next station

Lesson 10 sits at the dust cover on the bench (station 10), where the lesson turns from how systems work to what they actually can and cannot do, drawing the honest perimeter of current AI capability with the systems-level vocabulary the phase has built.
