L8 left us with token IDs: integer labels drawn from a fixed vocabulary, handed in sequence to whatever computes next.
The problem is that the model cannot compute usefully on integers. Token 1273 ("strawberry") and token 1274 ("hippopotamus") might sit next to each other in a tokeniser's vocabulary purely because of where they fell in BPE training; their IDs encode nothing about meaning. Read as numbers, the IDs imply an ordering and an adjacency that mean nothing. Read as categorical labels, they carry no relationships at all. Either way there is no structure for a gradient to exploit: nothing in an ID says that two tokens behave alike.
Embeddings close this gap. Each token ID is mapped, through a learned table, to a high-dimensional continuous vector in a learned vector space. The vector is what the rest of the model actually computes on. The geometry of the resulting space is one of the most important inventions modern AI rests on.
One-hot encoding is the textbook cleanup of the ID problem: a vocabulary-sized vector with a single 1 marking the active token. It removes the false adjacency of integer IDs by making every token equidistant from every other. The cost is twofold. The input dimension explodes to the vocabulary size (50K or more), and the representation still has no useful geometry: "dog" is exactly as far from "puppy" as it is from "asphalt".
Embeddings replace this with a compact continuous representation. The embedding table is vocabulary-size by embedding-dimension (50K × 4096 ≈ 200M parameters in modern systems). Each token ID is replaced by its row from the table, a dense vector of a few thousand floats. The vectors are learned jointly with the rest of the model. The optimiser shapes the geometry to make downstream tasks easy.
Mechanically simple: a row lookup in a 2D table. Token ID in, vector out. The table is a learned parameter of the model, updated by gradient descent along with everything else.
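The lookup can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular framework's implementation: the toy sizes, the random table, and the token ID sequence are all assumptions made for the example. It also shows that indexing a row gives the same result as multiplying a one-hot vector by the table, which is why the lookup is the cheap form of the same operation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; a real table might be 50K x 4096.
vocab_size, embed_dim = 10, 4
table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

token_ids = np.array([3, 7, 3, 1])   # hypothetical token ID sequence

# Lookup: integer indexing gathers one table row per token ID.
vectors = table[token_ids]           # shape (4, embed_dim)

# The same result via one-hot encoding followed by a matmul.
one_hot = np.eye(vocab_size, dtype=np.float32)[token_ids]   # shape (4, vocab_size)
vectors_via_matmul = one_hot @ table

print(vectors.shape)
print(np.allclose(vectors, vectors_via_matmul))
```

The gather reads four rows; the one-hot route multiplies by a matrix that is almost entirely zeros. Both give the same vectors, and the repeated token ID 3 gets the same row both times.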
Operationally critical: the table is read on every forward pass, once for every input token, which makes it one of the most frequently accessed data structures during inference. The lookup does almost no arithmetic, so memory bandwidth from VRAM to the compute units is the dominant cost of the step. Vendor compilers and inference engines spend real engineering on making embedding gathers fast.
Once tokens become vectors, the model is operating in a continuous space. Vectors can be added, scaled, compared with similarity functions, projected, transformed by matrices. Each operation has a meaningful gradient. The optimiser can move smoothly through the space, nudging vectors and weights in small directions that improve the loss.
Integer IDs allow no such motion. The continuous substrate is what makes learning tractable at scale. The same shape recurs across vision (image embeddings), audio (speech embeddings), tabular ML (categorical feature embeddings), and recommender systems (user and item embeddings).
Train a system to predict context (next token, masked token, image-caption alignment), and the optimiser pushes vectors of tokens that appear in similar contexts toward similar points in the space. The geometry isn't programmed; it falls out of training against any objective that exploits token similarity for prediction.
The word2vec demonstration is the textbook case: train a shallow network to predict surrounding words from a target word; the resulting word embeddings cluster by usage. "Dog" and "puppy" end up near each other. "Purchase" and "buy" end up near each other. "Dog" and "asphalt" end up far apart. The clusters are a side effect of getting good at prediction in a space with the right inductive bias.
Recommendation systems do the same with user and item vectors. Train an embedding where users who liked similar items have similar vectors and items liked by similar users have similar vectors. The trained space supports fast retrieval ("show items near this user") and similarity queries ("show items near this item").
The famous king − man + woman ≈ queen result is geometric. After training, the embedding space has a direction that encodes gender and another that encodes royalty; both come out as side effects of predicting nearby words. Adding and subtracting those vectors traverses the space along interpretable axes.
This works because the prediction objective rewards consistent encoding of regularities. If gender is predictively useful, the optimiser learns to encode it as a direction. If country-capital relationships are predictively useful (Paris is to France as Tokyo is to Japan), the optimiser encodes those as another direction. The space ends up with as many of these emergent axes as the training signal demanded. The directions are not exact and not always cleanly separable, but the operational consequences are real: linear classifiers trained on top of good embeddings often beat sophisticated classifiers on raw inputs.
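The arithmetic behind the analogy can be sketched with hand-built vectors. Everything here is an assumption for illustration: the five words, the three axes (royalty, gender, animal), and the values are written by hand, not taken from a trained model, so the clean result overstates how tidy real embeddings are. Excluding the three input words from the candidates is a common convention in analogy evaluation and is applied here too.

```python
import numpy as np

# Hand-built toy embeddings, NOT trained vectors.
# Assumed axes: [royalty, gender, animal].
emb = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "dog":   np.array([0.0,  0.2, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word nearest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman")
print(result)   # queen
```

Subtracting "man" removes the gender component, adding "woman" puts the opposite one back, and the royalty component is untouched, so the target lands on "queen".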
Once embeddings exist, retrieval becomes a geometry problem. Given a query embedding, find the points in the space that are closest to it by cosine similarity, dot product, or Euclidean distance. The nearest neighbours are operationally similar to the query.
This is the core mechanism behind semantic search. Embed the documents in a corpus. Embed the query. Return the documents whose embeddings are closest. The results carry semantic similarity, not just lexical overlap. "Find me information about heart disease" returns documents about cardiovascular conditions, myocardial infarction, and cholesterol, even if those documents never used the word "heart" directly.
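A brute-force version of this retrieval loop fits in a short sketch. The document titles and their 3-dimensional vectors below are made up for illustration; a real system would obtain the vectors from an embedding model, which is outside the scope of this example. After normalising to unit length, cosine similarity is a dot product, so one matrix-vector product scores the whole corpus.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k documents most similar to the query."""
    scores = normalize(doc_vecs) @ normalize(query_vec)   # one matmul scores every document
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Hypothetical document embeddings (assumed values, not model outputs).
docs = ["myocardial infarction", "cholesterol and arteries", "asphalt paving"]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.7, 0.3, 0.1],
                     [0.0, 0.1, 0.9]])
query_vec = np.array([1.0, 0.2, 0.0])   # stands in for "heart disease"

idx, scores = top_k(query_vec, doc_vecs)
hits = [docs[i] for i in idx]
print(hits)
```

Neither returned title contains the word "heart"; they rank first only because their vectors point in nearly the same direction as the query's.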
Exact nearest-neighbour search across a corpus of billions of vectors is computationally expensive. Approximate nearest-neighbour (ANN) algorithms (HNSW, IVF, ScaNN) trade exactness for orders-of-magnitude speedup. Vector search libraries and databases (FAISS, Pinecone, Weaviate, Milvus, pgvector) productise this pattern at scale. The trade-off is recall (how often the true nearest neighbour is found) against latency and cost. RAG (retrieval-augmented generation) builds directly on this primitive; it is treated fully in L59.
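The recall-versus-work trade-off can be seen in a toy inverted-file (IVF) index. This is a simplified sketch under stated assumptions, not how FAISS or any other library implements IVF: the data is random, the centroids are random data points standing in for k-means, and all sizes and the names `ivf_search`, `recall_at_1`, and `nprobe` are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_lists = 2000, 16, 20
data = rng.normal(size=(n, d))
queries = rng.normal(size=(50, d))

def sq_dists(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

# Build: pick centroids (random data points here, a stand-in for k-means)
# and file every vector under its nearest centroid.
centroids = data[rng.choice(n, size=n_lists, replace=False)]
assignment = sq_dists(data, centroids).argmin(axis=1)
lists = [np.flatnonzero(assignment == c) for c in range(n_lists)]

def ivf_search(q, nprobe):
    """Scan only the nprobe lists whose centroids are closest to q."""
    probe = np.argsort(sq_dists(q[None, :], centroids)[0])[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    best = cand[sq_dists(q[None, :], data[cand])[0].argmin()]
    return best, len(cand)

exact = sq_dists(queries, data).argmin(axis=1)   # brute-force ground truth

def recall_at_1(nprobe):
    """Fraction of queries whose true nearest neighbour the index returns."""
    found = np.array([ivf_search(q, nprobe)[0] for q in queries])
    return float(np.mean(found == exact))

for nprobe in (1, 5, n_lists):
    scanned = np.mean([ivf_search(q, nprobe)[1] for q in queries])
    print(f"nprobe={nprobe:2d}  recall@1={recall_at_1(nprobe):.2f}  vectors scanned={scanned:.0f}")
```

Probing every list scans the whole corpus and is exact by construction. Probing fewer lists scans fewer vectors and can miss the true neighbour when it sits in a list that was not probed.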
A shared embedding space can hold inputs from different modalities. CLIP-style training jointly embeds images and captions: pairs of (image, caption) that go together are pushed close in the shared space; pairs that don't are pushed apart. The result is a space where an image of a dog and the text "a photograph of a dog" land near each other. This enables cross-modal retrieval (search images with text queries, search text with image queries) and zero-shot classification (an unseen image is classified by which text embedding it lands near, with no class-specific training).
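The push-together, push-apart objective and the zero-shot step can both be sketched in NumPy. This is an illustrative sketch of a symmetric contrastive loss of the kind CLIP-style training uses, not a reproduction of any released training code. The 4-dimensional vectors are hand-written stand-ins for encoder outputs, and the temperature value is an assumed hyperparameter.

```python
import numpy as np

def unit_rows(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    Row i of each input is assumed to be a matching (image, caption) pair."""
    logits = unit_rows(image_vecs) @ unit_rows(text_vecs).T / temperature
    diag = np.arange(len(logits))
    image_to_text = -log_softmax(logits)[diag, diag].mean()
    text_to_image = -log_softmax(logits.T)[diag, diag].mean()
    return float((image_to_text + text_to_image) / 2)

def zero_shot_classify(image_vec, label_vecs):
    """Pick the label whose text embedding is most similar to the image embedding."""
    scores = unit_rows(label_vecs) @ (image_vec / np.linalg.norm(image_vec))
    return int(np.argmax(scores))

# Hypothetical embeddings for three (image, caption) pairs.
images = np.array([[1.0, 0.1, 0.0, 0.0],
                   [0.0, 1.0, 0.1, 0.0],
                   [0.0, 0.0, 1.0, 0.1]])
captions = np.array([[0.9, 0.0, 0.1, 0.0],
                     [0.1, 0.9, 0.0, 0.0],
                     [0.0, 0.1, 0.9, 0.0]])

aligned = contrastive_loss(images, captions)
shuffled = contrastive_loss(images, np.roll(captions, 1, axis=0))
predicted = zero_shot_classify(images[1], captions)
print(aligned, shuffled, predicted)
```

The loss is low when each image sits nearest its own caption and high when the pairing is shuffled, which is the signal that drives the two encoders toward a shared geometry. Zero-shot classification then reuses the same similarity scores with label text in place of captions.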
Multilingual embeddings work the same way along a different axis. Train sentence embeddings on parallel multilingual corpora and equivalent sentences across languages land near each other. "The cat sat on the mat" in English and its Spanish, Chinese, and Arabic translations cluster in one neighbourhood. This makes cross-lingual retrieval possible without translating first. The common shape: pick a prediction or matching task that rewards aligning two distributions in a shared geometry, train, and the optimiser produces aligned embeddings.
The deeper claim: most of what a modern AI system does is geometry. Distances become similarity scores. Directions become operations. Projections become attention. Clustering becomes classification.
Hardware accelerates geometry directly. Dense matmul is what tensor cores do well. Computing similarity scores across millions of embeddings is a batched matmul. Projecting embeddings into a different space is a matmul. The geometry-as-computation framing and the matmul-is-the-instruction-set framing are the same idea looked at from two sides.
The embedding table sits in VRAM, and the lookup step reads its rows at data-dependent, effectively random offsets (a gather). This irregular memory access pattern is a known bottleneck: bandwidth, not compute, dominates.
For very large vocabularies or item catalogues (recommender systems with billions of items, multilingual embeddings with 250K-vocab tokenisers), the embedding table can become the largest single tensor in the system. Inference engines split it across GPUs, cache hot rows in SRAM, and apply quantisation aggressively. 8-bit or 4-bit embeddings, while damaging for some operations, often preserve enough geometric structure for nearest-neighbour and similarity computations. The cost-benefit math is favourable.
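The claim that low-precision embeddings keep most of their geometry can be checked on a small example. The sketch below uses symmetric per-row int8 quantisation, which is one common scheme among several and is an assumption here, as are the random table and its size. It compares cosine similarities before and after a quantise and dequantise round trip.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(x):
    """Symmetric per-row quantisation: int8 codes plus one float scale per row."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0   # assumes no all-zero rows
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize(codes, scale):
    """Recover approximate float vectors from codes and scales."""
    return codes.astype(np.float32) * scale

def cosine_matrix(x):
    """Cosine similarity between every pair of rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

# Random stand-in for an embedding table (assumed, not from a trained model).
table = rng.normal(size=(100, 64)).astype(np.float32)
codes, scale = quantize_int8(table)
restored = dequantize(codes, scale)

max_sim_error = float(np.abs(cosine_matrix(table) - cosine_matrix(restored)).max())
bytes_fp32 = table.nbytes
bytes_int8 = codes.nbytes + scale.nbytes
print(f"max cosine error: {max_sim_error:.4f}")
print(f"storage: {bytes_fp32} bytes -> {bytes_int8} bytes")
```

Storage drops to roughly a quarter of the float32 size, while pairwise cosine similarities move only slightly. Real embedding tables are not Gaussian noise, so the error on a trained table should be measured rather than assumed.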
ANN indexes (HNSW graphs, IVF clusters) are themselves large data structures with irregular access patterns. Inference at scale often runs these on CPU because the access pattern suits CPUs well; the embeddings themselves may live on GPU.
Embedding collapse. Training produces embeddings that all live in a narrow region of the space. Cosine similarity across pairs is near-uniform; the geometry has no discriminative structure. Common in poorly-tuned contrastive learning. Diagnostic: histogram pairwise similarities on a held-out set.
Hubness. In high-dimensional spaces, a small subset of points becomes disproportionately popular as nearest neighbours: these hubs are "close" to a huge fraction of the rest. This skews retrieval and makes some items over-recommended. It is a known artefact of high-dimensional geometry that doesn't fully go away even with well-trained embeddings.
Semantic drift. The embedding's geometry slowly shifts during continued training or domain adaptation, and downstream systems built on top (semantic search, vector indexes) break in ways that are hard to diagnose. The fix is versioning: treat embeddings as a versioned artifact and rebuild dependent indexes when the embedding changes.
Poor neighbourhood structure. The embedding correctly clusters at the macro level but local neighbourhoods are noisy; the top-10 nearest neighbours include several semantically unrelated items. Often a sign that the training signal was too coarse or the embedding dimension is too small.
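Two of the failure modes above have simple numerical checks. The sketch below computes the pairwise-similarity histogram suggested as the collapse diagnostic, and a k-occurrence count (how often each point appears in other points' top-k lists) as one way to look for hubs. The "healthy" and "collapsed" arrays are synthetic stand-ins, and the function names and the choice of k = 10 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_cosine(x):
    """Cosine similarity for every unordered pair of rows."""
    sims = unit_rows(x) @ unit_rows(x).T
    upper = np.triu_indices(len(x), k=1)      # skip self-pairs and duplicates
    return np.clip(sims[upper], -1.0, 1.0)    # guard against float round-off

def k_occurrence(x, k=10):
    """How often each row appears in the other rows' top-k neighbour lists."""
    sims = unit_rows(x) @ unit_rows(x).T
    np.fill_diagonal(sims, -np.inf)           # a point is not its own neighbour
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return np.bincount(top_k.ravel(), minlength=len(x))

# Synthetic stand-ins for a held-out set (assumptions, not real model outputs).
healthy = rng.normal(size=(200, 32))
collapsed = 1.0 + 0.01 * rng.normal(size=(200, 32))   # all rows point almost the same way

healthy_sims = pairwise_cosine(healthy)
collapsed_sims = pairwise_cosine(collapsed)
hist, edges = np.histogram(collapsed_sims, bins=20, range=(-1.0, 1.0))

counts = k_occurrence(healthy, k=10)
print("healthy   mean/std:", healthy_sims.mean(), healthy_sims.std())
print("collapsed mean/std:", collapsed_sims.mean(), collapsed_sims.std())
print("largest k-occurrence:", counts.max(), "(average is k = 10)")
```

A collapsed space shows up as a histogram piled into the top bin with almost no spread. For hubness, the average k-occurrence is always k, so the signal is in the tail: a few points with counts far above k are the hubs.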
Figure 9.1 follows the substrate. Top: token IDs become dense vectors through the embedding-table lookup. Middle: those vectors organise into clusters and directions inside a learned space, with retrieval reducing to a geometry problem. Bottom: the hardware view, where the embedding table dominates the bandwidth picture and ANN indexes plus quantisation make the geometry tractable to serve.
In L1's loop, embeddings are the form of input the system computes on after tokenisation. In L2's terms, the embedding space is a compressed encoding of token identity into something predictively useful. In L3's terms, generalisation through embeddings happens because nearby vectors share predictive structure. In L4's terms, emergent geometric axes (king-queen, country-capital) appear at scale in embedding spaces. In L5's terms, the learning signal shapes which directions the embedding space encodes. In L6's terms, RL agents represent state in embedding-like vectors and the policy operates over their geometry. In L7's terms, embeddings are the substrate where all of representation theory lives. In L8's terms, embeddings are the next layer after tokenisation, where discrete IDs become operational structure.
The model computes through geometry. Token IDs become vectors. Vectors live in a learned space. Similarity becomes distance. Relationships become directions. Retrieval becomes nearest-neighbour search. Classification becomes neighbourhood membership.
Every downstream component (attention, MLP, output projection, retrieval, multimodal alignment) operates on this geometric substrate. Once the substrate is right, the rest is small. When it's wrong, nothing fixes it.
The compass on the bench points where the vectors point. Direction in this space is the operational currency of modern AI.
Lesson 10 sits at the dust cover on the bench (station 10), where the lesson turns from how systems work to what they actually can and cannot do, drawing the honest perimeter of current AI capability with the systems-level vocabulary the phase has built.