PHASE 2 · THE WHITEBOARD WALL

Vectors. Direction, magnitude, and representation

Lesson 11. First station on the whiteboard wall. ~24 min read + cards + retrieval. Durability tier 1 (bedrock; the first formal maths the rest of Phase 2 attaches to).

Memory palace · Whiteboard wall · station 11
The arrow on the board. A single black arrow drawn on the whiteboard, then a second one, then a coordinate grid laid over them, then a cluster of small dots. The first formal mark on the wall.
Core idea. A vector is a quantity with magnitude and direction. In AI, a vector is a structured location in a learned space. Modern AI systems are, mechanically, machines that move, compare, transform, and optimise vectors. Every "embedding", "activation", "gradient", and "latent state" you met in Phase 1 was a vector by another name.

Why this lesson exists

Phase 1 mentioned vectors constantly. Embeddings were "points in semantic space". Gradients were "directional information". Attention was "similarity between vectors". Each phrase was a placeholder for a maths claim that hadn't been written down yet.

This lesson writes the first one down. Not because the maths is the point. Because the conceptual model needs vocabulary precise enough to carry the next twelve lessons, and "point in space" stops being precise the moment you have to actually compute something.

The good news: you already understand most of this. You walked the bench in Phase 1 talking about geometry, similarity, distance, and directions. This lesson puts the symbols underneath those words. Nothing new arrives. The intuition you have gains a way to be written.

Start with a scalar

A scalar is a single number. Temperature is 22 °C. The board you're laying out is 3.2 mm thick. Your input voltage is 5.0 V. Any quantity you can describe with one number is a scalar.

A scalar carries magnitude, and that's it. There's no "where". 22 °C doesn't point anywhere. It just is.

That works for some things. It stops working the moment you need to say two things at once and they have to travel together.

Now you need a vector

The wind isn't 12 km/h. It's 12 km/h from the north-east. A scalar can't carry that. You need two numbers (or a number and an angle) bound into a single object, and you need them to stay bound when you do anything with them.

That object is a vector. The same idea covers force on a mechanical joint (magnitude and line of action), velocity of a moving body (speed and heading), displacement on a map (how far, in which direction), and current flow through a network (amperes, with a sign for polarity). Every time a quantity needs both how much and which way, you reach for a vector.

The simplest mental picture is an arrow. It starts somewhere and points somewhere. Its length is the magnitude. Its direction is the direction. Two arrows are the same vector if they have the same length and the same direction, regardless of where you drew them on the page. The arrow is portable.

figure 11.1 · scalar vs vector. Left panel: a scalar (22 °C), magnitude only, no direction. Right panel: a vector drawn as an arrow from tail to head, with length = magnitude and slope = direction.
FIG 11.1. A scalar carries one piece of information: how much. A vector carries two: how much, which way. The arrow is the same vector regardless of where you anchor its tail, because what matters is length and direction, not position.

Components: when arrows have to be added

Two arrows on a page are easy to picture. The trouble starts when you want to combine them, compare them, or hand one to a computer.

So you impose a coordinate system. Draw two perpendicular axes. An arrow that starts at the origin and ends x units along the first axis and y units along the second is now described by an ordered pair: (x, y). Those two numbers are the components of the vector.

The components depend on the axes you chose. The arrow doesn't. You can rotate the axes and the components change, but the arrow on the page is still the same arrow. That's worth holding onto: components are a description of a vector in a frame; the vector itself is more primitive than its description.

Notation

You'll see a few notations for the same object. They mean the same thing.

v⃗ = (3, 4)            arrow notation, components in a row
v = [3, 4]            programmer notation
v = ⎡ 3 ⎤              column notation, used when matrices show up
    ⎣ 4 ⎦

The course will mostly use the row form (3, 4) when reading vectors as data, and the column form when matrices arrive in L12. Both describe the same arrow.

Magnitude

The magnitude of v = (3, 4) is its length. Pythagoras gives it: √(3² + 4²) = 5. Written ‖v‖ = 5. The double bars are the magnitude symbol. You'll see them a lot.

Magnitude is a scalar. It collapses the vector back down to one number whenever you only care "how big".
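
The Pythagoras step above can be sketched in a few lines of Python. The function name `magnitude` is an illustrative choice, not something the course defines; the sketch uses only the standard library.

```python
import math

def magnitude(v):
    """Length of a vector given as a sequence of components."""
    return math.sqrt(sum(c * c for c in v))

v = (3, 4)
print(magnitude(v))  # 5.0
```

A vector goes in and one number comes out, which is the "collapse to a scalar" described above.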

figure 11.2 · components, coordinates, magnitude. The arrow v = (3, 4) drawn on x and y axes: component 1 is 3 units along x, component 2 is 4 units along y, and the magnitude is ‖v‖ = √(3² + 4²) = 5. The arrow itself doesn't change if you rotate the axes; only the description does.
FIG 11.2. The arrow from the origin to (3, 4). Components project onto the axes (dashed). Magnitude is the length of the arrow. Coordinates are a frame imposed on the arrow; they let us compute, but the arrow's shape is more primitive than any particular set of numbers describing it.
checkpoint · pause and answer in your head
A vector w has magnitude 13 and is described in some coordinate frame by components (5, 12). If you rotate the frame 90° clockwise, the components change. Does the magnitude change? Answer before you read on.

Magnitude doesn't change. It's the length of the arrow, which exists before any frame you draw on top. Components are a description; magnitude is a property of the thing being described. This distinction matters because in AI the arrows are real and the coordinate frames are arbitrary; the model's behaviour depends on the arrows, not on the labels we put on them.
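
If you want to check the checkpoint numerically, here is a minimal sketch. The helper `components_in_rotated_frame` is a hypothetical name, and the sign convention (positive angle = frame rotated counterclockwise) is an assumption made for this example.

```python
import math

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

def components_in_rotated_frame(v, frame_angle_deg):
    """Components of the 2D vector v after rotating the axes.

    Positive angles rotate the frame counterclockwise, so a clockwise
    rotation is a negative angle. The arrow itself is not moved.
    """
    t = math.radians(frame_angle_deg)
    x, y = v
    return (x * math.cos(t) + y * math.sin(t),
            -x * math.sin(t) + y * math.cos(t))

w = (5, 12)
w_rotated = components_in_rotated_frame(w, -90)  # frame turned 90° clockwise

print(w_rotated)             # close to (-12, 5), up to floating-point rounding
print(magnitude(w))          # 13.0
print(magnitude(w_rotated))  # 13, up to floating-point rounding
```

The components change completely; the length does not.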

Adding vectors: walking, then walking again

If you walk 3 km east and then 4 km north, where do you end up? Not 7 km from where you started, because the two walks aren't in the same direction. You end up 5 km away on a bearing a little north of north-east (about 37° east of north), which is exactly the arrow from origin to (3, 4) in figure 11.2.

Vector addition is that. Componentwise: (3, 0) + (0, 4) = (3, 4). Geometrically: put the tail of the second arrow on the head of the first; the sum is the arrow from the original tail to the new head.

That's it. No formality. The component rule is just bookkeeping for the geometric rule.
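
As a sketch of the bookkeeping, assuming vectors are stored as plain Python tuples (the helper name `add` is illustrative):

```python
def add(a, b):
    """Componentwise sum of two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return tuple(x + y for x, y in zip(a, b))

east = (3, 0)   # walk 3 km east
north = (0, 4)  # then walk 4 km north
print(add(east, north))  # (3, 4)
```

The length check is there because adding vectors from spaces of different dimension has no meaning.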

figure 11.3 · vector addition is walking, then walking again. Head-to-tail and componentwise give the same answer: a = (3, 1), b = (1, 3), a + b = (4, 4). Add the first components, add the second components; that's the whole rule.
FIG 11.3. The geometric picture (head-to-tail) and the algebraic picture (add components) give the same answer because they're descriptions of the same operation. Walking 3 east then 1 north then 1 east then 3 north lands you 4 east and 4 north of where you started.

Scaling: stretching, shrinking, flipping

Multiplying a vector by a scalar stretches or shrinks it without changing its direction. 2·(3, 4) = (6, 8): same direction, twice as long. 0.5·(3, 4) = (1.5, 2): same direction, half as long. −1·(3, 4) = (−3, −4): same length, opposite direction.

A negative scalar flips the arrow. A scalar between 0 and 1 shrinks it. A scalar greater than 1 grows it. The rule is componentwise: multiply each component by the scalar.

This is the operation that does most of the work during optimisation. A gradient is a vector that points in the direction of steepest increase. Multiply it by a small negative scalar (the learning rate, made negative because we want to decrease the loss), and you get the step you should take to move downhill. Everything else in gradient descent is bookkeeping around that one operation.
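
Here is a small sketch of scaling, followed by one gradient-descent step built from it. The toy loss (squared distance from the origin), the starting point and the learning rate of 0.1 are all assumptions chosen to keep the arithmetic easy to follow; they are not taken from any real model.

```python
def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return tuple(c * x for x in v)

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def loss(w):
    # Toy loss, assumed for illustration: squared distance from the origin.
    return sum(x * x for x in w)

def gradient(w):
    # Gradient of the toy loss: the derivative of sum(w_i^2) w.r.t. w_i is 2 * w_i.
    return scale(2.0, w)

print(scale(2, (3, 4)))    # (6, 8)
print(scale(0.5, (3, 4)))  # (1.5, 2.0)
print(scale(-1, (3, 4)))   # (-3, -4)

learning_rate = 0.1
w = (3.0, 4.0)
step = scale(-learning_rate, gradient(w))  # small negative scalar times the gradient
w_next = add(w, step)

print(loss(w), loss(w_next))  # the loss is lower after the step
```

The step is nothing more than "scale, then add": the two operations introduced so far.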

figure 11.4 · scaling a vector. Same direction (or its reverse), different length: v = (3, 1), 2v = (6, 2), ½v = (1.5, 0.5), −v = (−3, −1). A scalar greater than 1 stretches, a scalar between 0 and 1 shrinks, a negative scalar flips.
FIG 11.4. Scaling acts on length, not direction (except a negative scalar, which reverses direction). The same operation, applied to a gradient with a small negative scalar, is one step of gradient descent.

Distance and similarity: what vectors are mostly used for

If you've got two arrows in the same space, the two most common questions you'll ask are: how far apart are they, and how alike are they?

Distance is the length of the arrow that goes from one tip to the other. In 2D with vectors a = (a₁, a₂) and b = (b₁, b₂), the distance is √((a₁−b₁)² + (a₂−b₂)²). That's just Pythagoras applied to the difference vector a − b. Same idea in 3D, in 100D, in 1000D. The formula keeps its shape; only the number of terms in the sum grows.
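
A sketch of "Pythagoras applied to the difference vector", written so the same code works in any number of dimensions. The helper names are illustrative.

```python
import math

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

def distance(a, b):
    """Euclidean distance: the magnitude of the difference vector a - b."""
    return magnitude(subtract(a, b))

print(distance((3, 4), (0, 0)))              # 5.0
print(distance((1, 2, 3), (1, 2, 3)))        # 0.0
print(distance((1, 0, 0, 0), (0, 1, 0, 0)))  # square root of 2, about 1.414
```

Python's standard library also provides `math.dist(a, b)`, which computes the same quantity.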

Similarity in the loose sense is "how much do these two arrows point the same way". Two arrows pointing in identical directions are maximally similar regardless of length. Two arrows at 90° are orthogonal: neither aligned nor opposed, so their similarity is zero. Two arrows pointing opposite ways are maximally dissimilar. L12 will write the exact formula (the dot product, and from it, cosine similarity). For now, the geometric picture is enough: alignment of direction is similarity.

Distance and similarity are the two operations behind almost everything Phase 1 talked about: retrieval (find documents close in embedding space), clustering (group nearby points), classification (find which prototype your input lands nearest), generalisation (assume that points near a known good point are also good).
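
To make "find which prototype your input lands nearest" concrete, here is a brute-force sketch. The labels and 2D coordinates are made up for illustration, and `nearest` is a hypothetical helper; real systems work in hundreds of dimensions and usually use an index rather than scanning every vector.

```python
import math

def nearest(query, labelled_vectors):
    """Return the label whose vector is closest to the query (brute force)."""
    return min(labelled_vectors,
               key=lambda label: math.dist(query, labelled_vectors[label]))

# Made-up 2D prototypes, one per class.
prototypes = {
    "animal": (1.0, 4.0),
    "vehicle": (5.0, 1.0),
    "fruit": (-3.0, -2.0),
}

print(nearest((1.5, 3.0), prototypes))  # animal
```

Retrieval has the same shape: swap the prototypes for document embeddings and keep the few closest instead of only one.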

Higher dimensions don't change the picture

A 2D vector has 2 components. A 3D vector has 3. A 1000D vector has 1000. The arrow picture stops being literally drawable past 3D, but the algebra doesn't care.

Addition: still componentwise. Scaling: still componentwise. Magnitude: still √(sum of squares of components). Distance: still √(sum of squares of differences). Each formula's shape stays constant; what grows is the number of terms inside the square root.

Most of the geometric intuition transfers, with one caveat to keep at the back of your mind: high-dimensional spaces are weird. Random vectors tend to be roughly orthogonal. Most of the volume of a high-dimensional ball is near its surface. Distances between random points concentrate. Phase 2 will return to these properties when they matter. For this lesson, the safe move is to picture things in 2D or 3D and trust that the algebra carries the picture to higher dimensions, with care taken at the edges.
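
One of those properties, distance concentration, is easy to see numerically. The sketch below draws random points from the unit cube (an assumption made only to keep the example simple) and measures how much the distances from one point to the rest vary relative to their average. The function name `relative_spread` and the fixed seed are illustrative choices.

```python
import math
import random
import statistics

def relative_spread(dim, n_points=200, seed=0):
    """Standard deviation divided by mean, for distances from one random
    point to the others. Points are uniform in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    reference = points[0]
    dists = [math.dist(reference, p) for p in points[1:]]
    return statistics.stdev(dists) / statistics.mean(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_spread(dim), 3))
```

The printed ratio should shrink as the dimension grows: in high dimensions, random points end up at nearly the same distance from each other, which is one reason 2D intuition needs care at the edges.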

Why embeddings are vectors

This is where Phase 1 starts becoming Phase 2.

An embedding is what a model produces when it converts something (a word, an image, a user, a code snippet) into a list of, say, 768 numbers. That list is a vector. The 768 numbers are its components in whatever frame the model's parameters happened to settle on during training.

Why are similar things near each other in that space? Because the training objective made them. A contrastive objective explicitly rewards pulling similar pairs closer and pushing dissimilar ones apart. A next-token objective indirectly does the same: words that play similar grammatical and semantic roles get pulled toward similar internal positions because that's what makes the prediction loss low. Either way, the result is a vector space where geometric closeness corresponds to whatever "similar" meant in the loss.

Once you have that geometry, the operations you already know start doing work. Retrieval becomes a nearest-neighbour query over the embedding store. Clustering becomes finding regions where many points pile up. Analogies show up as consistent directions (the king − man + woman ≈ queen example is exactly vector subtraction and addition). Generalisation becomes the claim that points near a known good point behave like that point, because the geometry was built to make that true.
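
The analogy arithmetic can be sketched with hand-made 2D "embeddings". The numbers below are invented so the relation works exactly; learned embeddings are only approximately this tidy, which is why the real statement uses ≈. The helper `nearest_word` is hypothetical.

```python
import math

# Invented 2D embeddings for illustration only. Dimension 0 loosely plays
# the role of "royalty", dimension 1 the role of "gender".
embeddings = {
    "king":  (4.0, 1.0),
    "queen": (4.0, 3.0),
    "man":   (1.0, 1.0),
    "woman": (1.0, 3.0),
    "apple": (-3.0, -2.0),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def nearest_word(target, exclude=()):
    """Word whose embedding is closest to target, skipping excluded words."""
    candidates = [w for w in embeddings if w not in exclude]
    return min(candidates, key=lambda w: math.dist(target, embeddings[w]))

target = add(subtract(embeddings["king"], embeddings["man"]), embeddings["woman"])
print(target)                                                  # (4.0, 3.0)
print(nearest_word(target, exclude=("king", "man", "woman")))  # queen
```

Excluding the three input words is a common convention in analogy demos, because the nearest neighbour of the result is otherwise often one of the inputs.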

figure 11.5 · a learned embedding space. A 2D sketch of what a 768-dimensional (or larger) space wants to do; the axes (embedding dim 1, embedding dim 2) are illustrative. Clusters: animals (dog, cat, hamster, rabbit), vehicles (car, truck, bike, scooter, van), fruits (apple, banana, pear, orange). Directions: king, queen, man, woman, where same direction = same relation and king − man + woman ≈ queen.
FIG 11.5. A learned embedding space, sketched in 2D. Real spaces have hundreds of dimensions; the geometry of the picture (clusters, directions, neighbours) is what survives the lift. Clusters appear because the training objective pulled similar items together. Repeating directions (like the gender-shift arrow) appear because the same semantic relation gets encoded the same way in the geometry.

Where vectors show up across an AI system

Now's a useful moment to map the vocabulary back onto the machine you built in Phase 1.

figure 11.6 · vectors across a neural network. Every stage carries a vector (forward pass in orange, gradient in blue). Forward: input token "cat" (an integer id) → embedding lookup → embedding vector e ∈ ℝ⁷⁶⁸ (a point in semantic space) → layer-by-layer activations h ∈ ℝ⁷⁶⁸ (moving vectors through space) → logits z ∈ ℝ⁵⁰²⁵⁷ (one per token) → softmax → probability → output token "sat" (argmax). Parameters: the embedding table holds 50257 vectors of dimension 768; layer weights are matrices acting on vectors; the output projection maps a vector to scores over the vocabulary. During training: the loss collapses output vs target into one scalar; backprop produces a gradient vector for every weight in the network; each gradient points in the direction that would increase the loss, and we step the opposite way.
FIG 11.6. Vectors at every stage of a generic neural network. The token is the only thing that isn't a vector (it's an index into the embedding table). The embedding is a vector; activations are vectors; logits are a vector; weights are organised as collections of vectors; gradients are vectors. The whole pipeline, forward and backward, is a long sequence of vector operations.

Three observations from that diagram, because they're the reason the next twelve lessons exist.

First, the only things in the forward pass that aren't vectors are the integer token at the input and the integer token at the output. Everything between them is vector traffic. That's why the apparatus you need to read these systems precisely is vector apparatus.

Second, the weights aren't scalars either. The matrices that act on activations are, internally, collections of vectors arranged in a grid. L12 (matrices) treats them as a single algebraic object, but at the level you can already reason about, they're "a stack of vectors that does something to an input vector".

Third, the gradient is a vector that points in a direction. That's not metaphor. The gradient of the loss with respect to a weight vector is, literally, a vector in the same space as that weight vector. Optimisation steps against it: scaling it by a small negative scalar (the negated learning rate) and adding the result to the weights is the entire move that drives training.
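
The forward path in figure 11.6 can be sketched at toy scale. Everything here is an assumption for illustration: a three-word vocabulary, 3-dimensional vectors and hand-picked numbers stand in for the 50257 × 768 table, and the intermediate layers are skipped entirely so the embedding goes straight to the output projection. The scoring uses an inner product, which L12 defines properly.

```python
import math

# Toy vocabulary and made-up numbers.
vocab = ["the", "cat", "sat"]
embedding_table = [
    (0.1, 0.0, 0.2),   # "the"
    (0.9, 0.3, -0.4),  # "cat"
    (-0.2, 0.8, 0.5),  # "sat"
]
# Output projection: one weight vector per vocabulary entry.
output_vectors = [
    (0.0, 0.1, 0.0),   # scores "the"
    (-0.5, 0.2, 0.1),  # scores "cat"
    (1.0, 0.5, -0.5),  # scores "sat"
]

def dot(a, b):
    """Inner product: multiply matching components and sum."""
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

token_id = vocab.index("cat")                  # integer in
h = embedding_table[token_id]                  # embedding lookup: integer -> vector
logits = [dot(w, h) for w in output_vectors]   # one score per vocabulary entry
probs = softmax(logits)                        # scores -> probabilities
next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax: integer out

print(vocab[next_id])  # sat
```

An integer goes in, an integer comes out, and every line in between operates on a vector.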

Compute spectrum: same maths, different walls

The vector operations from this lesson scale across the whole compute spectrum. Same algebra, different constraint sets.

Microcontroller. A small classifier on an MCU runs int8 dot products against a handful of 128-dim prototypes. Same vector maths, no GPU.
Mobile / edge. On-device embeddings for retrieval or wake-word detection. Quantised vector ops on an NPU, often int8 or int4.
Workstation GPU. Hundreds of millions of fp16 vector ops per inference. Tensor cores accelerate batched vector-vector and vector-matrix operations.
Hyperscale. Trillions of vector operations per training step across thousands of accelerators. Same primitives; the wall is interconnect and data-pipeline throughput.

What changes across these tiers is precision (fp32 → fp16 → int8 → int4), parallelism (one core to thousands), batch size, and which constraint dominates. What doesn't change is the operation: add vectors, scale vectors, take inner products, compute distances. The whole spectrum runs on the same maths.

"Geometry enables generalisation" becomes mathematical

The fourth core law from Phase 1 said geometry enables generalisation. C1 asked you to explain why. The honest answer, in conceptual form, was "because similar inputs land near each other in the learned space, and the model behaves continuously across that space".

Now write that mechanically. "Similar inputs land near each other" means: for two semantically related inputs x and x', the model's internal representations v(x) and v(x') have small distance ‖v(x) − v(x')‖. "The model behaves continuously" means: the function the model implements on those vectors doesn't jump wildly between nearby points. Together: if you've trained on x and the model works, then for any x' with small ‖v(x) − v(x')‖, the model probably also works on x'.

That's the maths of "geometry enables generalisation". It uses one operation from this lesson (distance) and one assumption about the model (continuity, which L13 will firm up). The reason the slogan landed in Phase 1 is that the underlying claim is genuinely simple once you have the vocabulary. The slogan was waiting for you to be able to write it.

mechanism · the core law, formal version
Generalisation is bounded by the geometry of the learned representation: distance(v(x), v(x')) small implies model(x) ≈ model(x'). The training objective shapes the representation so that this geometric closeness aligns with task-relevant similarity. When the alignment holds, the model generalises. When it doesn't (out of distribution, sparse training region, adversarial input), it doesn't. The mechanism is geometric, not magical.

What you should be able to do now

If most of those feel solid, the rest of Phase 2 will attach cleanly. If two or three feel wobbly, the flashcards and the retrieval prompts below are built to harden them.

Flashcards

Click a card to flip. Rate yourself: Again resets, Hard shortens the interval, Good lengthens it. State persists in this browser.

Retrieval practice

Write before you reveal. The point is to retrieve the mechanism, not to recognise it.

L11 A friend asks "why does an AI system store text as long lists of numbers? Why not just keep the words?" Answer in two short paragraphs, using the vocabulary from this lesson.
Words on their own are discrete tokens. They have identity (this word, not that word) but no built-in notion of similarity, distance, or relationship. The model can't take "dog" and ask "is this near cat or near car" without something more structured to work with. So during training, the model learns to map each token to a vector in a learned space, where similar tokens end up near each other and consistent relationships show up as consistent directions. That space is the embedding space, and the long list of numbers per word is just its coordinates in that space.
Once words are vectors, the operations that matter (similarity, retrieval, analogies, smooth interpolation, gradient-based updates) all become well-defined arithmetic. You don't lose the word, by the way; you keep the integer id and add the vector. The vector is for the model to compute on. The word is for humans.
L11 Why does a model trained on next-token prediction tend to produce embeddings where semantically similar words are geometrically close? Trace the mechanism from objective to geometry.
Next-token prediction trains the model to assign high probability to tokens that actually follow a given context. Two words that play similar roles ("dog" and "cat" both follow "the small ___") need to produce similar predictions to score well, because the contexts they appear in are similar. The cheapest way to produce similar predictions through the same network is for their internal representations to be similar too: the network is largely a continuous function from embedding to output, so nearby inputs give nearby outputs. Gradient descent then does the work. Every training step that involves "dog" and a context, and every training step that involves "cat" in a similar context, nudges their embedding vectors in directions that make the network's prediction work. The vectors end up clustered together because that's the configuration of the embedding space that minimises the loss. Geometry tracks distributional similarity because the objective rewards it.
L11 A gradient is "a vector pointing in the direction of steepest increase of the loss". Restate that in geometric terms a hardware engineer would accept: what does the arrow mean, what are its components, and what operation uses it?
The model has a long list of weights. Group them as a single vector w in a very high-dimensional space (millions to billions of components). The loss is a scalar function of w: feed the model the data, run it forward, compare to targets, get one number. The gradient of the loss at w is the arrow in w-space whose direction is "the way you'd move w if you wanted the loss to go up as fast as possible per unit step". Each component of the gradient is the partial derivative of the loss with respect to one specific weight: "how much does the loss change if I nudge this weight a tiny bit, holding everything else fixed". The operation that uses it is gradient descent: subtract a small scalar multiple of the gradient from w. That moves w in the direction opposite to the steepest increase, which is the direction of steepest decrease. Geometrically, you're standing on a hilly loss surface and taking small steps downhill, where "downhill" is defined precisely by the gradient vector at the spot you're standing on.
L11 A retrieval system stores 50 million document embeddings, each 768-dimensional. A query arrives and you have to return the 10 most relevant documents. Explain mechanistically what's happening, using only vector vocabulary from this lesson plus the systems intuition from Phase 1.
The query is embedded into the same 768-dimensional space the documents live in, using the same model (or one trained to share its geometry). Now the query and all 50 million documents are points in one shared vector space. "Most relevant" is operationalised as "geometrically closest", because the embedding model was trained so that semantic similarity corresponds to small distance (or large cosine similarity) in this space. Mechanically, the system needs to find the 10 document vectors with the smallest distance (or largest similarity) to the query vector. Brute force would compute 50 million distances per query, which is slow. So in practice an approximate nearest-neighbour index (HNSW, IVF, ScaNN) is used to find good candidates without scanning everything. Two Phase 1 connections matter here. (1) This only works because the embedding model arranged the geometry to make distance carry meaning; without that, nearest-neighbour search returns near-random results, which was exactly the brittle-embedding scenario from C1.12. (2) The whole system is a vector operation chain at the input side, then a smaller LLM-shaped operation at the output side; if you wanted to add reranking, you'd add another vector operation (a learned similarity model) on top.
↳ Phase 2 Look ahead. You now understand vectors. What single operation, once you learn it in L12, will let you describe almost every transformation inside a neural network as a single algebraic object acting on a vector? Predict the answer.
Matrix multiplication. A matrix is a structured collection of vectors arranged in a grid; multiplying a matrix by an input vector produces an output vector that's a specific linear combination of the matrix's rows (or columns, depending on convention). Every linear layer in a neural network is a matrix multiplication (plus a bias vector), typically followed by an element-wise nonlinearity. Attention is built largely from matrix multiplications. Embedding lookup is a matrix indexed by an integer. The whole forward pass of a transformer, if you squint, is "apply a matrix, add a vector, apply a nonlinearity, repeat". L12 puts the apparatus under that. You're already most of the way there: a matrix acting on a vector is just "do something structured to the vector". The structure is what L12 names.

Next station

You now have arrows on the board. The next thing pinned next to them is structure: the relationships between vectors. L12 turns the arrows into a chart room. Distance, direction, similarity, clusters, semantic retrieval. The same arrows you built here, suddenly organised.