Phase 1 mentioned vectors constantly. Embeddings were "points in semantic space". Gradients were "directional information". Attention was "similarity between vectors". Each phrase was a placeholder for a maths claim that hadn't been written down yet.
This lesson writes the first one down. Not because the maths is the point. Because the conceptual model needs vocabulary precise enough to carry the next twelve lessons, and "point in space" stops being precise the moment you have to actually compute something.
The good news: you already understand most of this. You walked the bench in Phase 1 talking about geometry, similarity, distance, and directions. This lesson puts the symbols underneath those words. Nothing new arrives. The intuition you have gains a way to be written.
A scalar is a single number. Temperature is 22°C. The board you're laying out is 3.2mm thick. Your input voltage is 5.0V. Any quantity you can describe with one number is a scalar.
A scalar carries magnitude, and that's it. There's no "where". 22°C doesn't point anywhere. It just is.
That works for some things. It stops working the moment you need to say two things at once and they have to travel together.
The wind isn't 12 km/h. It's 12 km/h from the north-east. A scalar can't carry that. You need two numbers (or a number and an angle) bound into a single object, and you need them to stay bound when you do anything with them.
That object is a vector. The same idea covers force on a mechanical joint (magnitude and line of action), velocity of a moving body (speed and heading), displacement on a map (how far, in which direction), and current flow through a network (amperes, with a sign for polarity). Every time a quantity needs both how much and which way, you reach for a vector.
The simplest mental picture is an arrow. It starts somewhere and points somewhere. Its length is the magnitude. Its direction is the direction. Two arrows are the same vector if they have the same length and the same direction, regardless of where you drew them on the page. The arrow is portable.
Two arrows on a page are easy to picture. The trouble starts when you want to combine them, compare them, or hand one to a computer.
So you impose a coordinate system. Draw two perpendicular axes. The arrow that started at the origin and ended at x units along the first axis and y units along the second is now described by an ordered pair: (x, y). Those two numbers are the components of the vector.
The components depend on the axes you chose. The arrow doesn't. You can rotate the axes and the components change, but the arrow on the page is still the same arrow. That's worth holding onto: components are a description of a vector in a frame; the vector itself is more primitive than its description.
You'll see a few notations for the same object. They mean the same thing.
v⃗ = (3, 4)      arrow notation, components in a row
v = [3, 4]      programmer notation
v = ⎡ 3 ⎤       column notation, used when matrices show up
    ⎣ 4 ⎦
The course will mostly use the row form (3, 4) when reading vectors as data, and the column form when matrices arrive in L12. Both describe the same arrow.
The magnitude of v = (3, 4) is its length. Pythagoras gives it: √(3² + 4²) = 5. Written ‖v‖ = 5. The double bars are the magnitude symbol. You'll see them a lot.
Magnitude is a scalar. It collapses the vector back down to one number whenever you only care "how big".
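The Pythagoras step can be written as a few lines of Python. This is a minimal sketch; the function name `magnitude` is an illustrative choice, not something the course defines.

```python
import math

def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

v = (3, 4)
print(magnitude(v))  # 5.0
```

The result is a single number: the vector went in, a scalar came out.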
Figure: the vector (3, 4). Components project onto the axes (dashed). Magnitude is the length of the arrow. Coordinates are a frame imposed on the arrow; they let us compute, but the arrow's shape is more primitive than any particular set of numbers describing it.

A vector w has magnitude 13 and is described in some coordinate frame by components (5, 12). If you rotate the frame 90° clockwise, the components change. Does the magnitude change? Answer before you read on.
Magnitude doesn't change. It's the length of the arrow, which exists before any frame you draw on top. Components are a description; magnitude is a property of the thing being described. This distinction matters because in AI the arrows are real and the coordinate frames are arbitrary; the model's behaviour depends on the arrows, not on the labels we put on them.
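You can check this numerically. The sketch below assumes a 2D frame rotated clockwise about the origin; the helper name `components_in_rotated_frame` is made up for illustration.

```python
import math

def components_in_rotated_frame(v, degrees_clockwise):
    """Components of the 2D vector v after the frame is rotated clockwise.

    Seen from inside the frame, rotating the axes clockwise looks like
    rotating the arrow counter-clockwise by the same angle.
    """
    t = math.radians(degrees_clockwise)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

def magnitude(v):
    return math.hypot(v[0], v[1])

w = (5, 12)
w_new = components_in_rotated_frame(w, 90)

# The components change (approximately (-12, 5), up to floating-point noise)...
print([round(c, 6) for c in w_new])
# ...but the length is the same in both frames.
print(round(magnitude(w), 6), round(magnitude(w_new), 6))
```

Try other angles: the components move around, the magnitude stays put.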
If you walk 3 km east and then 4 km north, where do you end up? Not 7 km from where you started, because the two walks aren't in the same direction. You end up 5 km away on a bearing roughly 37° east of north, which is exactly the arrow from origin to (3, 4) in figure 11.2.
Vector addition is that. Componentwise: (3, 0) + (0, 4) = (3, 4). Geometrically: put the tail of the second arrow on the head of the first; the sum is the arrow from the original tail to the new head.
That's it. No formality. The component rule is just bookkeeping for the geometric rule.
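The bookkeeping looks like this in Python. It's a minimal sketch with vectors as plain tuples; the name `add` is an illustrative choice.

```python
def add(a, b):
    """Componentwise sum of two vectors with the same number of components."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return tuple(x + y for x, y in zip(a, b))

east = (3, 0)   # 3 km east
north = (0, 4)  # 4 km north
print(add(east, north))  # (3, 4)
```

The length check matters: adding a 2D vector to a 3D one has no geometric meaning, so the sketch refuses rather than guessing.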
Multiplying a vector by a scalar stretches or shrinks it without changing its direction. 2·(3, 4) = (6, 8): same direction, twice as long. 0.5·(3, 4) = (1.5, 2): same direction, half as long. −1·(3, 4) = (−3, −4): same length, opposite direction.
A negative scalar flips the arrow. A scalar between 0 and 1 shrinks it. A scalar greater than 1 grows it. The rule is componentwise: multiply each component by the scalar.
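The three cases above, as a sketch in Python (the function name `scale` is illustrative):

```python
def scale(k, v):
    """Multiply every component of v by the scalar k."""
    return tuple(k * c for c in v)

v = (3, 4)
print(scale(2, v))    # (6, 8)       same direction, twice as long
print(scale(0.5, v))  # (1.5, 2.0)   same direction, half as long
print(scale(-1, v))   # (-3, -4)     same length, opposite direction
```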
This is the operation that does most of the work during optimisation. A gradient is a vector that points in the direction of steepest increase. Multiply it by a small negative scalar (the learning rate, made negative because we want to decrease the loss), and you get the step you should take to move downhill. Everything else in gradient descent is bookkeeping around that one operation.
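A single downhill step can be sketched with nothing but scaling and addition. The weights, gradient, and learning rate below are hypothetical numbers chosen for illustration, and `gradient_step` is an invented helper name; a real training loop would compute the gradient from a loss.

```python
def scale(k, v):
    return tuple(k * c for c in v)

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def gradient_step(weights, gradient, learning_rate):
    """One step downhill: weights + (-learning_rate) * gradient."""
    return add(weights, scale(-learning_rate, gradient))

# Hypothetical values, for illustration only.
weights = (1.0, 2.0)
gradient = (0.5, -1.0)   # direction of steepest increase of the loss
new_weights = gradient_step(weights, gradient, learning_rate=0.1)
print(new_weights)       # close to (0.95, 2.1)
```

The first weight moves down because its gradient component is positive; the second moves up because its component is negative. Both moves reduce the loss, at least locally.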
If you've got two arrows in the same space, the two most common questions you'll ask are: how far apart are they, and how alike are they?
Distance is the length of the arrow that goes from one tip to the other. In 2D with vectors a = (a₁, a₂) and b = (b₁, b₂), the distance is √((a₁−b₁)² + (a₂−b₂)²). That's just Pythagoras applied to the difference vector a − b. Same idea in 3D, in 100D, in 1000D. The formula keeps its shape; only the number of terms in the sum grows.
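As a sketch in Python, the distance formula is the magnitude of the difference vector. The function name `distance` is illustrative, and nothing in it depends on the number of components.

```python
import math

def distance(a, b):
    """Euclidean distance: length of the difference vector a - b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance((1, 2), (4, 6)))        # 5.0  (differences are -3 and -4)
print(distance((1, 2, 3), (1, 2, 3)))  # 0.0  (a point is at distance 0 from itself)
```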
Similarity in the loose sense is "how much do these two arrows point the same way". Two arrows pointing in identical directions are maximally similar regardless of length. Two arrows at 90° are orthogonal: neither has any component along the other. Two arrows pointing opposite ways are maximally dissimilar. L12 will write the exact formula (the dot product, and from it, cosine similarity). For now, the geometric picture is enough: alignment of direction is similarity.
Distance and similarity are the two operations behind almost everything Phase 1 talked about: retrieval (find documents close in embedding space), clustering (group nearby points), classification (find which prototype your input lands nearest), generalisation (assume that points near a known good point are also good).
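Retrieval by distance can be sketched in a few lines. The "documents" and their 2D vectors below are invented for illustration; a real embedding store would hold learned vectors with hundreds of components and would use an index rather than a linear scan.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, store):
    """Return the key in `store` whose vector is closest to `query`."""
    return min(store, key=lambda name: distance(query, store[name]))

# Hypothetical toy store: names and vectors are made up.
store = {
    "doc_a": (0.9, 0.1),
    "doc_b": (0.1, 0.9),
    "doc_c": (0.5, 0.5),
}
query = (0.8, 0.2)
print(nearest(query, store))  # doc_a
```

Classification against prototypes has the same shape: swap the documents for one prototype vector per class and return the nearest one.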
A 2D vector has 2 components. A 3D vector has 3. A 1000D vector has 1000. The arrow picture stops being literally drawable past 3D, but the algebra doesn't care.
Addition: still componentwise. Scaling: still componentwise. Magnitude: still √(sum of squares of components). Distance: still √(sum of squares of differences). Each formula's shape stays constant; what grows is the number of terms inside the square root.
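One way to see that the algebra doesn't care is to run the same two functions on 2, 3, and 1000 components without changing a line. A minimal sketch, with illustrative function names:

```python
import math

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

def distance(a, b):
    return magnitude([x - y for x, y in zip(a, b)])

print(magnitude((3, 4)))       # 5.0   (2D)
print(magnitude((2, 3, 6)))    # 7.0   (3D: 4 + 9 + 36 = 49)

ones = [1.0] * 1000            # a 1000D vector
zeros = [0.0] * 1000
print(magnitude(ones))         # sqrt(1000), about 31.62
print(distance(ones, zeros) == magnitude(ones))  # True
```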
Most of the geometric intuition transfers, with one caveat to keep at the back of your mind: high-dimensional spaces are weird. Random vectors tend to be roughly orthogonal. Most of the volume of a high-dimensional ball is near its surface. Distances between random points concentrate. Phase 2 will return to these properties when they matter. For this lesson, the safe move is to picture things in 2D or 3D and trust that the algebra carries the picture to higher dimensions, with care taken at the edges.
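The "distances concentrate" claim can be poked at with only the distance formula from this lesson. The sketch below is an informal experiment, not a proof: it draws random points uniformly from a unit cube (an arbitrary choice), and `relative_spread`, the point count, and the seed are all illustrative assumptions.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_spread(dim, n_points=50, seed=0):
    """(largest - smallest pairwise distance) / mean pairwise distance."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [distance(points[i], points[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_spread(dim), 3))
```

In 2D some pairs are nearly touching and others are far apart, so the spread is large. As the dimension grows, every pair ends up at roughly the same distance and the spread shrinks. That is why "nearest" becomes a weaker signal in very high dimensions unless the data has structure.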
This is where Phase 1 starts becoming Phase 2.
An embedding is what a model produces when it converts something (a word, an image, a user, a code snippet) into a list of, say, 768 numbers. That list is a vector. The 768 numbers are its components in whatever frame the model's parameters happened to settle on during training.
Why are similar things near each other in that space? Because the training objective made them. A contrastive objective explicitly rewards pulling similar pairs closer and pushing dissimilar ones apart. A next-token objective indirectly does the same: words that play similar grammatical and semantic roles get pulled toward similar internal positions because that's what makes the prediction loss low. Either way, the result is a vector space where geometric closeness corresponds to whatever "similar" meant in the loss.
Once you have that geometry, the operations you already know start doing work. Retrieval becomes a nearest-neighbour query over the embedding store. Clustering becomes finding regions where many points pile up. Analogies show up as consistent directions (the king − man + woman ≈ queen example is exactly vector subtraction and addition). Generalisation becomes the claim that points near a known good point behave like that point, because the geometry was built to make that true.
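The analogy arithmetic can be sketched with hand-made vectors. Everything below is a toy: the four 2D "embeddings" are invented so that one component loosely tracks royalty and the other gender. Learned embeddings have no such tidy axes, and the analogy only holds approximately in practice.

```python
import math

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy embeddings, hand-made for illustration.
words = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

# king - man + woman: remove the "man" direction, add the "woman" direction.
target = add(subtract(words["king"], words["man"]), words["woman"])

# Nearest stored word to the resulting point.
answer = min(words, key=lambda w: distance(target, words[w]))
print(target, answer)  # (1.0, -1.0) queen
```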
Now's a useful moment to map the vocabulary back onto the machine you built in Phase 1.
Three observations from that diagram, because they're the reason the next twelve lessons exist.
First, the only things in the forward pass that aren't vectors are the integer token at the input and the integer token at the output. Everything between them is vector traffic. That's why the apparatus you need to read these systems precisely is vector apparatus.
Second, the weights aren't scalars either. The matrices that act on activations are, internally, collections of vectors arranged in a grid. L12 (matrices) treats them as a single algebraic object, but at the level you can already reason about, they're "a stack of vectors that does something to an input vector".
Third, the gradient is a vector that points in a direction. That's not metaphor. The gradient of the loss with respect to a weight vector is, literally, a vector in the same space. Optimisation steps against it, because the gradient points uphill and training wants to go down. Scaling it by a small negative scalar (the negated learning rate) is the entire move that drives training.
The vector operations from this lesson scale across the whole compute spectrum. Same algebra, different constraint sets.
What changes across these tiers is precision (fp32 → fp16 → int8 → int4), parallelism (one core to thousands), batch size, and which constraint dominates. What doesn't change is the operation: add vectors, scale vectors, take inner products, compute distances. The whole spectrum runs on the same maths.
The fourth core law from Phase 1 said geometry enables generalisation. C1 asked you to explain why. The honest answer, in conceptual form, was "because similar inputs land near each other in the learned space, and the model behaves continuously across that space".
Now write that mechanically. "Similar inputs land near each other" means: for two semantically related inputs x and x', the model's internal representations v(x) and v(x') have small distance ‖v(x) − v(x')‖. "The model behaves continuously" means: the function the model implements on those vectors doesn't jump wildly between nearby points. Together: if you've trained on x and the model works, then for any x' with small ‖v(x) − v(x')‖, the model probably also works on x'.
That's the maths of "geometry enables generalisation". It uses one operation from this lesson (distance) and one assumption about the model (continuity, which L13 will firm up). The reason the slogan landed in Phase 1 is that the underlying claim is genuinely simple once you have the vocabulary. The slogan was waiting for you to be able to write it.
distance(v(x), v(x')) small implies model(x) ≈ model(x'). The training objective shapes the representation so that this geometric closeness aligns with task-relevant similarity. When the alignment holds, the model generalises. When it doesn't (out of distribution, sparse training region, adversarial input), it doesn't. The mechanism is geometric, not magical.
If most of those feel solid, the rest of Phase 2 will attach cleanly. If two or three feel wobbly, the flashcards and the retrieval prompts below are built to harden them.
You now have arrows on the board. The next thing pinned next to them is structure: the relationships between vectors. L12 turns the arrows into a chart room. Distance, direction, similarity, clusters, semantic retrieval. The same arrows you built here, suddenly organised.