L11 gave you the arrow. L12 turned the geometry between arrows into similarity and retrieval. L13 explained why those arrows live in spaces with 768 or 4096 dimensions: each axis is a representational degree of freedom. None of that yet says how a model gets from one representation to another. A token goes into an embedding layer and a vector comes out; a different vector comes out of the next layer. Then another, then another. Something is doing the transformation. That something is a matrix.
The trap most early treatments fall into is teaching matrices as tables of numbers and matrix multiplication as a bookkeeping ritual. That order produces students who can multiply matrices and have no idea what the multiplication means. This lesson does it the other way around. The matrix is a transformation of space first; the rectangle of numbers is how we write it down second.
If you've seen matrices before, this is the unlearning step. The rectangle of numbers is the description, not the thing.
Compare this with vectors. L11 was clear that the arrow is more primitive than the list of components used to describe it. Rotate the coordinate frame and the components change; the arrow doesn't. The same is true for matrices, one level up. The transformation a matrix represents is primitive. The grid of numbers is one way to write that transformation down inside a chosen coordinate frame. Different frame, different numbers, same transformation.
So the right mental object for a matrix is not the grid. It's the verb. A matrix does something to every vector in the space at once. The numbers in the grid encode what it does, but the doing is the point.
The atomic act of a matrix is to take a vector and produce another vector. Write it as A·v = w. Read it as "apply A to v, get w". The matrix is the function; the input vector is what you feed it; the output vector is the result.
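A minimal sketch of "apply A to v, get w" in NumPy. The matrix and vector here are hypothetical values picked only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical 2x2 matrix and input vector, chosen for illustration only.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
v = np.array([1.0, 2.0])

# "Apply A to v, get w": the @ operator is matrix-vector multiplication.
w = A @ v
print(w)  # [4. 6.]
```

The first output coordinate is 2·1 + 1·2 = 4 and the second is 0·1 + 3·2 = 6.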
Two things to hold onto from this single picture.
First, the input and the output live in vector spaces. They may be the same space (a matrix that rotates a 2D arrow returns another 2D arrow) or different spaces (a matrix that maps a 768-dim embedding to a 4096-dim hidden state). Matrices are how a model moves a representation from one space to another, including from a smaller space to a larger one and back.
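To make the "different spaces" case concrete, here is a small sketch that tracks only shapes. The names `W_up` and `W_down` and the random entries are assumptions for illustration; a real model's matrices are learned, not sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_small, d_large = 768, 4096

# Hypothetical matrices with random entries; only their shapes matter here.
W_up = rng.standard_normal((d_large, d_small))    # maps 768-dim -> 4096-dim
W_down = rng.standard_normal((d_small, d_large))  # maps 4096-dim -> 768-dim

x = rng.standard_normal(d_small)  # stand-in for a 768-dim embedding
h = W_up @ x                      # now lives in the 4096-dim space
y = W_down @ h                    # back in a 768-dim space

print(x.shape, h.shape, y.shape)  # (768,) (4096,) (768,)
```

The matrix shape is always (output dimension, input dimension), which is how you read off which spaces it maps between.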
Second, the transformation is one fixed rule applied to every vector. The matrix doesn't adapt to the vector you fed it; the same grid of numbers acts on all of them. That uniformity is the intuition behind the word linear: the matrix bends the space consistently, in one go, for every point at once.
You'll see "linear transformation" everywhere in AI material. The word has a specific, useful meaning that's worth holding.
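A linear transformation is a transformation that respects two simple properties:
Sums: transforming a sum of two vectors gives the sum of the two transformed vectors, A·(u + v) = A·u + A·v.
Scalings: transforming a scaled vector gives the scaled transformed vector, A·(c·v) = c·(A·v).
Together those two properties mean the transformation behaves uniformly across the whole space. If you know what it does to the reference vectors along each axis (the standard basis), you know what it does to every vector, because every other vector is a sum of scaled copies of those basis vectors.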
This is the structural reason matrices are so useful. The transformation is fully described by what it does to the basis vectors. Each column of the matrix is, literally, "this is where the corresponding basis vector lands". Read a matrix that way and the grid of numbers stops looking arbitrary.
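A short sketch that checks these claims numerically: that a matrix respects sums and scalings (the two properties usually meant by "linear"), and that each column is where the matching basis vector lands. All values are hypothetical.

```python
import numpy as np

# Hypothetical matrix, vectors, and scalar for illustration.
A = np.array([[2.0, -1.0],
              [1.0,  3.0]])
u = np.array([1.0, 4.0])
v = np.array([-2.0, 0.5])
c = 3.0

# Linearity: sums and scalings pass straight through the matrix.
respects_sums = np.allclose(A @ (u + v), A @ u + A @ v)
respects_scaling = np.allclose(A @ (c * v), c * (A @ v))

# Columns are the images of the standard basis vectors.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
columns_are_images = (np.array_equal(A @ e1, A[:, 0])
                      and np.array_equal(A @ e2, A[:, 1]))

# So any vector's image is the same combination of the columns.
rebuilt = u[0] * A[:, 0] + u[1] * A[:, 1]

print(respects_sums, respects_scaling, columns_are_images)
print(np.allclose(rebuilt, A @ u))
```

The last check is the reason the columns are enough: u is u[0] copies of the first basis vector plus u[1] copies of the second, so its image is the same mix of the two columns.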
Almost every linear transformation you'll meet in AI is a combination of four primitive moves: stretching, compressing, rotating, and shearing. Each one has a clean visual story. Holding the four pictures makes the entire algebra easier.
A pure stretch scales the space along one or more axes. A 2× stretch in the x direction doubles every x-coordinate and leaves y alone. The grid widens; arrows pointing along x get longer; the relative arrangement of everything stays the same. In matrix form, a diagonal matrix with entries (2, 1) does this; the columns say "the first basis vector now points to (2, 0); the second still points to (0, 1)".
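The 2× stretch along x can be written out directly. This is a sketch with a hypothetical test vector; `np.diag` builds the diagonal matrix from its entries.

```python
import numpy as np

# Diagonal matrix with entries (2, 1): stretch x by 2, leave y alone.
S = np.diag([2.0, 1.0])

v = np.array([3.0, 4.0])   # hypothetical input vector
print(S @ v)               # [6. 4.]

# The columns say where the basis vectors land.
print(S[:, 0])             # [2. 0.]
print(S[:, 1])             # [0. 1.]
```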
In a neural network, stretching shows up whenever a layer amplifies some learned directions and dampens others. Important features get stretched into the dominant directions of the representation; nuisance variation gets shrunk away. The optimiser shapes those stretches because that's what makes the loss go down.
The flip side of stretching. A compression scales one or more axes toward zero. Multiply the x-coordinate of every vector by 0.5 and the grid narrows; multiply it by zero and the entire space collapses onto the y-axis. That last case is a projection: a higher-dimensional space getting flattened onto a lower-dimensional one.
Compression is where dimensions get cut. An attention layer that maps a 4096-dim hidden state down to a 1024-dim head is, structurally, a projection: a matrix whose output dimension is smaller than its input. Information about the input that lived in the directions being collapsed is lost in that step. The matrix is the editor deciding what to keep.
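A sketch of both cases, with hypothetical input vectors. The second half shows what "collapse" costs: two different inputs become indistinguishable after the projection.

```python
import numpy as np

C = np.diag([0.5, 1.0])   # compress x by half
P = np.diag([0.0, 1.0])   # collapse x entirely: projection onto the y-axis

a = np.array([3.0, 4.0])  # hypothetical inputs that differ only in x
b = np.array([-7.0, 4.0])

compressed = C @ a        # x halved, y untouched
pa, pb = P @ a, P @ b     # both land on the same point of the y-axis

# After projection the two inputs can no longer be told apart.
same_after_projection = np.array_equal(pa, pb)

# The rank counts how many dimensions survive the transformation.
rank_C = np.linalg.matrix_rank(C)
rank_P = np.linalg.matrix_rank(P)
print(compressed, same_after_projection, rank_C, rank_P)
```

The compression keeps rank 2, so it can be undone by stretching back. The projection drops to rank 1, and no matrix can recover the x-coordinates it discarded.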
A rotation reorients the space without stretching or compressing it. The grid swings around the origin; arrows keep their lengths; relative angles between arrows stay the same. The 2D rotation counterclockwise by an angle θ has columns (cos θ, sin θ) and (−sin θ, cos θ); each basis vector lands on the unit circle, just at a new angle.
Rotations realign which directions are which. A model can use a rotation to swap which axis encodes which feature, or to align an internal representation with an external coordinate system. Rotations are common as components inside more complex transformations (think of a change of basis between two orthonormal frames: it's a rotation, possibly with a reflection; a change to a skewed or rescaled basis adds stretch and shear on top).
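A sketch that builds the rotation matrix from those two columns and checks the claims about lengths and angles. The angle and test vectors are hypothetical; note that NumPy's trigonometric functions take radians.

```python
import numpy as np

theta = np.deg2rad(30.0)  # hypothetical angle, converted to radians

# Columns are (cos θ, sin θ) and (−sin θ, cos θ).
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u = np.array([3.0, 4.0])
v = np.array([-1.0, 2.0])
Ru, Rv = R @ u, R @ v

# Lengths are unchanged by the rotation.
lengths_kept = np.isclose(np.linalg.norm(Ru), np.linalg.norm(u))

# Dot products are unchanged too, so the angle between u and v survives.
dots_kept = np.isclose(Ru @ Rv, u @ v)

print(lengths_kept, dots_kept)
```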
A shear tilts the space. In a horizontal shear, horizontal lines stay horizontal and every point slides sideways by an amount proportional to its height, so vertical lines all lean over by the same angle. The grid squares turn into parallelograms. Shears are the workhorse of "the input axes weren't quite the right axes": a shear can take a correlated set of directions and tilt them into a less correlated one.
In learned representations, shears appear whenever a layer needs to mix some directions into others while leaving the rest alone. Combined with stretches, shears can decorrelate features, pull apart classes that sat close together, and re-mix one subspace while leaving another untouched. The richness of "any linear transformation" comes from combining shears, stretches, and rotations across many dimensions at once.
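A sketch of a horizontal shear, assuming a hypothetical shear factor `k`. A point (x, y) moves to (x + k·y, y): its height never changes, and its sideways slide grows with its height.

```python
import numpy as np

k = 0.5  # hypothetical shear factor
H = np.array([[1.0, k],
              [0.0, 1.0]])

on_x_axis = np.array([3.0, 0.0])  # height 0: does not move
raised = np.array([0.0, 2.0])     # height 2: slides right by k * 2

print(H @ on_x_axis)  # [3. 0.]
print(H @ raised)     # [1. 2.]
```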
Words can only carry these so far. The grid picture makes the four operations sit next to each other.
One matrix bends the space once. Apply a second matrix and the space gets bent again, starting from where the first bend left it. That's composition. The combined effect of "first do A, then do B" is itself a linear transformation, so it can also be written as a matrix. That matrix is called the product B·A, and computing it is what matrix multiplication actually means.
Composition is why the order in a product matters. Stretching first and then rotating gives a different result from rotating first and then stretching. The grid ends up in a different shape. In matrix form: B·A and A·B are usually different transformations, and the algebra correctly tracks the order.
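A sketch of both points, using a 2× stretch along x and a quarter-turn rotation as hypothetical examples. The product is read right to left: in `R @ S`, the stretch S acts first.

```python
import numpy as np

S = np.diag([2.0, 1.0])       # stretch x by 2
R = np.array([[0.0, -1.0],    # rotate by 90 degrees counterclockwise
              [1.0,  0.0]])
v = np.array([1.0, 0.0])

# "First S, then R" as two steps, and as one combined matrix.
two_steps = R @ (S @ v)
combined = (R @ S) @ v
same_thing = np.allclose(two_steps, combined)

# Swapping the order gives a different transformation.
stretch_then_rotate = (R @ S) @ v   # v -> (2, 0) -> (0, 2)
rotate_then_stretch = (S @ R) @ v   # v -> (0, 1) -> (0, 1)
order_matters = not np.allclose(stretch_then_rotate, rotate_then_stretch)

print(same_thing, order_matters)
```

Stretching first catches the vector while it still points along x, so it doubles. Rotating first moves it onto the y-axis, where the stretch has nothing to act on.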
This is also why you can fuse layers in a model. If two consecutive layers are pure linear maps (no nonlinearity between them), the pair is mathematically equivalent to a single matrix that combines both. Real architectures interleave nonlinearities precisely so the layers can't be collapsed: each nonlinearity prevents the composition from being just another linear map, and that's what gives the network its expressive power. Without the nonlinearity, "a deep stack of layers" reduces to "one layer", which couldn't model anything beyond a linear function.
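The collapse can be checked on a toy example. This sketch uses hypothetical two-layer weights and ReLU as the nonlinearity (an assumption; the lesson does not fix a particular one).

```python
import numpy as np

def relu(x):
    # Element-wise nonlinearity: negative entries become zero.
    return np.maximum(x, 0.0)

# Hypothetical weights for two consecutive layers.
W1 = np.array([[1.0,  2.0],
               [3.0, -1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, -1.0])

# Two purely linear layers equal one fused matrix.
fused = W2 @ W1
linear_stack = W2 @ (W1 @ x)
fused_output = fused @ x
collapses = np.allclose(linear_stack, fused_output)

# With a nonlinearity in between, the fused matrix no longer matches.
nonlinear_stack = W2 @ relu(W1 @ x)
still_matches = np.allclose(nonlinear_stack, fused_output)

print(collapses, still_matches)  # True False
```

Here W1·x is (−1, 4). The linear stack sums those to 3; ReLU zeroes the −1 first, so the nonlinear stack gives 4. A single mismatch shows only that this particular fused matrix fails; the general statement is that the nonlinear stack is no longer a linear map, so no single matrix can reproduce it on all inputs.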
Now the central claim of this lesson, written down. A neural network is, structurally, a sequence of matrix-driven transformations of a learned representation, interleaved with small element-wise nonlinearities. Each layer takes the previous representation as input, applies a learned matrix to it, applies a nonlinearity, and hands the result to the next layer.
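That sentence translates almost word for word into a loop. The sketch below is deliberately stripped down: the layer sizes are hypothetical, the weights are random rather than learned, ReLU stands in for the nonlinearity, and biases are left out. Real networks usually add biases and often treat the final layer differently.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights):
    """Apply each matrix in turn, with a nonlinearity after each one."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # hypothetical widths: input, two hidden, output

# One (n_out x n_in) matrix per consecutive pair of widths.
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])
y = forward(x, weights)
print([W.shape for W in weights])  # [(16, 8), (16, 16), (4, 16)]
print(y.shape)                     # (4,)
```

Reading the list of shapes from left to right traces the spaces the representation passes through: 8 dimensions to 16, 16 to 16, 16 to 4.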
Every box marked W_i in that picture is a matrix the model has learned. Together they encode what the model "knows": which directions in the input matter, how features get combined, which axes the final layer needs to read out to produce the answer. Training is the process of changing those matrices' entries so the loss goes down.
That picture is structurally the same for almost everything: image classifiers, language models, recommender ranking models, protein folding networks. The boxes get fancier (attention is its own internal arrangement of matrix multiplications; convolutions are matrices with special sparsity patterns; transformer blocks are repeated copies of a particular wiring), but the underlying objects are matrices acting on representations. Reading any architecture is, in the end, reading a graph of matrix multiplications.
Phase 1 introduced embeddings: vectors that encode meaning as direction in a learned space. L9 made the embedding table feel like geometry. What that account left implicit was how a model uses an embedding once it has one.
The answer is now visible. The embedding is the first vector in the chain. Every layer after the input applies its own matrix, which moves that vector through a sequence of latent spaces. By the time the embedding reaches the last layer, it's been transformed many times. Each transformation reshaped what the dimensions mean. The "raw" embedding of the word "dog" enters one space; deep inside the model, that vector has been bent into a representation that includes context, syntactic role, sentiment, and whatever else the loss rewarded the network for tracking.
This is what "the model is doing its real work in latent space" means. The transformations between layers are what gradually mould the input embedding into something the final layer can act on. Without the matrices, the embedding never gets transformed; with them, it travels through dozens of learned coordinate systems before producing an output.
"Feature extraction" is a phrase you'll see all over AI material. It usually gestures at something fuzzy. With matrices in hand, it's concrete: feature extraction is what happens when a matrix is applied to a representation and the output coordinates correspond to higher-level features than the input coordinates did.
A 2D image fed into a CNN starts in pixel space (axes = individual pixels). After the first convolutional layer (which is a matrix with a special structure), the representation is in a space whose axes correspond to small local patterns: edges at various orientations, colour blobs, low-level textures. After more layers, the axes correspond to parts (eyes, wheels, leaves). Deeper still, the axes correspond to object identities. Each step is a matrix doing a transformation that re-bases the space onto more abstract features.
"The model learned to extract features" is, mechanically, "the model learned matrices whose rows correspond to features useful for the task". The features are columns and rows of those matrices, baked in during training. That's it. No separate machinery; the matrices are the features.
Three reasons matrices show up everywhere.
First, expressiveness. The set of linear transformations is rich enough that, combined with nonlinearities, the resulting compositions can approximate essentially any function the task demands. The mathematical phrase is "universal function approximation"; the practical consequence is that you don't need a fundamentally new kind of object to build a working model. You need stacks of these.
Second, gradients. The derivative of "a matrix applied to a vector" with respect to the matrix entries is clean and computable. That makes the whole network differentiable, which is what lets gradient descent work. Many other operations have awkward derivatives or aren't differentiable at all; matrix multiplication is the operation that plays well with the optimisation machinery.
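To see how clean that derivative is, take a squared-error loss on w = A·v (the loss is an assumption for illustration). The gradient with respect to A is an outer product: the error in the output times the input vector. The sketch below checks that formula against a brute-force finite-difference estimate, using hypothetical random values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
v = rng.standard_normal(4)
target = rng.standard_normal(3)

def loss(M):
    # Squared error between M @ v and a fixed target.
    r = M @ v - target
    return 0.5 * (r @ r)

# Analytic gradient: (output error) outer (input vector).
grad_analytic = np.outer(A @ v - target, v)

# Numerical gradient: nudge each entry up and down, measure the change.
eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_numeric[i, j] = (loss(A + E) - loss(A - E)) / (2 * eps)

max_err = np.max(np.abs(grad_analytic - grad_numeric))
print(max_err < 1e-6)
```

The analytic version costs one outer product; the brute-force version costs two loss evaluations per matrix entry. That gap is why a closed-form gradient matters at scale.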
Third, hardware. Matrix multiplication maps cleanly onto the silicon that became cheap enough to scale in the 2010s. A matrix multiply on a GPU is thousands of small arithmetic units running in parallel; the operation is hardware-friendly in a way few alternatives are. This is the line L1 and S1 both came back to: hardware shaped which architectures won, and the architectures that won were the ones built around matmul. Phase 3 will name the silicon piece by piece. Phase 4 will read the surviving architectures. Both rest on the matrix as the load-bearing operation.
Matrices cost two things: memory to store, and compute to apply.
Storage is the easy part. A matrix mapping a d-dimensional input to a d-dimensional output has d² entries. In fp16 (2 bytes each), a 4096 × 4096 matrix is 32 MiB (about 33.5 MB). A frontier model has hundreds of matrices this size or larger across its layers, which is why parameter counts climb into the tens or hundreds of billions.
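The storage arithmetic, spelled out as a quick check:

```python
d = 4096
bytes_per_entry = 2              # fp16

entries = d * d                  # d squared entries
size_bytes = entries * bytes_per_entry

print(entries)                   # 16777216
print(size_bytes)                # 33554432
print(size_bytes / 2**20)        # 32.0  (binary megabytes, MiB)
```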
Compute is the load-bearing part. Applying a (d_out × d_in) matrix to a length-d_in vector takes d_out × d_in multiply-adds. Doubling the hidden width quadruples that cost per layer (the d² scaling L13 named). Across a deep model, those multiplications are the bulk of training and inference compute. Phase 3 will return to this when tensor cores and the roofline model land; for now, hold "matmul is the operation, and it's the operation that the whole hardware story is written around".
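The same kind of check for compute. The helper name `matvec_multiply_adds` is hypothetical; it just counts one multiply-add per matrix entry.

```python
def matvec_multiply_adds(d_out, d_in):
    # Each output coordinate is a dot product of length d_in.
    return d_out * d_in

base = matvec_multiply_adds(4096, 4096)
wide = matvec_multiply_adds(8192, 8192)   # hidden width doubled

print(base)          # 16777216
print(wide // base)  # 4
```

This counts a single vector through a single square layer. Processing a batch or a sequence multiplies the count by the number of vectors, but the factor of four from doubling the width stays.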
Matrices appear at every tier of the compute spectrum; what changes is how aggressively they're compressed.
Same operation, different scale. Whether you're squeezing a tiny classifier into a microcontroller or training a 400B model on a cluster, the kernel doing the actual work is matrix multiplication.
A few things the matrix view does not, by itself, do.
It doesn't explain how the matrices are chosen. Their entries come from training; that's the next phase's territory. This lesson is about what the matrices, once chosen, mean.
It doesn't carry the nonlinearity. Pure linear stacks collapse to one linear map; the σ in every figure above is what lets depth do anything beyond what a single layer could. The nonlinearity gets its own treatment when we read transformer blocks.
It doesn't yet describe special-structured matrices: convolutions (matrices with weight sharing), attention (matrices whose entries are themselves computed from input), embeddings (matrices indexed by integer ids). Each will get its own lesson; each will be a matrix with a particular shape.
What it does do is install the right object as the primary one. Reading any model from here on, the question to put to every box is "what matrix is doing the work, and what space does it map between?". That habit is what this lesson exists to build.
You now have arrows, geometry, dimensions, and the machine that bends a whole space. The next move stays in geometry but changes the question from how to move a representation to what to keep. A matrix can shrink a space as well as bend it, and when it does, some information survives and the rest is discarded. That choice, what to preserve and what to throw away, runs through embeddings, attention, and feature extraction. L15 puts the stack of transparent sheets on the wall.