PHASE 2 · THE WHITEBOARD WALL


Matrices and linear transformations

Lesson 14. Fourth station on the whiteboard wall. ~26 min read + cards + retrieval. Durability tier 1 (bedrock; the matrix as a machine, not as a table).


Memory palace · Whiteboard wall · station 14

The grid. A square grid drawn on the wall, then the same grid bent: stretched, rotated, sheared, collapsed. Each bend is a matrix at work. The arrows from L11 still sit on the wall; the grid shows what happens to every arrow at once when a matrix is applied.

Core idea. A matrix is not primarily a table of numbers. It is a machine that transforms vectors: vector in, vector out, with the entire space bent in one consistent way. Modern neural networks are mostly stacks of these transformations, applied to representations layer after layer. Reading any AI model with confidence starts here.

Why this lesson exists

L11 gave you the arrow. L12 turned the geometry between arrows into similarity and retrieval. L13 explained why those arrows live in spaces with 768 or 4096 dimensions: each axis is a representational degree of freedom. None of that yet says how a model gets from one representation to another. A vector goes into an embedding layer; a different vector comes out of the next layer. Then another, then another. Something is doing the transformation. That something is a matrix.

The trap most early treatments fall into is teaching matrices as tables of numbers and matrix multiplication as a bookkeeping ritual. That order produces students who can multiply matrices and have no idea what the multiplication means. This lesson does it the other way around. The matrix is a transformation of space first; the rectangle of numbers is how we write it down second.

A matrix is not a table

If you've seen matrices before, this is the unlearning step. The rectangle of numbers is the description, not the thing.

Compare this with vectors. L11 was clear that the arrow is more primitive than the list of components used to describe it. Rotate the coordinate frame and the components change; the arrow doesn't. The same is true for matrices, one level up. The transformation a matrix represents is primitive. The grid of numbers is one way to write that transformation down inside a chosen coordinate frame. Different frame, different numbers, same transformation.

So the right mental object for a matrix is not the grid. It's the verb. A matrix does something to every vector in the space at once. The numbers in the grid encode what it does, but the doing is the point.

Vector in, vector out

The atomic act of a matrix is to take a vector and produce another vector. Write it as A·v = w. Read it as "apply A to v, get w". The matrix is the function; the input vector is what you feed it; the output vector is the result.

flowchart LR
  V["vector v<br/>(3, 1)"]:::vec --> A[["matrix A<br/>transformation"]]:::mat --> W["vector Av<br/>(transformed)"]:::vec
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef mat fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;

FIG 14.1. The atomic act. A matrix takes one vector and returns another. The input vector is not modified; the matrix is the box in the middle, and the output is a new vector. Reading any neural-network forward pass is reading a chain of these.

Two things to hold onto from this single picture.

First, the input and the output live in vector spaces. They may be the same space (a matrix that rotates a 2D arrow returns another 2D arrow) or different spaces (a matrix that maps a 768-dim embedding to a 4096-dim hidden state). Matrices are how a model moves a representation from one space to another, including from a smaller space to a larger one and back.

Second, the transformation follows one rule for every vector. The matrix doesn't "see" what vector you fed it; it applies the same operation to all of them. That uniformity is the intuition behind the word linear: the matrix bends the space consistently, in one go, for every point at once.
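To make "vector in, vector out" concrete, here is a minimal sketch in plain Python. The helper name `matvec` and the two example matrices are illustrative assumptions, not something the lesson defines; only the input vector (3, 1) comes from the figure.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v and return A·v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

v = [3, 1]

# Same space in and out: a 2x2 matrix maps a 2D vector to a 2D vector.
A = [[2, 0],
     [0, 1]]
w = matvec(A, v)

# Different spaces: a 3x2 matrix maps a 2D vector to a 3D vector.
P = [[1, 0],
     [0, 1],
     [1, 1]]
u = matvec(P, v)

print(w)  # [6, 1]
print(u)  # [3, 1, 4]
```

The shape of the matrix fixes the two spaces: the number of columns is the input dimension, and the number of rows is the output dimension.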

What "linear" actually means

You'll see "linear transformation" everywhere in AI material. The word has a specific, useful meaning that's worth holding.

A linear transformation is a transformation that respects two simple properties:

Scaling commutes: stretching a vector first and then transforming it gives the same result as transforming it first and then stretching by the same amount.
Addition commutes: adding two vectors first and then transforming gives the same result as transforming both and then adding the outputs.

Together those two properties mean the transformation behaves uniformly across the whole space. If you know what it does to the basis vectors (one per standard axis), you know what it does to every vector, because every other vector is just a scaled sum of those basis vectors.
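The two properties can be checked numerically. This is a small sketch with arbitrarily chosen numbers; `matvec` and `shift` are hypothetical helper names. The `shift` function is included as a contrast: adding a constant offset looks harmless but fails the scaling test, so it is not linear.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[2, 1],
     [0, 1]]
u, v, c = [3, 1], [-1, 4], 5

# Scaling: A(c·v) equals c·(A v).
scale_then_apply = matvec(A, [c * x for x in v])
apply_then_scale = [c * y for y in matvec(A, v)]

# Addition: A(u + v) equals A u + A v.
add_then_apply = matvec(A, [a + b for a, b in zip(u, v)])
apply_then_add = [a + b for a, b in zip(matvec(A, u), matvec(A, v))]

# Contrast: shifting every coordinate by 1 is not a linear transformation.
def shift(vec):
    return [x + 1 for x in vec]

shift_respects_scaling = shift([c * x for x in v]) == [c * y for y in shift(v)]

print(scale_then_apply, apply_then_scale)  # [10, 20] [10, 20]
print(add_then_apply, apply_then_add)      # [9, 5] [9, 5]
print(shift_respects_scaling)              # False
```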

This is the structural reason matrices are so useful. The transformation is fully described by what it does to the basis vectors. Each column of the matrix is, literally, "this is where the corresponding basis vector lands". Read a matrix that way and the grid of numbers stops looking arbitrary.

mechanism · what a matrix's columns mean

Column 1 of the matrix is where the first basis vector ends up after the transformation. Column 2 is where the second basis vector ends up. And so on. Once you've placed the new basis vectors, every other vector follows by linearity: it lands wherever the same combination of new basis vectors puts it.
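A quick way to convince yourself of the column reading is to feed the basis vectors in and look at what comes out. The matrix below is an arbitrary example and `matvec` is a hypothetical helper.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[2, 1],
     [0, 1]]
e1, e2 = [1, 0], [0, 1]

col1 = matvec(A, e1)   # where the first basis vector lands
col2 = matvec(A, e2)   # where the second basis vector lands

# v = 3·e1 + 1·e2, so by linearity A·v = 3·col1 + 1·col2.
v = [3, 1]
from_columns = [3 * a + 1 * b for a, b in zip(col1, col2)]
direct = matvec(A, v)

print(col1, col2)            # [2, 0] [1, 1]
print(from_columns, direct)  # [7, 1] [7, 1]
```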

The four canonical things matrices do

Almost every linear transformation you'll meet in AI is a combination of four primitive moves: stretching, compressing, rotating, and shearing. Each one has a clean visual story. Holding the four pictures makes the entire algebra easier.

Stretching

A pure stretch scales the space along one or more axes. A 2× stretch in the x direction doubles every x-coordinate and leaves y alone. The grid widens; arrows pointing along x get longer; the relative arrangement of everything stays the same. In matrix form, a diagonal matrix with entries (2, 1) does this; the columns say "the first basis vector now points to (2, 0); the second still points to (0, 1)".

In a neural network, stretching shows up whenever a layer amplifies some learned directions and dampens others. Important features get stretched into the dominant directions of the representation; nuisance variation gets shrunk away. The optimiser shapes those stretches because that's what makes the loss go down.

Compressing

The flip side of stretching. A compression scales one or more axes toward zero. Multiply the x-coordinate of every vector by 0.5 and the grid narrows; multiply it by zero and the entire space collapses onto the y-axis. That last case is a projection: a higher-dimensional space getting flattened onto a lower-dimensional one.

Compression is where dimensions get cut. An attention layer that mixes a 4096-dim hidden state into a 1024-dim head is, structurally, a projection: a matrix whose output dimension is smaller than its input. Information about the input that lived in the directions being collapsed is lost in that step. The matrix is the editor deciding what to keep.
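The information loss is easy to see in two dimensions. In this sketch (matrices and the `matvec` helper are illustrative assumptions), two different inputs collapse to the same output, so nothing downstream can tell them apart.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

half = [[0.5, 0],
        [0,   1]]      # scale x by 0.5: the grid narrows
collapse = [[0, 0],
            [0, 1]]    # scale x by 0: everything lands on the y-axis
keep_y = [[0, 1]]      # 1x2 matrix: maps 2D to 1D, keeping only y

narrowed = matvec(half, [3, 2])
a = matvec(collapse, [3, 2])
b = matvec(collapse, [-7, 2])
y_only = matvec(keep_y, [3, 2])

# Two different inputs, one output: the x information is gone.
print(narrowed)  # [1.5, 2]
print(a, b)      # [0, 2] [0, 2]
print(y_only)    # [2]
```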

Rotating

A rotation reorients the space without stretching or compressing it. The grid swings around the origin; arrows keep their lengths; relative angles between arrows stay the same. The 2D rotation by an angle θ (counter-clockwise) has columns (cos θ, sin θ) and (−sin θ, cos θ); each basis vector lands on the unit circle, just at a new angle.
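The claim that rotations keep lengths and relative angles can be checked directly. This is a sketch using the column formula above; the 30° angle, the test vectors, and the helper names are arbitrary choices. The dot product stands in for "relative angle", since it depends only on the two lengths and the angle between the arrows.

```python
import math

def rotation(theta):
    """2D rotation by theta radians; columns are (cos, sin) and (-sin, cos)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s,  c]]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

R = rotation(math.radians(30))
u, v = [3.0, 1.0], [-1.0, 2.0]
Ru, Rv = matvec(R, u), matvec(R, v)

# Lengths survive the rotation (up to floating-point rounding).
length_kept = math.isclose(math.hypot(*Ru), math.hypot(*u))
# So does the dot product, and therefore the angle between the arrows.
angle_kept = math.isclose(dot(Ru, Rv), dot(u, v))

print(length_kept, angle_kept)
```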

Rotations realign which directions are which. A model can use a rotation to swap which axis encodes which feature, or to align an internal representation with an external coordinate system. Rotations are common as components inside more complex transformations (think of a change of basis: in the simplest case it's a rotation, sometimes with a stretch).

Shearing

A shear tilts the space. Horizontal lines stay horizontal; vertical lines tilt over, because every point slides sideways by an amount proportional to its height. The grid turns into a parallelogram. Shears are the workhorse of "the input axes weren't quite the right axes": a shear can take a correlated set of directions and tilt them into a less correlated one.

In learned representations, shears appear whenever a layer needs to mix some directions into others while leaving the rest alone. Combined with stretches, shears can decorrelate features, separate previously-overlapping classes, and rotate one part of the space without disturbing another. The richness of "any linear transformation" comes from combining shears, stretches, and rotations across many dimensions at once.

Space deformation, in one picture

Words can only carry these so far. The grid picture makes the four operations sit next to each other.

FIG 14.2. The four primitive bends. Blue grid is the original space; amber grid is what the matrix did to it. Every linear transformation, in any number of dimensions, is built from combinations of these four shapes. A multi-layer neural network is many of these stacked together, applied to learned representations.
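If the figure isn't in front of you, the same picture can be reproduced by pushing the corners of a unit square through one example matrix per bend. The specific matrices here are illustrative choices (a 2× stretch, a 0.5× compression, a 90° rotation, a unit shear), and `matvec` is a hypothetical helper.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Corners of the unit square, counter-clockwise from the origin.
square = [[0, 0], [1, 0], [1, 1], [0, 1]]

bends = {
    "stretch":  [[2, 0], [0, 1]],    # x doubled, y untouched
    "compress": [[0.5, 0], [0, 1]],  # x halved, y untouched
    "rotate":   [[0, -1], [1, 0]],   # 90 degrees counter-clockwise
    "shear":    [[1, 1], [0, 1]],    # x shifted by the height y
}

bent = {name: [matvec(M, p) for p in square] for name, M in bends.items()}

for name, corners in bent.items():
    print(name, corners)
```

Every bend leaves the origin where it was; that is a property of all linear transformations, not just these four.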

Composing transformations

One matrix bends the space once. Apply a second matrix and the space gets bent again, starting from where the first bend left it. That's composition. The combined effect of "first do A, then do B" is itself a linear transformation, so it can also be written as a matrix. That matrix is called the product B·A, and computing it is what matrix multiplication actually means.

flowchart LR
  V["v"]:::vec --> A[["A<br/>first transform"]]:::mat --> AV["A·v"]:::vec --> B[["B<br/>second transform"]]:::mat --> BAV["B·A·v"]:::vec
  AV -.->|equivalent to| BA[["(B·A)<br/>composed matrix"]]:::comp
  V -.-> BA
  BA -.-> BAV
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef mat fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef comp fill:#1d2230,stroke:#4ade80,color:#e6e8ee,stroke-dasharray:4 3;

FIG 14.3. Composing transformations. Doing A then B is the same as doing the single composed transformation (B·A) once. That's why matrix multiplication is defined the way it is: it lets us combine many transformations into one. The order matters (B·A is generally not equal to A·B).

Composition is why the order in a product matters. Stretching first and then rotating gives a different result from rotating first and then stretching. The grid ends up in a different shape. In matrix form: B·A and A·B are usually different transformations, and the algebra correctly tracks the order.
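Both claims, that B·A is the single matrix for "A first, then B" and that the order matters, can be verified with a stretch and a rotation. This is a sketch with hypothetical helpers `matvec` and `matmul`; the matrices are arbitrary examples.

```python
def matvec(A, v):
    """Apply matrix A (a list of rows) to vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(B, A):
    """Product B·A: the single matrix for 'apply A first, then B'."""
    cols = list(zip(*A))  # columns of A
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]

A = [[2, 0],
     [0, 1]]    # stretch x by 2
B = [[0, -1],
     [1,  0]]   # rotate 90 degrees counter-clockwise
v = [3, 1]

two_steps = matvec(B, matvec(A, v))   # bend twice
BA = matmul(B, A)
one_step = matvec(BA, v)              # bend once with the composed matrix
AB = matmul(A, B)                     # the other order

print(two_steps, one_step)  # [-1, 6] [-1, 6]
print(BA)                   # [[0, -1], [2, 0]]
print(AB)                   # [[0, -2], [1, 0]]
```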

This is also why you can fuse layers in a model. If two consecutive layers are pure linear maps (no nonlinearity between them), the pair is mathematically equivalent to a single matrix that combines both. Real architectures interleave nonlinearities precisely so the layers can't be collapsed: each nonlinearity prevents the composition from being just another linear map, and that's what gives the network its expressive power. Without the nonlinearity, "a deep stack of layers" reduces to "one layer", which couldn't model anything beyond a linear function.
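Here is a minimal sketch of the collapse. The weights are made up for illustration and ReLU stands in for the nonlinearity (an assumption; any element-wise nonlinearity makes the same point). Two purely linear layers match their fused product exactly. With a ReLU between them the output differs, and the network stops obeying the addition property, so no single matrix can reproduce it.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(B, A):
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]

def relu(v):
    return [max(0, x) for x in v]

W1 = [[1, -1],
      [1,  1]]
W2 = [[1, 1]]
x = [1, 2]

# Two linear layers versus the single fused matrix W2·W1.
stacked = matvec(W2, matvec(W1, x))
fused = matvec(matmul(W2, W1), x)

# The same layers with a ReLU in between.
def net(vec):
    return matvec(W2, relu(matvec(W1, vec)))

with_relu = net(x)

# A linear map must satisfy f(x) + f(-x) = f(0) = 0. This one does not.
neg_x = [-a for a in x]
additivity_gap = net(x)[0] + net(neg_x)[0]

print(stacked, fused)   # [2] [2]
print(with_relu)        # [3]
print(additivity_gap)   # 4
```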

Neural networks are stacks of transformations

Now the central claim of this lesson, written down. A neural network is, structurally, a sequence of matrix-driven transformations of a learned representation, interleaved with small element-wise nonlinearities. Each layer takes the previous representation as input, applies a learned matrix to it, applies a nonlinearity, and hands the result to the next layer.

flowchart LR
  X["x<br/>input"]:::vec --> W1[["W₁ · σ"]]:::mat --> H1["h₁"]:::vec --> W2[["W₂ · σ"]]:::mat --> H2["h₂"]:::vec --> W3[["W₃ · σ"]]:::mat --> H3["h₃"]:::vec --> WN[["W_n"]]:::mat --> Y["y<br/>output"]:::vec
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef mat fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;

FIG 14.4. A neural network as transformation stack. Each W_i is a matrix; σ is the nonlinearity that prevents the stack from collapsing into a single linear map. The hidden states h_i live in successively bent representation spaces, each shaped by all the matrices below it.

Every box marked W_i in that picture is a matrix the model has learned. Together they encode what the model "knows": which directions in the input matter, how features get combined, which axes the final layer needs to read out to produce the answer. Training is the process of changing those matrices' entries so the loss goes down.
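The stack in the figure can be written as a short loop. Treat this as a sketch under stated assumptions: the weights are invented rather than learned, ReLU plays the role of σ, bias terms are left out, and `forward` is a hypothetical name. Note how the representation changes dimension (2 → 3 → 2 → 1) as it moves through the matrices.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def relu(v):
    return [max(0, x) for x in v]

def forward(x, hidden_weights, W_out):
    """Apply each hidden matrix followed by ReLU, then a final linear readout."""
    h, states = x, []
    for W in hidden_weights:
        h = relu(matvec(W, h))
        states.append(h)
    return matvec(W_out, h), states

# Made-up weights, for illustration only.
W1 = [[1, 0],
      [0, 1],
      [1, -1]]          # 2D -> 3D
W2 = [[1, 1, 0],
      [0, -1, 2]]       # 3D -> 2D
W_out = [[1, -1]]       # 2D -> 1D

y, states = forward([3, 1], [W1, W2], W_out)

print(states)  # [[3, 1, 2], [4, 3]]
print(y)       # [1]
```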

That picture is structurally the same for almost everything: image classifiers, language models, recommender ranking models, protein folding networks. The boxes get fancier (attention is its own internal arrangement of matrix multiplications; convolutions are matrices with special sparsity patterns; transformer blocks are repeated copies of a particular wiring), but the underlying objects are matrices acting on representations. Reading any architecture is, in the end, reading a graph of matrix multiplications.

Embeddings move through learned transformations

Phase 1 introduced embeddings: vectors that encode meaning as direction in a learned space. L9 made the embedding table feel like geometry. What that account left implicit was how a model uses an embedding once it has one.

The answer is now visible. The embedding is the first vector in the chain. Every layer after the input applies its own matrix, which moves that vector through a sequence of latent spaces. By the time the embedding reaches the last layer, it's been transformed many times. Each transformation reshaped what the dimensions mean. The "raw" embedding of the word "dog" enters one space; deep inside the model, that vector has been bent into a representation that includes context, syntactic role, sentiment, and whatever else the loss rewarded the network for tracking.

This is what "the model is doing its real work in latent space" means. The transformations between layers are what gradually mould the input embedding into something the final layer can act on. Without the matrices, the embedding never gets transformed; with them, it travels through dozens of learned coordinate systems before producing an output.

Feature extraction, mechanically

"Feature extraction" is a phrase you'll see all over AI material. It usually gestures at something fuzzy. With matrices in hand, it's concrete: feature extraction is what happens when a matrix is applied to a representation and the output coordinates correspond to higher-level features than the input coordinates did.

A 2D image fed into a CNN starts in pixel space (axes = individual pixels). After the first convolutional layer (which is a matrix with a special structure), the representation is in a space whose axes correspond to small local patterns: edges at various orientations, colour blobs, low-level textures. After more layers, the axes correspond to parts (eyes, wheels, leaves). Deeper still, the axes correspond to object identities. Each step is a matrix doing a transformation that re-bases the space onto more abstract features.
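A toy version of "re-basing onto features" can be written with a matrix whose rows act as detectors: each output coordinate is the dot product of one row with the input. The rows below are hand-written step detectors on a four-sample signal, not learned weights, and the repeated (−1, 1) pattern is a deliberately simple stand-in for the weight sharing a convolution uses.

```python
def matvec(A, v):
    """Each output coordinate is the dot product of one row with the input."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Hand-written detectors (an illustrative assumption, not trained values).
W = [
    [-1,  1,  0, 0],   # step between positions 0 and 1
    [ 0, -1,  1, 0],   # step between positions 1 and 2
    [ 0,  0, -1, 1],   # step between positions 2 and 3
]

signal = [0, 0, 5, 5]           # "pixel space": one axis per sample
features = matvec(W, signal)    # "edge space": one axis per detector

print(features)  # [0, 5, 0]
```

The input axes were raw sample values; the output axes answer "is there a step here?". That is the same move the convolutional layer makes, at toy scale.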

"The model learned to extract features" is, mechanically, "the model learned matrices whose rows correspond to features useful for the task". The features are columns and rows of those matrices, baked in during training. That's it. No separate machinery; the matrices are the features.

Why this is the language of deep learning

Three reasons matrices show up everywhere.

First, expressiveness. The set of linear transformations is rich enough that, combined with nonlinearities, the resulting compositions can approximate essentially any function the task demands. The mathematical phrase is "universal function approximation"; the practical consequence is that you don't need a fundamentally new kind of object to build a working model. You need stacks of these.

Second, gradients. The derivative of "a matrix applied to a vector" with respect to the matrix entries is clean and computable. That makes the whole network differentiable, which is what lets gradient descent work. Other operations either have nasty derivatives or aren't differentiable at all; matrices are the operation that plays well with the optimisation machinery.

Third, hardware. Matrix multiplication maps cleanly onto the silicon that became cheap enough to scale in the 2010s. A matrix multiply on a GPU is thousands of small arithmetic units running in parallel; the operation is hardware-friendly in a way few alternatives are. This is the line L1 and S1 both came back to: hardware shaped which architectures won, and the architectures that won were the ones built around matmul. Phase 3 will name the silicon piece by piece. Phase 4 will read the surviving architectures. Both rest on the matrix as the load-bearing operation.

mechanism · why deep learning is matrix-shaped

Matrices are rich enough to express any linear transformation, and combined with nonlinearities, any function we'd want to fit.
They're differentiable, so the optimiser can change their entries by gradient descent.
They map onto matmul-shaped silicon (tensor cores, GPUs, TPUs) that's an order of magnitude faster at this one operation than at general-purpose code.
Three reasons, one verdict: the field built itself around the matrix because each of those three was load-bearing.

Costs, briefly

Matrices cost two things: memory to store, and compute to apply.

Storage is the easy part. A matrix mapping a d-dimensional input to a d-dimensional output has d² entries. In fp16 (2 bytes each), a 4096 × 4096 matrix is 32 MiB (33,554,432 bytes). A frontier model has hundreds of such matrices across its layers, which is why parameter counts climb into the tens or hundreds of billions.

Compute is the load-bearing part. Applying a (d_out × d_in) matrix to a length-d_in vector takes d_out × d_in multiply-adds. Doubling the hidden width quadruples that cost per layer (the d² scaling L13 named). Across a deep model, those multiplications are the bulk of training and inference compute. Phase 3 will return to this when tensor cores and the roofline model land; for now, hold "matmul is the operation, and it's the operation that the whole hardware story is written around".
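The arithmetic behind both costs fits in a few lines. This is a back-of-envelope sketch: `matrix_cost` is a hypothetical helper, it counts one multiply-add per matrix entry per input vector, and it ignores biases, activations, and batching.

```python
def matrix_cost(d_in, d_out, bytes_per_entry=2):
    """Storage and per-vector compute for one dense d_out x d_in matrix."""
    entries = d_in * d_out
    return {
        "entries": entries,
        "bytes": entries * bytes_per_entry,   # fp16 by default
        "macs_per_vector": entries,           # one multiply-add per entry
    }

base = matrix_cost(4096, 4096)
wide = matrix_cost(8192, 8192)

mib = base["bytes"] / 2**20
ratio = wide["macs_per_vector"] / base["macs_per_vector"]

print(base["entries"])  # 16777216
print(mib)              # 32.0
print(ratio)            # 4.0
```

The storage figure quoted above is in binary megabytes (2²⁰ bytes), and the ratio of 4.0 is the "doubling the width quadruples the cost" rule.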

Compute spectrum: where matrices live

Matrices appear at every tier of the compute spectrum; what changes is how aggressively they're compressed.

microcontroller. Tiny weight matrices (often int8), heavy quantisation, often only a handful of small layers. Matmul on a small MCU is still the dominant cost.

mobile / edge. Hidden widths 256–768 with int4 or int8 weights. NPUs accelerate matmul directly; the model is a chain of matrices sized to fit RAM and battery.

workstation. 4096-wide matrices at fp16/bf16, dozens to hundreds of layers. Tensor cores on consumer GPUs already make matmul by far the cheapest operation per unit of arithmetic.

hyperscale. 8192–18432-wide matrices, mixed-precision matmul on dedicated AI accelerators, sharded across thousands of chips. The whole infrastructure exists to keep matmul fed.

Same operation, different scale. Whether you're squeezing a tiny classifier into a microcontroller or training a 400B model on a cluster, the kernel doing the actual work is matrix multiplication.

What lands and what doesn't

A few things the matrix view does not, by itself, do.

It doesn't explain how the matrices are chosen. Their entries come from training; that's the next phase's territory. This lesson is about what the matrices, once chosen, mean.

It doesn't carry the nonlinearity. Pure linear stacks collapse to one linear map; the σ in FIG 14.4 is what lets depth do anything beyond what a single layer could. The nonlinearity gets its own treatment when we read transformer blocks.

It doesn't yet describe special-structured matrices: convolutions (matrices with weight sharing), attention (matrices whose entries are themselves computed from input), embeddings (matrices indexed by integer ids). Each will get its own lesson; each will be a matrix with a particular shape.

What it does do is install the right object as the primary one. Reading any model from here on, the question to put to every box is "what matrix is doing the work, and what space does it map between?". That habit is what this lesson exists to build.

compression · what to carry forward

A matrix is a machine that transforms vectors. The grid of numbers is the description, not the thing.
Linear means: the same transformation applied uniformly to every vector at once.
Each column of a matrix is "where the corresponding basis vector lands after the transformation".
Every linear transformation is a combination of four primitives: stretch, compress, rotate, shear.
Matrix multiplication composes transformations: (B·A) means "do A first, then B".
Neural networks are stacks of matrices acting on representations, interleaved with nonlinearities that prevent the stack from collapsing.
Embeddings move through this chain of transformations; each layer's matrix re-bases the representation onto a new set of features.
Matrices won as the basic object because they're expressive, differentiable, and match matmul-shaped silicon.

What you should be able to do now

Explain why a matrix is more usefully thought of as a transformation than as a table.
Describe what each column of a matrix represents geometrically.
Name the four primitive bends and sketch what each one does to a square grid.
Explain what matrix multiplication means in terms of composing transformations.
Sketch a neural network as a chain of matrices and say why the nonlinearity between them is load-bearing.
Connect "feature extraction" to a specific matrix re-basing a representation onto more abstract axes.
Trace why matmul, not something else, became the dominant operation in deep learning hardware.

Retrieval practice

Write before you reveal. Trace mechanism; don't summarise.

L14 A colleague describes a matrix as "just a 2D array of numbers". Push back from this lesson's view. What's the better mental object, and what's at stake in the difference?

The better mental object is a machine that transforms vectors: a matrix takes any vector in its input space and returns another vector in its output space, bending the whole space in one consistent way as it does so. The 2D array is one way to write the transformation down inside a chosen coordinate frame, but it's the description, not the thing. The reason this matters operationally is that if you think of matrices as arrays, matrix multiplication looks like an arithmetic ritual with no clear meaning. If you think of matrices as transformations, matrix multiplication immediately reads as "do this transformation, then that one", and the order-dependence (B·A is generally not A·B) stops being mysterious. Almost every deep-learning intuition (why nonlinearities matter, what a layer does to an embedding, why doubling width quadruples compute, why attention is what it is) flows cleanly from the transformation view and resists the array view.

L14 Sketch in words what the matrix [[2, 1], [0, 1]] does to a square grid. Decompose the answer into primitive bends.

Read the columns. Column 1 is (2, 0): the first basis vector (which used to point at (1, 0)) now points at (2, 0). That's a stretch by 2 along x. Column 2 is (1, 1): the second basis vector (which used to point at (0, 1)) now points at (1, 1). That's a shear (the y-axis has tilted to the right by 1 unit per unit of y). So on a square grid, the transformation does two things: it stretches the grid horizontally to twice its width, and it shears the upper edge to the right by one grid square. The result is a parallelogram, twice as wide as it is tall and tilted. No rotation and no vertical scaling. The matrix is the combination of "stretch in x by 2" followed by "shear in x by y"; the order matters, because shearing first and then stretching would also double the tilt. This decomposition is how any linear transformation reads, once you've seen the four primitives: any matrix you meet can be unpacked into combinations of stretches, compressions, rotations, and shears.
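The decomposition can be checked by multiplying the two primitives back together. This is a sketch with hypothetical helpers; the factorisation shown (stretch first, then a unit shear) is one valid choice, not the only one.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(B, A):
    """Product B·A: apply A first, then B."""
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]

stretch = [[2, 0],
           [0, 1]]   # double x
shear = [[1, 1],
         [0, 1]]     # shift x by the height y

stretch_then_shear = matmul(shear, stretch)
shear_then_stretch = matmul(stretch, shear)

# Corners of the unit square under the target matrix.
square = [[0, 0], [1, 0], [1, 1], [0, 1]]
corners = [matvec(stretch_then_shear, p) for p in square]

print(stretch_then_shear)  # [[2, 1], [0, 1]]
print(shear_then_stretch)  # [[2, 2], [0, 1]]
print(corners)             # [[0, 0], [2, 0], [3, 1], [1, 1]]
```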

L14 Why is the nonlinearity (σ) between layers load-bearing? What collapses if you remove it?

A composition of linear transformations is itself a linear transformation. If two consecutive layers are pure matrix multiplications (no nonlinearity between them), the pair is mathematically equivalent to a single matrix that's the product of the two. Stack twenty such layers, no nonlinearities, and the whole stack collapses to a single matrix. Whatever function the twenty-layer network could compute, a one-layer network with that single product matrix can also compute. So depth buys you nothing if everything is linear; the network's expressive power is capped at "linear functions only". That's a serious limit because most interesting tasks are not linearly solvable in input space (XOR was the L13 example; pattern recognition in images is a much bigger example). The nonlinearity between layers (ReLU, GeLU, etc.) breaks the linear-composition law: f(σ(g(x))) is genuinely a different shape of function from f(g(x)) and from g(f(x)). Depth becomes meaningful. The network can build up nested non-linear functions whose decision surfaces are far richer than any single matrix could express. That richness, in turn, is what lets deep networks fit hard tasks at all. Take the σ out and the architecture becomes uselessly redundant; put it in and the same layers gain composability.

↩ L9 L9 said embeddings live in a learned vector space where similar things land near each other. With matrices in hand, describe what happens to an embedding as it passes through the layers of a transformer. Use the vocabulary of representation spaces and transformations.

The input embedding is a vector in the model's embedding space; that space's geometry encodes broad similarity (which words are usually used together, which images look alike). Each transformer layer applies a sequence of matrix multiplications to that vector: a projection matrix down into query/key/value spaces, an attention-weighted recombination (itself implemented as matrices acting on the V vectors), a feed-forward matrix that re-projects into another space, and another matrix that projects back. Each of those operations is a transformation that moves the vector into a slightly different representation space. By the time the vector has passed through twenty or sixty layers, it has been transformed many times; the latent space it occupies near the output is heavily reshaped from the embedding space it started in. Concretely, the input vector for "bank" enters the embedding space carrying a geometry shaped by all uses of the word; by the deeper layers, the same vector has been bent so that context (next to "river" vs next to "money") has separated the senses into different regions of the latent space. The matrices learned during training are what did the separating. So "similar things land near each other" is true in the input embedding space; in the deeper latent spaces it's still true, but the notion of "similar" has been re-defined by the transformations, layer by layer, to whatever the task rewarded. That's how a static embedding table ends up producing context-sensitive behaviour: the matrices in the middle do the work the table couldn't.

↳ Phase 2 Look ahead. Matrices and gradients both live in vector spaces; gradients are vectors. What single later operation, in Phase 2, will let you ask "if I nudge this matrix's entries a tiny bit, how much does the loss change"? Predict the answer and why it sets up backprop later.

Gradients (Phase 2 lesson on gradients and optimisation landscapes). A gradient with respect to a matrix is a matrix of the same shape: each entry is "the partial derivative of the loss with respect to this particular matrix entry". Once you can compute that gradient, you can take a tiny step in the direction that lowers the loss by adjusting every entry of the matrix at once. That's the whole training move: compute the gradient of the loss with respect to every matrix in the model, scale it by a small learning rate, and subtract it from the entries. The "with respect to every matrix" part is what backpropagation is: a systematic way to compute those gradients efficiently by walking backward through the chain of matrix multiplications you set up in this lesson. So the apparatus arrives in two layers. The gradients lesson explains what the gradient of a scalar with respect to its inputs (including matrix-shaped inputs) is. Backprop, when it lands in Phase 4, explains how to actually compute those gradients for a deep stack. The matrix-as-transformation view from this lesson is what makes both make sense: the gradient is asking "how does bending the space slightly differently affect the loss?", and the answer is precisely the direction the matrix's entries should move.
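As a preview, here is what "a gradient the same shape as the matrix" looks like for the simplest case. Everything specific is an assumption made for illustration: a single matrix with no nonlinearity, a squared-error loss L(W) = ½‖W·x − y‖², made-up numbers, and a learning rate of 0.05. For that loss the gradient entry (i, j) works out to (W·x − y)ᵢ · xⱼ, and the finite-difference check confirms one entry numerically.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def loss(W, x, y):
    """Squared error: 0.5 * ||W·x - y||^2 (an illustrative choice of loss)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(matvec(W, x), y))

def grad(W, x, y):
    """Gradient of the loss w.r.t. W: entry (i, j) is (W·x - y)[i] * x[j]."""
    err = [p - t for p, t in zip(matvec(W, x), y)]
    return [[e * xj for xj in x] for e in err]

W = [[1.0, 0.0],
     [0.0, 1.0]]
x, y = [3.0, 1.0], [1.0, 2.0]

G = grad(W, x, y)                       # same shape as W
lr = 0.05                               # small positive learning rate
W_new = [[w - lr * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, G)]

# Finite-difference check of the top-left gradient entry.
eps = 1e-6
W_eps = [row[:] for row in W]
W_eps[0][0] += eps
numeric = (loss(W_eps, x, y) - loss(W, x, y)) / eps

print(G)                                   # [[6.0, 2.0], [-3.0, -1.0]]
print(loss(W, x, y), loss(W_new, x, y))    # loss before and after one step
print(numeric)                             # close to 6.0
```

One step moves every entry at once and the loss drops; backprop is the bookkeeping that produces the same kind of gradient for every matrix in a deep stack.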

Next station

You now have arrows, geometry, dimensions, and the machine that bends a whole space. The next move stays in geometry but changes the question from how to move a representation to what to keep. A matrix can shrink a space as well as bend it, and when it does, some information survives and the rest is discarded. That choice, what to preserve and what to throw away, runs through embeddings, attention, and feature extraction. L15 puts the stack of transparent sheets on the wall.
