L14 left you with the matrix: a machine that bends a whole space, every vector at once. That explains how a model moves a representation from one space to another. It leaves a sharper question open. When a layer turns a 4096-number hidden state into a 64-number attention head, where does the information in the other dimensions go?
It gets dropped. On purpose.
That drop is what this lesson is about. A projection takes a representation and produces a smaller or simpler one, keeping some of the information and discarding the rest. Every embedding, every attention head, every pooling layer, every feature extractor is a version of it.
The reason it matters reaches past any single operation. A system that keeps everything about its input has learned nothing; it's a copy machine. A system that keeps the right things has found structure. Deciding what to keep is most of what intelligence, biological or artificial, actually does.
Start with a shadow. Hold a 3D object up to a light and it casts a 2D shadow on the wall. The shadow is a projection: a map from a higher-dimensional thing (the object) to a lower-dimensional one (the outline).
The shadow keeps some information. You can often recognise a hand, a chair, a person from the shadow alone. It also loses information: depth is gone, and two different objects can cast the same shadow. That pairing, keep some and lose some, is the whole idea.
Mechanically, a projection is a transformation (often a matrix, from L14) whose output lives in fewer dimensions than its input, or in a restricted part of the same space. Apply it and the directions you chose to care about survive; the directions it was built to ignore collapse to nothing.
A projection can't be reversed. Once depth is gone from the shadow, no amount of staring at the wall brings it back. In matrix terms, the projection sends many different inputs to the same output: every object with a given outline maps to that one outline. The map is many-to-one, and a many-to-one map has no inverse.
Worked example with small numbers. Take a 3D point (4, 7, 9). Project onto the first two axes (the x-y plane): keep x and y, drop z. The output is (4, 7). The 9 is gone. The points (4, 7, 9), (4, 7, 0), and (4, 7, −2) all land on (4, 7). Three different inputs, one output. The information that separated them lived entirely in the third coordinate, and the projection chose not to keep it.
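The worked example above can be written as a matrix acting on points. This is a minimal sketch using NumPy; the names `P`, `points`, and `projected` are illustrative and not from the lesson.

```python
import numpy as np

# Projection onto the x-y plane: a 2x3 matrix that keeps x and y and drops z.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# The three points from the worked example, one per row.
points = np.array([[4.0, 7.0, 9.0],
                   [4.0, 7.0, 0.0],
                   [4.0, 7.0, -2.0]])

# Apply P to every point at once.
projected = points @ P.T
print(projected)                      # every row is [4. 7.]

# Many-to-one: three distinct inputs are indistinguishable after projection.
all_same = bool(np.allclose(projected, projected[0]))

# Rank 2 with 3 input dimensions: one direction collapses, so no inverse exists.
rank = int(np.linalg.matrix_rank(P))
```

The rank is the number of directions that survive. Whenever it is smaller than the input dimension, some direction has been flattened and cannot be recovered.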
Naming the cost is half the lesson: a projection that reduces dimension is lossy by construction. The only question that matters is whether it kept the right information.
Discarding sounds like damage. Most of the time it's the goal. Four reasons.
Noise. Real inputs carry variation you don't want: sensor jitter, lighting, phrasing, measurement error. A projection that drops the noisy directions and keeps the signal directions hands the next stage a cleaner input. Phase 1 (L7) called this invariance: a good representation is unchanged by transformations that don't matter. A projection is how invariance gets built.
Generalisation. A model that keeps every detail of its training examples memorises them. L3 made memorisation the enemy of generalisation. Dropping surface detail forces the representation toward the structure shared across examples, which is the part that transfers to inputs the model hasn't seen.
Compute and memory. A 64-dim representation costs a fraction of a 4096-dim one to store and to multiply. L13 and L14 named the d² cost of width. Projecting down is how a model pays less while keeping the part it needs.
Tractability. Many problems are easy in the right low-dimensional view and hopeless in the raw one. Find the 2 directions that separate your classes and the classification is trivial; work in the original 1000 dimensions and it's a fog.
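The compute-and-memory reason is easy to check with arithmetic. The sketch below assumes plain dense matrices with no bias term and counts one multiply-add per weight; the sizes are the lesson's own 4096 and 64.

```python
d_wide, d_narrow = 4096, 64   # sizes from the lesson's running example

# Numbers needed to store one representation of each width.
store_wide, store_narrow = d_wide, d_narrow

# Multiply-adds for one dense matrix-vector product (assumption: no bias).
square_layer = d_wide * d_wide        # 4096 -> 4096
projection = d_wide * d_narrow        # 4096 -> 64

print(square_layer)                   # 16777216
print(projection)                     # 262144
print(square_layer // projection)     # 64
```

Under these assumptions the narrow projection needs 64 times fewer multiply-adds than the square layer, and its output takes 64 times less storage.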
The second reason is one of the course's recurring laws showing up again: geometry enables generalisation. A projection is the geometric act of keeping the directions that transfer across examples and dropping the ones that don't, so careful discarding is itself what produces generalisation rather than a separate trick layered on top.
The set of directions a projection keeps has a name: a subspace. A subspace is a flat slice through a vector space that is itself a vector space: a line through the origin, a plane through the origin, or the higher-dimensional version of the same. The x-y plane sitting inside 3D space is a 2D subspace. A projection picks a subspace and maps everything onto it.
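A subspace doesn't have to line up with the axes. The sketch below builds an orthogonal projector onto a tilted plane through the origin. The two spanning vectors and the test vector `v` are made up for illustration. Orthogonal projection is one common choice of projection, not the only one.

```python
import numpy as np

# Two hypothetical directions spanning a tilted plane through the origin in 3D.
# Columns are the spanning vectors (1, 1, 0) and (0, 1, 1).
basis = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])

# Orthonormalise the columns; Q spans the same plane.
Q, _ = np.linalg.qr(basis)            # Q has shape (3, 2)

# Orthogonal projector onto the plane: a 3x3 matrix of rank 2.
P = Q @ Q.T

v = np.array([3.0, -1.0, 2.0])
kept = P @ v                          # the part of v that lies in the plane
dropped = v - kept                    # the part the projection discards

# Projecting twice changes nothing: the output is already in the subspace.
idempotent = bool(np.allclose(P @ P, P))
# The discarded part is perpendicular to every direction in the plane.
orthogonal = bool(np.allclose(basis.T @ dropped, 0.0))
```

Every vector splits into a kept part inside the subspace and a dropped part perpendicular to it. The projection returns the first and throws away the second.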
Here's the load-bearing claim. The structure that matters in real data usually lives in a subspace much smaller than the space it's written in. A 4096-dim hidden state rarely uses 4096 independent directions; L13 called this the gap between nominal and effective dimensionality. Most of the real variation typically sits in far fewer directions, often a few hundred, and the rest is close to empty. Find that subspace and you can represent the data in far fewer numbers with almost no loss.
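One way to see the gap between nominal and effective dimensionality is to build data that only varies along a few directions and then count how many directions carry the variance. Everything here is synthetic and hypothetical: the sizes, the noise level, and the 99% threshold are assumptions chosen for the sketch, not measurements of a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, nominal_dim, true_dim = 500, 64, 5   # hypothetical sizes

# Data that really varies along only `true_dim` directions, plus faint noise.
latent = rng.normal(size=(n_samples, true_dim))
mixing = rng.normal(size=(true_dim, nominal_dim))
data = latent @ mixing + 0.01 * rng.normal(size=(n_samples, nominal_dim))

# Singular values of the centred data measure variation per direction.
centred = data - data.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
variance_share = singular_values**2 / np.sum(singular_values**2)

# Smallest number of directions whose cumulative share reaches 99%.
effective_dim = int(np.searchsorted(np.cumsum(variance_share), 0.99) + 1)
print(nominal_dim, effective_dim)
```

The data is written in 64 numbers per sample, yet a handful of directions account for nearly all of its variation. The choice of threshold is a judgment call; other definitions of effective dimensionality exist.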
Put those together and a projection onto the right subspace is lossy compression that keeps the part you care about.
Image compression is the everyday example. A photo is millions of pixels. JPEG rewrites each 8×8 block of the image as a mix of frequency patterns, keeps the ones the eye notices, and coarsens or drops the rest. The file often shrinks 10× or more; the picture still looks like the picture. The compression is lossy (the discarded detail is gone for good) and useful (the structure a viewer cares about survived).
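The frequency-pattern idea can be sketched with a discrete cosine transform on one 8×8 block. This is a simplified stand-in and not the JPEG algorithm: real JPEG quantises coefficients and then entropy-codes them, whereas this sketch simply keeps the 16 lowest-frequency coefficients and zeroes the rest. The pixel block is a made-up smooth gradient.

```python
import numpy as np

N = 8                                   # block size
n = np.arange(N)
k = n.reshape(-1, 1)

# Orthonormal DCT-II matrix: row k is the k-th cosine frequency pattern.
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

# A hypothetical smooth 8x8 block of pixel intensities (a gentle gradient).
rows, cols = np.meshgrid(n, n, indexing="ij")
block = 100.0 + 10.0 * rows + 5.0 * cols

coeffs = C @ block @ C.T                # the block expressed in frequency patterns

# Keep only the 4x4 lowest-frequency coefficients: 16 numbers out of 64.
mask = np.zeros((N, N))
mask[:4, :4] = 1.0
kept = coeffs * mask

reconstructed = C.T @ kept @ C          # back to pixel space
relative_error = np.linalg.norm(block - reconstructed) / np.linalg.norm(block)
kept_fraction = mask.sum() / mask.size
print(kept_fraction, relative_error)
```

For a smooth block like this one, a quarter of the coefficients reproduce the pixels with a small relative error. A block full of sharp edges would lose more, which is the "right subspace for the data" point in miniature.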
The same shape shows up everywhere a system summarises. The skill is choosing the subspace so that "what survives" and "what the task needs" are the same set.
Once you have the picture you start seeing projections in every architecture. They are the default move in these systems, common enough that you stop noticing them.
Embeddings. An embedding (L9) takes a token, an image, or a user and projects it into a few hundred learned dimensions. Everything about the item that didn't help the training objective got dropped on the way in. The embedding is what survived the projection.
Neural network layers. A layer that maps 4096 inputs to 1024 outputs is a projection with learned weights. The matrix (L14) is chosen during training so the 1024 directions it keeps are the ones the loss rewarded keeping. Feature extraction is a stack of these.
Latent representations. The internal latent space a model works in (L7) is almost always lower-dimensional than the raw input, and the structure it needs lives in a subspace of even that. "The model works in latent space" means it works in a projected, compressed view of its input.
Dimensionality reduction. When you explicitly compute the best subspace to project onto, that's dimensionality reduction. PCA, the most common version, finds the directions of greatest variation in the data and projects onto them; keep the top few, drop the rest. t-SNE and UMAP do a nonlinear version for visualisation. The vector playground build (B3) does exactly this when it squashes 384-dim embeddings down to 2D for a plot.
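PCA fits in a few lines once you have the singular value decomposition. The sketch below uses random stand-in data with 384 columns to match the size mentioned above. It is not the B3 build's code, and real embeddings would show far more structure than random noise does.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))   # stand-in data, not real embeddings

# PCA works on centred data.
centred = embeddings - embeddings.mean(axis=0)

# Rows of Vt are the principal directions, ordered by variation explained.
_, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)

top2 = Vt[:2]                  # the 2D subspace to keep, shape (2, 384)
coords_2d = centred @ top2.T   # each item's coordinates in that subspace

print(coords_2d.shape)         # (200, 2)
```

Each row of `coords_2d` is what survives of one 384-number item. The other 382 directions are discarded, which is why a 2D plot of embeddings should be read as a view, not as the whole geometry.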
The clearest case is attention, which Phase 4 (L40) treats in full. Strip it to the idea and it's a projection the model computes on the fly.
A transformer layer holds a representation for every token in the sequence. For each position, attention asks: of all the information available across the sequence, which parts are relevant to this position right now? It keeps a weighted blend of the relevant parts and ignores the rest. The weights are computed from the input, so the selection changes from input to input.
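The "weighted blend" can be sketched as a single head of scaled dot-product attention. The sizes and the random weight matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins for learned parameters, and the sketch leaves out masking, multiple heads, and the output projection that real transformer layers include.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 4          # hypothetical, deliberately tiny

X = rng.normal(size=(seq_len, d_model))      # one representation per token
W_q = rng.normal(size=(d_model, d_head))     # stand-ins for learned weights
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

# Fixed learned projections: 16 dimensions down to 4 for this head.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Input-dependent part: how relevant is each position to each other position?
scores = Q @ K.T / np.sqrt(d_head)
weights = softmax(scores, axis=-1)           # each row sums to 1

# Each position keeps a weighted blend of the value vectors.
output = weights @ V
print(output.shape)                          # (5, 4)
```

Two kinds of selection happen here. The weight matrices pick a fixed low-dimensional view, and the `weights` matrix picks, per position and per input, how much of each token to keep.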
This is why attention was the change that unstuck long-sequence modelling. Earlier architectures had to squeeze a whole sequence into one fixed summary and hope the right information survived. Attention lets the model choose, per position, what to keep.
"Feature extraction" sounds like a separate technique. It's the same operation under another name. L14 showed a CNN re-basing pixel space into edge space, then part space, then object space. Each step is a projection: it keeps the directions that correspond to a useful feature and drops the rest.
A face detector projects a million-pixel image down to "is there a face, and where". Almost all the pixel information is irrelevant to that question, and a good detector discards it early. The features are the surviving directions; the discarding is what makes the surviving directions meaningful.
This goes well beyond deep learning, and you've almost certainly done it.
Sensor filtering. A low-pass filter on a noisy signal is a projection onto the low-frequency subspace: keep the slow-moving signal, drop the high-frequency noise. Signal processing just names the subspace in frequency terms.
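An ideal low-pass filter makes the projection literal: transform to frequency space, zero everything above a cutoff, transform back. The signal below is synthetic, a slow sine plus a fast one standing in for noise, and the cutoff bin is an arbitrary choice for the sketch. Practical filters are usually smoother than this hard cutoff.

```python
import numpy as np

n = 256
t = np.arange(n) / n
slow = np.sin(2 * np.pi * 3 * t)             # the signal: 3 cycles
fast = 0.5 * np.sin(2 * np.pi * 60 * t)      # stand-in for noise: 60 cycles
signal = slow + fast

def low_pass(x, cutoff_bin):
    """Keep frequency bins 0..cutoff_bin and zero the rest."""
    spectrum = np.fft.rfft(x)
    spectrum[cutoff_bin + 1:] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

filtered = low_pass(signal, cutoff_bin=10)

# Filtering an already-filtered signal changes nothing, as with any projection.
twice = low_pass(filtered, cutoff_bin=10)
max_residual = float(np.max(np.abs(filtered - slow)))
```

Because both sines complete a whole number of cycles, the filter recovers the slow component almost exactly. Applying it a second time does nothing, which is the signature of a projection.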
Telemetry reduction. A board streaming 200 channels rarely needs all 200 to flag a fault. Project onto the dozen channels (or the dozen linear combinations of channels) that carry the fault signature and you cut bandwidth and storage without losing the alarm.
Anomaly detection. Find the subspace that normal operation occupies, then project each new reading onto it; anything with a large component outside that subspace is, by definition, not normal. The size of the discarded part becomes the anomaly score.
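The anomaly score described above is the length of whatever the projection throws away. In this sketch the "normal" data is synthetic, lying near a 2D plane inside 6 dimensions, and the subspace is fitted with an SVD. The sizes, the noise level, and the function name `anomaly_score` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normal operation: readings near a 2D plane in 6D, plus faint noise.
plane = rng.normal(size=(2, 6))
normal = rng.normal(size=(300, 2)) @ plane + 0.01 * rng.normal(size=(300, 6))

# Fit the subspace: the top 2 principal directions of the normal data.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = Vt[:2]                               # shape (2, 6), orthonormal rows

def anomaly_score(x):
    """Length of the part of x that lies outside the fitted subspace."""
    centred = x - mean
    kept = centred @ basis.T @ basis
    return np.linalg.norm(centred - kept, axis=-1)

normal_scores = anomaly_score(normal)

# A reading pushed one unit along a direction the normal data never uses.
odd_reading = normal[0] + 1.0 * Vt[-1]
odd_score = float(anomaly_score(odd_reading))

print(float(normal_scores.max()), odd_score)
```

Normal readings leave almost nothing behind when projected. The odd reading leaves a large remainder, and that remainder is the alarm.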
Recommendation. A recommender projects users and items into a shared low-dimensional space (matrix factorisation is literally this), so "will this user like this item" becomes a dot product (L12) in a few dozen dimensions instead of a lookup over millions.
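Matrix factorisation can be sketched with a truncated SVD on a tiny dense ratings table. The ratings here are invented and built from two hidden tastes so that two dimensions suffice. Real recommenders face mostly missing entries and usually learn the factors by optimisation, which this sketch does not attempt.

```python
import numpy as np

# Hypothetical dense ratings: 4 users x 5 items, built from 2 hidden tastes.
user_taste = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]])
item_traits = np.array([[5.0, 1.0], [4.0, 0.0], [1.0, 5.0],
                        [0.0, 4.0], [3.0, 3.0]])
ratings = user_taste @ item_traits.T         # shape (4, 5)

# Factorise and keep k dimensions for both users and items.
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_vecs = U[:, :k] * np.sqrt(s[:k])        # shape (4, 2)
item_vecs = Vt[:k].T * np.sqrt(s[:k])        # shape (5, 2)

# "Will user 2 like item 4?" is now a dot product in k dimensions.
predicted = float(user_vecs[2] @ item_vecs[4])
print(round(predicted, 6), ratings[2, 4])
```

Because the toy table has exactly two underlying tastes, the two-dimensional dot product reproduces the original rating. With real data the match is approximate, and the size of the mismatch is the price of the compression.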
The instinct is the same across all of them: find the subspace where the task lives, project onto it, ignore the rest.
Every projection sets a dial. Keep more dimensions and you preserve more information, at higher cost and a higher risk of keeping noise and memorising. Keep fewer and you compress harder: the result is cheaper and more general, at the risk of throwing away something the task needed.
Capacity (L13) is the ceiling on what a representation can hold. Compression is the act of deliberately staying below that ceiling. The two pull in opposite directions, and choosing where to sit between them is a real design decision, not a default. Hold too much: expensive, noisy, prone to overfitting. Compress too hard: cheap and clean but missing the signal. The right point depends on the task and the budget, and a lot of model design is the search for it.
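One side of the dial is easy to measure: how much of the data is lost as you keep fewer directions. The sketch below uses synthetic data with a decaying spread per column and reports the relative reconstruction error for several values of `k`. It measures information loss only. The other side of the trade-off, noise and overfitting, depends on the task and is not captured here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 32 columns whose spread decays from 3.0 down to 0.1.
data = rng.normal(size=(200, 32)) * np.linspace(3.0, 0.1, 32)
centred = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)

def relative_error(k):
    """Error after projecting onto the top-k directions and mapping back."""
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    return float(np.linalg.norm(centred - approx) / np.linalg.norm(centred))

# Turn the dial: keep 1, 4, 16, then all 32 directions.
errors = {k: relative_error(k) for k in (1, 4, 16, 32)}
for k, err in errors.items():
    print(k, round(err, 4))
```

The error falls as `k` grows and reaches zero, up to rounding, only when every direction is kept. Where to stop along that curve is the design decision.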
No projection is free. Naming what you gave up (which directions, which information) is the engineering discipline this lesson is built to install.
Projecting down is the lever that lets the same idea run from a microcontroller to a data centre. What changes across the spectrum is how aggressively you compress.
Whether you're squeezing a classifier into 64 KB or training a 400B-parameter model, you're choosing a subspace and discarding the rest.
This lesson doesn't tell you how to find the best subspace; that's what training, and explicit methods like PCA, are for. It is about what a projection is and why it shows up everywhere.
It also doesn't carry the nonlinear case in full. Real models fold space with nonlinearities (L14) before projecting, so the kept subspace can be curved rather than flat. The intuition survives unchanged: keep some directions, drop the rest. The directions just aren't always straight lines.
What it does do is install a habit. When you see a representation get smaller, ask which information was kept, which was thrown away, and whether that choice fits the task. That question is the engineering instinct this lesson exists to build.
You now have arrows, geometry, dimensions, the machines that bend space, and the operation that decides what to keep. So far the geometry on the wall has been deterministic: a vector is here, a matrix sends it there, a projection drops these directions. The next move steps into uncertainty. Models don't just compute fixed outputs; they hold beliefs and assign probabilities. L16 puts the dice rack on the wall and starts the probability run.