PHASE 2 · THE WHITEBOARD WALL

Projections, subspaces, and information selection

Lesson 15. Fifth station on the whiteboard wall. ~23 min read + cards + retrieval. Durability tier 1 (bedrock; a projection is selective preservation, the idea under every representation in modern AI).

Memory palace · Whiteboard wall · station 15

The stack of transparent sheets, pinned next to the grid from L14. Light shines through the stack; each sheet filters out some detail, so only the marks that matter reach the wall. Where the layered grids (L13) added axes of distinction, this stack takes them away. The claim to keep: projection = selective preservation.

Core idea. A projection keeps some information about a thing and throws the rest away. A useful representation preserves what matters for the task; the skill is choosing what to drop. Most of modern AI is machinery for learning what to keep.

Why this lesson exists

L14 left you with the matrix: a machine that bends a whole space, every vector at once. That explains how a model moves a representation from one space to another. It leaves a sharper question open. When a layer turns a 4096-number hidden state into the 64 numbers one attention head works with, where does the information in the other dimensions go?

It gets dropped. On purpose.

That drop is what this lesson is about. A projection takes a representation and produces a smaller or simpler one, keeping some of the information and discarding the rest. Every embedding, every attention head, every pooling layer, every feature extractor is a version of it.

The reason it matters reaches past any single operation. A system that keeps everything about its input has learned nothing; it's a copy machine. A system that keeps the right things has found structure. Deciding what to keep is most of what intelligence, biological or artificial, actually does.

What a projection is

Start with a shadow. Hold a 3D object up to a light and it casts a 2D shadow on the wall. The shadow is a projection: a map from a higher-dimensional thing (the object) to a lower-dimensional one (the outline).

The shadow keeps some information. You can often recognise a hand, a chair, a person from the shadow alone. It also loses information: depth is gone, and two different objects can cast the same shadow. That pairing, keep some and lose some, is the whole idea.

Mechanically, a projection is a transformation (often a matrix, from L14) whose output lives in fewer dimensions than its input, or in a restricted part of the same space. Apply it and the directions you chose to care about survive; the directions it was built to ignore collapse to nothing.

mechanism · a projection throws away directions

Pick the directions you care about. For each input vector, keep its component along those directions; set the component along every other direction to zero. The kept directions form the subspace; everything orthogonal to them is discarded. The operation is the same whether the kept subspace is 2 directions out of 3 or 64 out of 4096.
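The mechanism above can be sketched in a few lines. This is a minimal illustration, assuming the kept directions are orthonormal (unit length and mutually perpendicular); the names `project`, `kept`, and the sample vector are made up for the example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(x, basis):
    """Project x onto the subspace spanned by `basis`.

    Assumes the basis vectors are orthonormal; with any other
    basis this simple formula does not give the right answer.
    """
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)  # component of x along this kept direction
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Keep 2 directions out of 3: a tilted plane, not an axis-aligned one.
s = 1 / math.sqrt(2)
kept = [[s, s, 0.0], [0.0, 0.0, 1.0]]

x = [3.0, 1.0, 5.0]
p = project(x, kept)                          # approximately [2, 2, 5]
residual = [xi - pi for xi, pi in zip(x, p)]  # the discarded part, approximately [1, -1, 0]

print(p)
print(residual)
```

The residual is perpendicular to every kept direction, and projecting `p` a second time changes nothing: once the discarded part is gone, there is nothing left to remove.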

FIG 15.1. A projection as a shadow. Each blue point drops straight down to its amber shadow on the plane (the kept subspace). The height (the z direction) is discarded. Two points at the same in-plane spot but different heights land on the same shadow: the map is many-to-one, which is exactly why it can't be undone.

Why a projection loses information

A projection can't be reversed. Once depth is gone from the shadow, no amount of staring at the wall brings it back. In matrix terms, the projection sends many different inputs to the same output: every object with a given outline maps to that outline. The map is many-to-one, and a many-to-one map has no inverse.

Worked example with small numbers. Take a 3D point (4, 7, 9). Project onto the first two axes (the x-y plane): keep x and y, drop z. The output is (4, 7). The 9 is gone. The points (4, 7, 9), (4, 7, 0), and (4, 7, −2) all land on (4, 7). Three different inputs, one output. The information that separated them lived entirely in the third coordinate, and the projection chose not to keep it.
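The worked example translates directly into code. A minimal sketch; the function name `drop_z` is made up for the illustration.

```python
def drop_z(point):
    """Project a 3D point onto the x-y plane by discarding z."""
    x, y, _z = point
    return (x, y)

inputs = [(4, 7, 9), (4, 7, 0), (4, 7, -2)]
outputs = [drop_z(p) for p in inputs]

print(outputs)  # [(4, 7), (4, 7), (4, 7)]
print(len(set(inputs)), "distinct inputs ->", len(set(outputs)), "distinct output")
```

No function can take `(4, 7)` and return the right one of the three inputs, because nothing in `(4, 7)` says which one it came from.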

So naming the cost is half the lesson: a projection onto a smaller subspace is lossy by construction. It always loses information. The only question that matters is whether it kept the right information.

Why losing information is useful

Discarding sounds like damage. Most of the time it's the goal. Four reasons.

Noise. Real inputs carry variation you don't want: sensor jitter, lighting, phrasing, measurement error. A projection that drops the noisy directions and keeps the signal directions hands the next stage a cleaner input. Phase 1 (L7) called this invariance: a good representation is unchanged by transformations that don't matter. A projection is how invariance gets built.

Generalisation. A model that keeps every detail of its training examples memorises them. L3 made memorisation the enemy of generalisation. Dropping surface detail forces the representation toward the structure shared across examples, which is the part that transfers to inputs the model hasn't seen.

Compute and memory. A 64-dim representation costs a fraction of a 4096-dim one to store and to multiply. L13 and L14 named the d² cost of width. Projecting down is how a model pays less while keeping the part it needs.

Tractability. Many problems are easy in the right low-dimensional view and hopeless in the raw one. Find the 2 directions that separate your classes and the classification is trivial; work in the original 1000 dimensions and it's a fog.

The second reason is one of the course's recurring laws showing up again: geometry enables generalisation. A projection is the geometric act of keeping the directions that transfer across examples and dropping the ones that don't, so careful discarding is itself what produces generalisation rather than a separate trick layered on top.
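The compute-and-memory reason can be made concrete with arithmetic. A rough sketch, assuming one stored number per dimension and about one multiply per weight-matrix entry; it ignores biases, batching, and hardware effects.

```python
wide, narrow = 4096, 64

# Storing one vector: one number per dimension.
storage_ratio = wide // narrow                  # 64x fewer numbers to store

# A square d-by-d weight matrix has d*d entries, and a matrix-vector
# product does roughly one multiply per entry.
square_wide = wide * wide                       # 16_777_216 multiplies
square_narrow = narrow * narrow                 # 4_096 multiplies
multiply_ratio = square_wide // square_narrow   # 4096x fewer multiplies

# The projection itself (4096 -> 64) is a 64-by-4096 matrix.
projection_cost = wide * narrow                 # 262_144 multiplies, paid once per vector

print(storage_ratio, multiply_ratio, projection_cost)
```

Narrowing by 64× cuts the d² work by 4096×, so it can be worth paying for the projection step itself.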

What a subspace is

The set of directions a projection keeps has a name: a subspace. A subspace is a flat slice through a vector space that is itself a vector space: a line through the origin, a plane through the origin, or the higher-dimensional version of the same. The x-y plane sitting inside 3D space is a 2D subspace. A projection picks a subspace and maps everything onto it.

Here's the load-bearing claim. The structure that matters in real data usually lives in a subspace much smaller than the space it's written in. A 4096-dim hidden state rarely uses 4096 independent directions; L13 called this the gap between nominal and effective dimensionality. The real variation often sits in a few hundred directions, and the rest is close to empty. Find that subspace and you can represent the data in far fewer numbers with almost no loss.

Diagram: full space (e.g. 4096 nominal dims) → relevant subspace (a few hundred real dims). Dashed branch labelled "near-empty directions": full space → the rest (dropped with little loss).

FIG 15.2. Where the structure lives. The data is written in a wide space, but its real variation occupies a smaller subspace. Projecting onto that subspace keeps almost everything that matters and discards directions that were nearly empty to begin with.

Compression that preserves structure

Put those together and a projection onto the right subspace is lossy compression that keeps the part you care about.

Image compression is the everyday example. A photo is millions of pixels. JPEG rewrites each 8×8 block of the image as a mix of frequency patterns, stores the ones the eye notices, and rounds most of the fine-detail ones to zero. The file often shrinks 10× or more; the picture still looks like the picture. The compression is lossy (the discarded detail is gone for good) and useful (the structure a viewer cares about survived).
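A one-dimensional toy shows the same move. This is a sketch of the idea only, not JPEG's actual pipeline, which works on 2D blocks and includes steps not shown here. The block values and the choice to keep 3 coefficients are made up for the illustration.

```python
import math

def _scale(k, n):
    """Normalisation that makes the cosine patterns orthonormal."""
    return math.sqrt((1 if k == 0 else 2) / n)

def dct(block):
    """Orthonormal DCT-II: express a block as weights on cosine patterns."""
    n = len(block)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(block))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct: rebuild the block from its cosine weights."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

block = [52, 55, 61, 66, 70, 61, 64, 73]       # made-up brightness values
coeffs = dct(block)

# Keep the 3 lowest-frequency patterns, zero the other 5.
kept = coeffs[:3] + [0.0] * (len(coeffs) - 3)
approx = idct(kept)

worst_error = max(abs(a - b) for a, b in zip(block, approx))
print([round(v, 1) for v in approx])
print(round(worst_error, 2))
```

Eight numbers went in and three were stored. The reconstruction follows the overall shape of the block but not its fine wiggles, and the five zeroed coefficients cannot be recovered from `approx`.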

The same shape shows up everywhere a system summarises. The skill is choosing the subspace so that "what survives" and "what the task needs" are the same set.

Projections are everywhere in modern AI

Once you have the picture you start seeing projections in every architecture. They are the default move in these systems, common enough that you stop noticing them.

Diagram: an object (image · token · user · signal) → many features (a high-dimensional description).

FIG 15.3. Anything real can be described by a large pile of features. A raw object starts life high-dimensional: every pixel, every measurable property is a coordinate.

Diagram: many features (high-dimensional) → projection → fewer features (what matters). Dashed branch labelled "thrown away": projection → noise · redundancy · irrelevant detail.

FIG 15.4. The projection step. The high-dimensional description goes in; a smaller set of features comes out; the directions the projection was built to ignore are discarded. This is the move repeated all over modern AI.

Embeddings. An embedding (L9) takes a token, an image, or a user and projects it into a few hundred learned dimensions. Everything about the item that didn't help the training objective got dropped on the way in. The embedding is what survived the projection.

Neural network layers. A layer that maps 4096 inputs to 1024 outputs is a projection with learned weights. The matrix (L14) is chosen during training so the 1024 directions it keeps are the ones the loss rewarded keeping. Feature extraction is a stack of these.

Diagram: input (raw, high-dim) → projection (learned layer) → representation (compressed) → read-out → prediction.

FIG 15.5. A model as projection then read-out. The input is projected into a compressed representation that keeps the task-relevant structure; a final read-out maps that to a prediction. The middle box is where the model does its real work, and it is smaller than the input by design.

Latent representations. The internal latent space a model works in (L7) is almost always lower-dimensional than the raw input, and the structure it needs lives in a subspace of even that. "The model works in latent space" means it works in a projected, compressed view of its input.

Dimensionality reduction. When you explicitly compute the best subspace to project onto, that's dimensionality reduction. PCA, the most common version, finds the directions of greatest variation in the data and projects onto them; keep the top few, drop the rest. t-SNE and UMAP do a nonlinear version for visualisation. The vector playground build (B3) does exactly this when it squashes 384-dim embeddings down to 2D for a plot.
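The PCA step can be sketched for the smallest interesting case, 2D data reduced to 1D. This sketch uses the closed-form angle for a 2×2 covariance matrix, so it only works in two dimensions; real PCA uses a general eigenvalue or singular-value solver. The data points and the name `principal_direction` are made up for the example.

```python
import math

def principal_direction(points):
    """Return (unit direction of greatest spread, fraction of variance along it).

    2D only: relies on the closed-form principal-axis angle of a
    2x2 covariance matrix.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # angle of the main axis
    c, s = math.cos(theta), math.sin(theta)
    variance_along = sxx * c * c + 2 * sxy * c * s + syy * s * s
    return (c, s), variance_along / (sxx + syy)

# Made-up points scattered close to the line y = 2x.
points = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]
direction, kept_fraction = principal_direction(points)

# Project: each 2D point becomes one number, its position along the main axis.
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
coords = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
          for x, y in points]

print(direction, kept_fraction)
print(coords)
```

Here the data is written in 2 dimensions but nearly all of its spread lies along one direction, so one number per point keeps almost everything. That is the nominal-versus-effective gap at toy scale.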

Attention is information selection

The clearest case is attention, which Phase 4 (L40) treats in full. Strip it to the idea and it's a projection the model computes on the fly.

A transformer layer holds a representation for every token in the sequence. For each position, attention asks: of all the information available across the sequence, which parts are relevant to this position right now? It keeps a weighted blend of the relevant parts and ignores the rest. The weights are computed from the input, so the selection changes from input to input.

Diagram: all information (every token in the sequence) → selection (attention weights) → relevant information (weighted blend). Dashed branch labelled "near-zero weight": all information → ignored for now.

FIG 15.6. Attention as selection. Of all the information available, the model keeps a weighted blend of the relevant parts and lets the rest fall to near-zero weight. The weights are computed from the input, so the kept set changes per position and per input.

mechanism · attention as a learned, input-dependent projection

For each token, attention scores every token in the sequence, itself included, with a dot product (L12), runs the scores through a softmax to get weights that sum to 1, and outputs a weighted sum of the tokens' values. High weight means kept; near-zero weight means discarded. Training shapes the scoring so that the relevance patterns it produces lower the loss. The result is a soft, learned, input-dependent projection onto the information that matters for predicting the next token.
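The score, softmax, and blend steps can be sketched for a single position. This is a simplified illustration, assuming hand-written query, key, and value vectors; a real attention layer produces them with learned matrices and scales the scores, both of which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one position's query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Made-up vectors for a three-token sequence.
keys = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, output = attend([1.0, 0.0], keys, values)
print([round(w, 4) for w in weights])   # almost all weight on the first token
print([round(o, 4) for o in output])
```

Change the query to `[0.0, 1.0]` and almost all the weight moves to the second token. The same function with the same keys and values keeps different information, which is the difference from a fixed projection.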

This is why attention was the change that unstuck long-sequence modelling. Earlier architectures had to squeeze a whole sequence into one fixed summary and hope the right information survived. Attention lets the model choose, per position, what to keep.

Feature extraction as projection

"Feature extraction" sounds like a separate technique. It's the same operation under another name. L14 showed a CNN re-basing pixel space into edge space, then part space, then object space. Each step is a projection: it keeps the directions that correspond to a useful feature and drops the rest.

A face detector projects a million-pixel image down to "is there a face, and where". Almost all the pixel information is irrelevant to that question, and a good detector discards it early. The features are the surviving directions; the discarding is what makes the surviving directions meaningful.

Engineers project all the time

This goes well beyond deep learning, and you've almost certainly done it.

Sensor filtering. A low-pass filter on a noisy signal is a projection onto the low-frequency subspace: keep the slow-moving signal, drop the high-frequency noise. Signal processing just names the subspace in frequency terms.

Telemetry reduction. A board streaming 200 channels rarely needs all 200 to flag a fault. Project onto the dozen channels (or the dozen linear combinations of channels) that carry the fault signature and you cut bandwidth and storage without losing the alarm.

Anomaly detection. Project normal operation onto its natural subspace; anything with a large component outside that subspace is, by definition, not normal. The size of the discarded part becomes the anomaly score.

Recommendation. A recommender projects users and items into a shared low-dimensional space (matrix factorisation is literally this), so "will this user like this item" becomes a dot product (L12) in a few dozen dimensions instead of a lookup over millions.

The instinct is the same across all of them: find the subspace where the task lives, project onto it, ignore the rest.
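The anomaly-detection case above can be sketched with the same projection arithmetic. This is a toy under stated assumptions: three channels that normally move together, so the "normal" subspace is taken to be the single direction (1, 1, 1). The readings and the name `anomaly_score` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def anomaly_score(x, basis):
    """Length of the part of x that lies outside the normal subspace.

    Assumes `basis` is an orthonormal set of directions.
    """
    inside = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        inside = [p + coeff * bi for p, bi in zip(inside, b)]
    outside = [xi - pi for xi, pi in zip(x, inside)]   # the discarded part
    return math.sqrt(dot(outside, outside))

s = 1 / math.sqrt(3)
normal_subspace = [[s, s, s]]        # assumed: healthy channels rise and fall together

healthy = [2.0, 2.1, 1.9]
faulty = [2.0, 2.0, 6.0]             # third channel has drifted away

healthy_score = anomaly_score(healthy, normal_subspace)
faulty_score = anomaly_score(faulty, normal_subspace)
print(round(healthy_score, 3), round(faulty_score, 3))
```

The healthy reading sits almost entirely inside the subspace, so little is discarded and the score is small. The faulty reading has a large component the projection would throw away, and that discarded length is the alarm.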

The trade-off: capacity versus compression

Every projection sets a dial. Keep more dimensions and you preserve more information, at higher cost and higher risk of keeping noise and memorising. Keep fewer and you compress harder: the result is cheaper and more general, at the risk of throwing away something the task needed.

Capacity (L13) is the ceiling on what a representation can hold. Compression is the act of deliberately staying below that ceiling. The two pull in opposite directions, and choosing where to sit between them is a real design decision, not a default. Hold too much: expensive, noisy, prone to overfitting. Compress too hard: cheap and clean but missing the signal. The right point depends on the task and the budget, and a lot of model design is the search for it.

No projection is free. Naming what you gave up (which directions, which information) is the engineering discipline this lesson is built to install.

Compute spectrum: projection is how a model fits the tier

Projecting down is the lever that lets the same idea run from a microcontroller to a data centre. What changes across the spectrum is how aggressively you compress.

Microcontroller. Aggressive projection: a handful of hand-chosen or learned features, often after a frequency-domain projection of the raw sensor stream. Keeping the wrong 8 features means the model fails; the subspace choice is the design.

Mobile / edge. Embeddings of 128 to 512 dims, small latent spaces. Dimensionality reduction is used to keep on-device indexes small and search fast within a battery budget.

Workstation. Hidden states of a few thousand dims, projected up and down between layers and attention heads. The projections are learned, and there's room to keep more than the edge can.

Hyperscale. Very wide representations, but the effective subspace is still far smaller than the nominal width. Mixture-of-experts routing is itself a selection: send each token to a few experts and ignore the rest.

Same idea, different aggressiveness. Whether you're squeezing a classifier into 64 KB or training a 400B-parameter model, you're choosing a subspace and discarding the rest.

What this lesson does and doesn't do

It doesn't tell you how to find the best subspace; that's what training, and explicit methods like PCA, are for. This lesson is about what a projection is and why it shows up everywhere.

It doesn't carry the non-linear case in full. Real models fold space with nonlinearities (L14) before projecting, so the kept set can be a curved surface rather than a flat subspace. The intuition survives unchanged: keep some directions, drop the rest. The directions just aren't always straight lines.

What it does do is install a habit. When you see a representation get smaller, ask which information was kept, which was thrown away, and whether that choice fits the task. That question is the engineering instinct this lesson exists to build.

compression · what to carry forward

A projection keeps some information about a thing and discards the rest. It's a many-to-one, lossy, non-reversible map.
Losing information is usually the point: it removes noise, forces generalisation, saves compute, and makes hard problems tractable.
A subspace is the set of directions a projection keeps. Real data's useful structure usually lives in a subspace much smaller than the full space.
A projection onto the right subspace is lossy compression that preserves the structure the task needs (JPEG is the everyday example).
Embeddings, neural-network layers, latent spaces, and feature extraction are all projections with learned weights.
Attention is a soft, learned, input-dependent projection onto the information relevant to each position.
Dimensionality reduction (PCA and relatives) computes the best subspace to project onto explicitly.
The trade-off is capacity versus compression: keep more and pay more and risk noise; keep less and save but risk losing signal.

What you should be able to do now

Explain what a projection is and why it loses information, using the shadow picture.
Give a worked example of a many-to-one projection and say exactly what was discarded.
Say why discarding information is useful, in terms of noise, generalisation, compute, and tractability.
Define a subspace and explain why useful structure usually lives in a small one.
Connect embeddings, neural layers, feature extraction, and attention to the single idea of projection.
Describe attention as soft, learned, input-dependent information selection.
State the capacity-versus-compression trade-off and what's at stake on each side.

Retrieval practice

Write before you reveal. Trace mechanism; don't summarise.

L15 A colleague says: "compression always damages the model, so we should keep as many dimensions as possible." Push back from this lesson's view.

Compression onto the right subspace is the point, not the damage. A representation that keeps every dimension keeps the noise, the redundancy, and the surface detail of the training examples; that's expensive to store and multiply, and it pushes the model toward memorising rather than generalising. Deliberately dropping directions removes noise (the invariance idea from L7), forces the representation onto the structure shared across examples (which is what transfers, per L3), and cuts the d² compute cost named in L13 and L14. The honest framing is the capacity-versus-compression trade-off: keeping more preserves more information but costs more and risks noise and overfitting; keeping less is cheaper and more general but risks dropping signal. A projection always loses information; the question that matters is whether it kept the task-relevant information, not whether it kept the most information. "Keep everything" is the failure mode, not the safe default.

L15 A layer maps a 4096-dim hidden state to a 64-dim one. Where did the information in the other dimensions go, and when is that fine versus a problem?

It was discarded. The layer is a projection: it keeps each input vector's components along 64 learned directions and collapses everything orthogonal to those directions to nothing. The map is many-to-one, so many different 4096-dim states now map to the same 64-dim output, and the difference between them is gone and unrecoverable. That's fine when the discarded directions carried noise, redundancy, or variation irrelevant to the task: the 64 kept directions still span the subspace the task lives in, so nothing the next stage needs was lost. It's a problem when the 64 kept directions don't span the task-relevant subspace: then two inputs that should be treated differently have been merged, and no later layer can pull them apart because the information that separated them is no longer present. The whole game is choosing (through training) the 64 directions so that "what survived" and "what the task needs" line up.

L15 Explain attention as a projection. What's kept, what's discarded, and what makes it different from a fixed projection like dropping the z-coordinate?

For a given token, attention scores every token in the sequence with a dot product (L12), runs the scores through a softmax to get weights that sum to 1, and outputs a weighted blend of their values. Tokens with high weight are kept; tokens with near-zero weight are discarded for that position. So far it's a projection onto the information relevant to this token. The difference from a fixed projection like "drop z" is that the weights are computed from the input itself, so the kept subspace is not fixed in advance: it's learned during training, and it changes from position to position and from input to input. Dropping z always discards the same direction. Attention discards a different blend every time, chosen on the fly. That's why it's described as a soft (weighted, not all-or-nothing), learned (the relevance patterns came from training), input-dependent projection. The flexibility is what let attention solve the long-sequence problem that fixed-summary architectures couldn't.

↩ L7 L7 said a good representation is invariant to transformations that don't matter (lighting, phrasing, position). Connect invariance to projection. How does one produce the other?

Invariance is the discarding half of representation, and a projection is the mechanism that does the discarding. If you want a representation that doesn't change when the lighting changes, you find the direction (or subspace) along which lighting varies and project it out: keep everything else, set the lighting component to zero. After that projection, two images that differ only in lighting map to the same representation, which is exactly what "invariant to lighting" means. The same move builds invariance to phrasing in text or to small translations in images. This links straight back to why discarding is useful: an invariance is a deliberate many-to-one collapse, and L3's point was that collapsing away irrelevant variation is what forces the model onto the structure that generalises. So L7's "a good representation discards what doesn't matter" and this lesson's "a projection keeps some directions and drops the rest" are the same statement: invariance is the goal, projection is how you reach it, and the engineering question is making sure the directions you projected out really were the ones that didn't matter.

↳ ahead You'll meet PCA and dimensionality reduction soon. Predict what PCA has to compute, and why it connects to "effective dimensionality" from L13.

PCA has to compute the directions along which the data varies most, the principal components, and then project onto the top few of them. Concretely it finds an ordered set of orthogonal directions: the first captures the largest spread in the data, the second the largest spread left over after the first is accounted for, and so on. You keep the top k and drop the rest, which is choosing a k-dimensional subspace to project onto. This connects directly to effective dimensionality (L13): if the data's real variation is concentrated in a few hundred directions out of a nominal 4096, then the first few hundred principal components carry almost all the spread and the remaining components are nearly flat (near-zero variance). Keeping the top components keeps almost all the structure while dropping the near-empty directions, which is the "structure lives in a small subspace" claim made concrete. PCA is the explicit, computed version of the thing learned projections do implicitly: find where the data actually lives, and keep that.

Next station

You now have arrows, geometry, dimensions, the machines that bend space, and the operation that decides what to keep. The wall has taught geometry that's deterministic so far: a vector is here, a matrix sends it there, a projection drops these directions. The next move steps into uncertainty. Models don't just compute fixed outputs; they hold beliefs and assign probabilities. L16 puts the dice rack on the wall and starts the probability run.
