Lesson 18. Eighth station on the whiteboard wall. ~23 min read + cards + retrieval. Durability tier 1 (bedrock; entropy is the single number for how much uncertainty a distribution carries).
🌫️
Memory palace · Whiteboard wall · station 18
The fog machine, hissing over a network of paths past the funnel from L17. When the fog is thick, many routes stay plausible and you can't tell which one you'll take; when it clears, one obvious route remains. Keep the mapping: entropy = how much uncertainty remains; fog = uncertainty; visibility = predictability.
Core idea. A distribution describes what might happen. Entropy measures how uncertain you still are about what will happen. It is one number, a property of the whole distribution, and it is highest when every outcome is equally likely and zero when one outcome is certain.
Why this lesson exists
L17 gave you the distribution: the weights spread across all possible outcomes. Two distributions can both be valid and still feel completely different. Compare a coin weighted 99 to 1 against a fair coin at 50/50. Both are real distributions over two outcomes. One is almost a foregone conclusion; the other is the most unpredictable two-outcome situation there is.
That difference is real and worth measuring. You need a single number that says how much uncertainty a distribution carries, so you can compare "almost decided" against "wide open" on the same scale. That number is entropy. The 99/1 coin has very low entropy; the fair coin has the highest entropy a two-outcome distribution can have.
Uncertainty made visible
The first thing to fix is where the uncertainty lives. It belongs to the distribution, before the outcome is known, not to the eventual result.
A weather forecast of "60% rain, 40% dry" is uncertain whether or not it actually rains. Once the day plays out you know the answer, but the forecast itself, the distribution, carried uncertainty when it was made. A fault-prediction model that says a part has a "30% chance of failing this month" is uncertain regardless of whether the part fails. A recommender that spreads its weight across ten items you might want is more uncertain than one that puts almost all its weight on a single item.
Entropy measures exactly that: the uncertainty sitting inside the distribution, not the outcome that eventually lands. The same forecast has the same entropy on a day it rains and on a day it doesn't.
flowchart LR
O["observation, e.g. clear sky, stable pressure"]:::obs --> Y["highly predictable outcome: one outcome dominates"]:::sure
classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef sure fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 18.1. Low uncertainty. When the observation makes one outcome dominate, the distribution is peaked and there's little left to guess. Low entropy.
flowchart LR
O["observation, e.g. mixed, changeable conditions"]:::obs --> A["outcome A"]:::poss
O --> B["outcome B"]:::poss
O --> C["outcome C"]:::poss
O --> D["outcome D"]:::poss
classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef poss fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
FIG 18.2. High uncertainty. When many outcomes stay roughly equally plausible, the distribution is spread out and there's a lot left to guess. High entropy. The entropy is a feature of this fan of possibilities, not of whichever outcome finally happens.
Surprise
The way into entropy is a smaller idea: surprise. A rare event is surprising; a common one is not. If something you expected with 99% probability happens, you learned almost nothing, because you already knew it was coming. If something you gave 1% probability happens, that's a jolt; you've learned a lot.
Make that precise without an equation yet. The surprise of an outcome grows as its probability shrinks. A near-certain outcome carries almost no surprise. A near-impossible outcome, if it happens, carries a lot. Surprise here is a technical quantity, a measurable amount of information, not an emotion the system feels.
Now the definition you can carry. Entropy is the average surprise of a distribution: how surprised you should expect to be, on average, when the outcome is revealed. A distribution where one outcome is nearly certain has low average surprise, because the common, unsurprising outcome is what almost always shows up. A distribution where everything is equally likely has high average surprise, because whatever happens, you couldn't have called it.
The shape of uncertainty
Because entropy is a property of the whole distribution, you can learn to read it off the shape. A sharp peak on one outcome is low entropy. A broad, even spread is high entropy. The flatter the distribution, the higher the entropy, with the perfectly flat (uniform) distribution at the maximum for that number of outcomes.
flowchart TD
D["peaked distribution"]:::head --> A["outcome A · 0.94"]:::big
D --> B["outcome B · 0.03"]:::small
D --> C["outcome C · 0.02"]:::small
D --> E["outcome D · 0.01"]:::small
classDef head fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef big fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
classDef small fill:#161a22,stroke:#9aa3b2,color:#9aa3b2;
FIG 18.3. A peaked distribution. One outcome holds nearly all the weight, so the result is easy to call and little surprise remains. Low entropy.
flowchart TD
D["flat (uniform) distribution"]:::head --> A["outcome A · 0.25"]:::eq
D --> B["outcome B · 0.25"]:::eq
D --> C["outcome C · 0.25"]:::eq
D --> E["outcome D · 0.25"]:::eq
classDef head fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef eq fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
FIG 18.4. A flat distribution. Every outcome is equally likely, so nothing can be predicted better than chance and whatever happens is a full surprise. Maximum entropy for this number of outcomes.
FIG 18.5. The shape of uncertainty. The same four outcomes, arranged three ways. A peaked distribution carries little uncertainty (here about 0.4 bits); an even spread carries the most (a uniform distribution over four outcomes is exactly 2 bits, since log₂4 = 2). You can read entropy off the shape: flatter is higher.
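The two figures quoted in the caption can be checked directly. The sketch below is a minimal illustration, assuming the Shannon formula given later in this lesson; the helper name `entropy_bits` is made up for this example, and the weights are the ones shown in FIG 18.3 and FIG 18.4.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

peaked = [0.94, 0.03, 0.02, 0.01]    # weights from FIG 18.3
uniform = [0.25, 0.25, 0.25, 0.25]   # weights from FIG 18.4

print(f"peaked:  {entropy_bits(peaked):.2f} bits")   # a little over 0.4
print(f"uniform: {entropy_bits(uniform):.2f} bits")  # log2(4) = 2
```

Any other arrangement of weight over the same four outcomes lands between 0 and 2 bits, which is what "flatter is higher" means numerically.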
Compression, revisited
Phase 1 (L2) made a claim that can now be sharpened: prediction and compression are the same problem. Entropy is the hinge that joins them.
Compression works by giving short codes to likely things and long codes to rare things. That only pays off if some outcomes are more likely than others. If every outcome is equally likely (high entropy), there's no skew to exploit, every code ends up about the same length, and you can barely compress at all. If a few outcomes dominate (low entropy), the short codes cover most of the data and the file shrinks. Shannon's result makes this exact: entropy is the floor on the average number of bits per symbol any lossless compressor can achieve.
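One way to see the floor is to build an actual code and compare its average length with the entropy. The sketch below uses Huffman coding as the example compressor; that choice is an assumption for illustration and is not named in the lesson. The helper names are hypothetical, and the weights are the peaked and uniform distributions from FIG 18.3 and FIG 18.4.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    """Return the Huffman code length, in bits, for each symbol index."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def average_code_length(probs):
    """Expected bits per symbol under the Huffman code for probs."""
    lengths = huffman_code_lengths(probs)
    return sum(p * n for p, n in zip(probs, lengths))

peaked = [0.94, 0.03, 0.02, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

for name, dist in [("peaked", peaked), ("uniform", uniform)]:
    print(name, f"entropy={entropy_bits(dist):.2f}",
          f"huffman={average_code_length(dist):.2f}")
```

For the uniform case the code length meets the entropy exactly. For the peaked case the code is shorter on average than the 2-bit fixed-length code but still sits above the entropy, because a symbol-by-symbol code cannot spend less than one whole bit per symbol; coding longer blocks of symbols narrows that gap.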
flowchart LR
EN["entropy: how much uncertainty"]:::a <--> PR["prediction: how well you can guess"]:::b
PR <--> CO["compression: how few bits you need"]:::c
CO <--> EN
classDef a fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef b fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef c fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 18.6. One triangle, three views. Low entropy means the outcome is predictable, which means it compresses well. High entropy means it's hard to predict and hard to compress. They are three readings of the same underlying quantity: how much structure the distribution has.
So prediction reduces uncertainty in a measurable way. A model that predicts well is one whose distribution puts heavy weight on what actually happens, which is a low-entropy, low-surprise distribution about the outcome. Getting better at prediction, raising the predictability of the outcome, and lowering the entropy you face are the same move.
Entropy inside AI systems
Read a model's output as a distribution (L17) and entropy becomes a running readout of how sure the model is.
Take a language model. Give it "The capital of France is" and its next-token distribution is sharply peaked: almost all the weight lands on one token, and the rest of the vocabulary is nearly ruled out. That's a low-entropy prediction. Give it "Write the first line of a story about" and the next-token distribution is broad: thousands of tokens are plausible openers, weight is spread thin, and the entropy is high. The model isn't more or less capable in the two cases; the uncertainty that genuinely remains is different, and entropy measures it.
The same reading applies across AI. A classifier that puts 0.98 on one class has a low-entropy output (confident); one that splits 0.3/0.3/0.4 across three classes has high entropy (unsure). A recommender's entropy says whether it has a strong single guess or a wide field. In every case entropy quantifies the uncertainty remaining in the model's prediction, on a scale you can compare across inputs and across models.
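The classifier comparison can be made concrete. This is a small sketch, not a recipe: the lesson only says the confident classifier puts 0.98 on one class, so the split of the remaining 0.02 across two other classes is an assumption, and `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

confident = [0.98, 0.01, 0.01]  # assumed split of the leftover 0.02
unsure = [0.3, 0.3, 0.4]        # the three-way split from the text
ceiling = math.log2(3)          # maximum entropy for three classes

print(f"confident: {entropy_bits(confident):.2f} bits")
print(f"unsure:    {entropy_bits(unsure):.2f} bits")
print(f"ceiling:   {ceiling:.2f} bits")
```

The unsure output sits just under the three-class ceiling, while the confident one is close to zero, so the two predictions can be compared on one scale.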
The formal definition, lightly
With the intuition in place, the equation is just bookkeeping for "average surprise." This is Shannon entropy, the standard measure, written in bits when the logarithm is base 2:
H(X) = −Σᵢ pᵢ log₂ pᵢ
Read it piece by piece, no derivation needed. The term −log₂ pᵢ is the surprise of outcome i, measured in bits: a near-certain outcome (pᵢ close to 1) has surprise near 0, a rare one has large surprise. The pᵢ in front weights that surprise by how often the outcome actually occurs. The sum adds up "how often it happens, times how surprising it is when it does," across every outcome. That sum is the average surprise, which is the entropy.
Two checks that the formula matches the intuition. A certain outcome (p = 1) has surprise −log₂ 1 = 0 and contributes nothing, so a distribution with one sure outcome has entropy 0. A fair coin has entropy −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit, the most two outcomes can carry, while the 99/1 coin works out to −(0.99 log₂ 0.99 + 0.01 log₂ 0.01) ≈ 0.08 bits. The roughly 12-fold drop from 1 bit to 0.08 bits is the number behind "the fair coin is far more uncertain." Entropy is highest when probability is spread evenly and falls as it concentrates; what makes entropy large is not the presence of rare outcomes but the lack of any dominant one.
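The same checks can be run as code. The sketch below writes the formula as "probability times surprise, summed"; the function names are hypothetical and chosen only to mirror the wording above.

```python
import math

def surprise_bits(p):
    """Surprise of one outcome with probability p, in bits."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprise: each outcome's surprise weighted by its probability."""
    return sum(p * surprise_bits(p) for p in probs if p > 0)

certain = entropy_bits([1.0])           # one sure outcome
fair = entropy_bits([0.5, 0.5])         # fair coin
loaded = entropy_bits([0.99, 0.01])     # 99/1 coin

print(f"certain: {certain:.2f} bits")
print(f"fair:    {fair:.2f} bits")
print(f"99/1:    {loaded:.2f} bits")
print(f"ratio fair / 99-1: {fair / loaded:.1f}")
```

The ratio printed on the last line is the "roughly 12-fold" gap between the two coins.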
Why entropy matters
Entropy is one of the most useful measures of uncertainty ever written down, and it shows up wherever information is moved, stored, or predicted.
Communication. Entropy sets the limit on how few bits a message source needs, which is the foundation of digital communication.
Compression. Every lossless compressor (zip, for example) is chasing the entropy floor, and lossy formats such as JPEG and most audio codecs finish with an entropy-coding stage after discarding detail.
Prediction and machine learning. The standard training loss, cross-entropy, measures how far the model's distribution sits from reality in exactly these units; lowering it is lowering surprise on real data.
Scientific measurement. Entropy quantifies how much an experiment is expected to tell you, guiding what's worth measuring.
Information systems. Anomaly detection, channel design, and error correction all reason in terms of how much uncertainty is present and how much a signal removes.
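Cross-entropy can be sketched with the same ingredients: outcomes arrive according to a true distribution p, but surprise is scored with the model's distribution q. The example below is illustrative only. It assumes the true distribution is known, which in real training it is not (it is only sampled through data), and the names `cross_entropy_bits`, `truth`, `matched`, and `overconfident` are made up for this sketch.

```python
import math

def entropy_bits(p):
    """Shannon entropy of p in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average surprise in bits when outcomes follow p but are scored by q.

    Assumes q gives nonzero probability to every outcome p can produce.
    """
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.5, 0.5]            # a genuinely fair coin
matched = [0.5, 0.5]          # model that agrees with reality
overconfident = [0.99, 0.01]  # model that wrongly expects heads

print(f"entropy of truth:        {entropy_bits(truth):.2f} bits")
print(f"matched model:           {cross_entropy_bits(truth, matched):.2f} bits")
print(f"overconfident model:     {cross_entropy_bits(truth, overconfident):.2f} bits")
```

The matched model's cross-entropy equals the entropy of the truth, and the mismatched model pays more. Cross-entropy never drops below the entropy of the true distribution, which is why lowering it means moving the model's distribution toward reality.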
Entropy is not randomness, and not single-event uncertainty
Two distinctions keep the idea sharp. Entropy is not the same as randomness. Randomness is that outcomes aren't fixed in advance; entropy is how much uncertainty the distribution carries. A heavily loaded die is still random, yet its entropy is low because you can predict it well. Randomness is the condition; entropy is the amount.
Entropy is also not the uncertainty of a single event. A single outcome has a surprise (−log p); entropy is the average of those surprises across the whole distribution. Surprise is per-outcome; entropy is per-distribution. Keeping those apart is what lets you say a rare event is highly surprising while the distribution it came from has low entropy.
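The per-outcome versus per-distribution split shows up clearly in a simulation. The sketch below draws from the 99/1 coin, records the surprise of each draw, and averages; the sample size, seed, and function names are arbitrary choices for illustration.

```python
import math
import random

def surprise_bits(p):
    """Surprise of a single outcome with probability p, in bits."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprise over the whole distribution, in bits."""
    return sum(p * surprise_bits(p) for p in probs if p > 0)

def simulated_average_surprise(probs, n_draws=200_000, seed=0):
    """Draw outcomes from probs and average the surprise of what came up."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    return sum(surprise_bits(probs[i]) for i in draws) / n_draws

coin = [0.99, 0.01]

print(f"surprise of the common outcome: {surprise_bits(coin[0]):.3f} bits")
print(f"surprise of the rare outcome:   {surprise_bits(coin[1]):.3f} bits")
print(f"entropy of the coin:            {entropy_bits(coin):.3f} bits")
print(f"simulated average surprise:     {simulated_average_surprise(coin):.3f} bits")
```

The rare outcome is worth more than 6 bits of surprise on its own, yet the simulated average stays close to the coin's entropy of about 0.08 bits, because that outcome turns up in only about one draw in a hundred.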
Compute spectrum: entropy as a readout
Entropy is cheap to compute from a distribution you already have, so it travels across the whole spectrum as a diagnostic.
microcontroller
The entropy (or just the top probability) of a tiny classifier's output decides whether to trust the call or fall back to a safe default. One number, almost free.
mobile / edge
Per-token entropy gates on-device generation: low entropy, commit; high entropy, slow down, ask, or hand off to a larger model.
workstation
Entropy of predictions guides active learning (label the high-entropy examples first) and flags inputs the model is unsure about.
hyperscale
Cross-entropy is the training loss for frontier models, and its per-token average is a headline diagnostic during pretraining. Driving that number down across billions of tokens is most of what the compute is buying.
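Two of the uses above, gating on entropy and picking high-entropy examples to label first, can be sketched in a few lines. Everything specific here is an assumption for illustration: the 1-bit threshold, the function names `gate` and `rank_for_labelling`, and the example batch are not from the lesson, and a real system would tune the threshold on its own data.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of one predicted distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gate(probs, threshold_bits=1.0):
    """Commit when the prediction is peaked enough, otherwise defer.

    threshold_bits is a hypothetical tuning knob, not a recommended value.
    """
    return "commit" if entropy_bits(probs) <= threshold_bits else "defer"

def rank_for_labelling(batch):
    """Indices of the batch ordered from most to least uncertain."""
    return sorted(range(len(batch)),
                  key=lambda i: entropy_bits(batch[i]),
                  reverse=True)

batch = [
    [0.97, 0.01, 0.01, 0.01],  # peaked
    [0.25, 0.25, 0.25, 0.25],  # uniform
    [0.60, 0.30, 0.05, 0.05],  # in between
]

for probs in batch:
    print(gate(probs), f"{entropy_bits(probs):.2f} bits")
print("label first:", rank_for_labelling(batch))
```

On a microcontroller the same gate could use the top probability instead of the full entropy, as noted above, which avoids the logarithms entirely.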
What this lesson does and doesn't do
It doesn't build out information theory; mutual information, channel capacity, and the coding theorems are a field of their own, and this is the intuition they all rest on. It doesn't derive the entropy formula from axioms; the point here is to read what it measures, not to prove it's the only measure that could.
What it does do is give you a single, comparable number for uncertainty. Carry that, and "the model is confident here and unsure there," "this signal is compressible and that one isn't," and "this experiment is worth running" all become the same measurement.
compression · what to carry forward
Entropy is the average surprise of a distribution: how uncertain you are about the outcome before it's known. One number, a property of the whole distribution.
Surprise of an outcome grows as its probability shrinks; a certain outcome has zero surprise, a rare one has a lot.
A peaked distribution has low entropy (predictable); a flat, uniform distribution has the maximum entropy for its number of outcomes; a certain outcome has entropy 0.
Worked anchors: fair coin = 1 bit, 99/1 coin ≈ 0.08 bits, uniform over N outcomes = log₂N bits.
Entropy, prediction, and compression are one triangle: low entropy means predictable means compressible; entropy is the floor on bits per symbol.
A model's output entropy is a readout of its remaining uncertainty: "capital of France" is low entropy, an open creative prompt is high entropy.
Entropy is not randomness (a loaded die is random but low entropy) and not single-event surprise (entropy averages surprise over the distribution).
Shannon entropy H = −Σ p log₂ p, in bits, is "average surprise" written down; cross-entropy, the standard training loss, measures the same thing against reality.
What you should be able to do now
Say what entropy measures and why it's a property of the distribution, not the eventual outcome.
Explain why a 99/1 distribution has lower entropy than a 50/50 one, in terms of average surprise.
Recognise low, medium, and high entropy from the shape of a distribution.
Explain the link between entropy, prediction, and compression.
Read a language model's next-token entropy as its remaining uncertainty, and say why "capital of France" is low entropy.
Interpret each piece of H = −Σ p log₂ p without deriving it.
Distinguish entropy from randomness and from the surprise of a single event.
Retrieval practice
Write before you reveal. Trace mechanism; don't summarise.
L18 Explain why a 99/1 distribution has lower entropy than a 50/50 distribution, reasoning from average surprise rather than the formula.
Entropy is the average surprise you should expect when the outcome is revealed. For the 99/1 coin, the 99% outcome happens almost every time and carries almost no surprise (you already expected it), while the 1% outcome is very surprising but almost never occurs, so it rarely gets to contribute. Averaging "how often times how surprising," the common, unsurprising outcome dominates, and the average surprise is tiny, about 0.08 bits. You can predict "the 99% outcome" and be right almost always. For the 50/50 coin, neither outcome is more expected than the other, so every flip is a genuine surprise and there's no way to predict better than chance; the average surprise is as high as two outcomes allow, exactly 1 bit. So the skew in 99/1 is what lowers the entropy: a dominant, unsurprising outcome pulls the average down, whereas an even split keeps every outcome maximally surprising. The key is that entropy depends on the whole distribution's shape, not on the rare event in isolation; the 1% event is individually very surprising, but its rarity means it barely moves the average.
L18 Explain entropy using the fog-machine memory-palace object. What do the fog, the visibility, and the clearing correspond to?
The fog machine fills a network of paths, and the fog stands for uncertainty about which route the outcome will take. Thick fog means many paths stay plausible at once: you can't see which way things will go, lots of outcomes remain live, and that is high entropy. As the fog thins, fewer routes remain visible, until in clear air one obvious route stands out and the rest fall away, which is low entropy. So the amount of fog is the amount of uncertainty remaining before you know the outcome, which is exactly what entropy measures. Visibility maps to predictability: clear air (low entropy) means you can predict the route well; dense fog (high entropy) means you can't. Crucially, the fog is a property of the whole network of paths at that moment, not of the single route eventually taken, just as entropy is a property of the distribution, not of the outcome that lands. Gaining information (an observation that rules some paths out) is the fog clearing: it lowers entropy by making the route more visible.
↩ L2 Interleaved. L2 said prediction and compression are the same problem. Using entropy, explain why lower entropy generally improves compressibility.
Compression works by spending few bits on likely things and more bits on rare things: give the common symbols short codes and the rare ones long codes, and the total shrinks. That trade only pays off when some outcomes are more likely than others, which is exactly what low entropy means. In a low-entropy source a handful of outcomes dominate, so the short codes cover most of the data and the average code length drops well below the naive "one fixed-length code per symbol." In a high-entropy source (near-uniform), no outcome is more likely than any other, so there's no skew to exploit; every code ends up about the same length and there's almost nothing to compress. This is why L2 said prediction and compression are the same problem: predicting well means assigning high probability to what comes next, which is a low-entropy, skewed distribution, and that skew is precisely what a compressor turns into short codes. Shannon's source coding theorem makes the connection exact: the entropy of the source is the lower bound on the average bits per symbol any lossless compressor can reach. So a better predictor implies lower entropy implies better compression, all three the same underlying fact about how much structure the data has.
L18 A teammate says "high entropy just means the system is random." Correct them, and separately distinguish entropy from the surprise of a single event.
Randomness and entropy are different things. Randomness means the outcome isn't fixed in advance; entropy measures how much uncertainty the distribution carries. A heavily loaded die is fully random, every roll is a chance event, yet its entropy is low because one face dominates and you can predict it well. A fair die is also random but high entropy, because no face is favoured. So "random" is the condition and "entropy" is the amount; you can have randomness with low entropy. Calling high entropy "just randomness" loses the part that matters, which is how spread out the distribution is. The second distinction: entropy is not the uncertainty of a single event. A single outcome has a surprise, minus log of its probability, which is large for rare outcomes and zero for certain ones. Entropy is the average of those surprises across the entire distribution, weighted by how often each outcome occurs. So surprise is a per-outcome quantity and entropy is a per-distribution quantity. That is exactly why a rare event can be intensely surprising while the distribution it came from has low entropy: the rare event's big surprise is multiplied by its tiny probability, so it barely moves the average.
↳ ahead Modern AI spends enormous compute driving prediction error down. Using entropy, say precisely what "improving prediction" means, and why there's a limit to it.
Improving prediction means making the model's distribution put more weight on what actually happens, which lowers the average surprise the model incurs on real data. That quantity has a name, cross-entropy: it measures how far the model's predicted distribution sits from reality, in the same bits-of-surprise units as entropy, and training is the process of driving it down. So "a better predictor" is precisely "a lower-cross-entropy model," not a vague notion of smartness. There's a hard floor, though. Some of the uncertainty in the data is irreducible (the aleatoric uncertainty from L16): even a perfect model can't predict a genuinely fair coin better than 50/50, so the cross-entropy can't fall below the true entropy of the data itself. More compute and data shrink the gap between the model's cross-entropy and that floor (the reducible, epistemic part), but they can't push below the floor. This sets up the next stretch of the wall: if the achievable error has a floor and compute keeps lowering the gap to it, why does pouring in more compute keep paying off, and how much? That question, how prediction improves as you scale, is where Phase 2 heads next, toward optimisation and the compute-scaling intuition that closes the phase.
Next station
You now have a number for uncertainty. The model's job, stated cleanly, is to lower the surprise it faces on real data, the cross-entropy between its distribution and the world. How a model actually drives that number down is the next sketch: it walks downhill on the loss, one small step at a time, which is what gradients and optimisation are. Further along the wall, the phase asks why adding more compute keeps lowering the surprise a model can reach. L19 puts the slope and valley on the wall.