Lesson 16. Sixth station on the whiteboard wall. ~23 min read + cards + retrieval. Durability tier 1 (bedrock; probability is the language a system uses to reason when it cannot know the outcome in advance).
🎲
Memory palace · Whiteboard wall · station 16
The dice rack, standing beside the stack of sheets from L15. Each die shows a different possible future; no single die tells you which one happens. Read the whole rack together and you get a spread of possibilities with different weights. The claim to keep: probability = uncertainty made measurable.
Core idea. Probability is the language for reasoning under uncertainty. A prediction worth trusting reports how likely each outcome is; the single most likely answer is one reading of that fuller picture. Intelligence shows up as making good decisions while the outcome is still uncertain.
Why this lesson exists
The wall so far has been deterministic. A vector sits at a point (L11). A matrix sends it somewhere definite (L14). A projection drops a set of directions and keeps the rest (L15). Feed the same input twice and you get the same output twice.
Real systems don't get to live there. A projection throws information away, so the model is missing some of what it would need to be sure. Sensors are noisy. The future depends on things you never observed. Even a perfect representation of everything you can see leaves the outcome you care about partly open.
So the course needs a language for the gap between what a system knows and what's true. That language is probability. Dice and card tables are where most people first meet it, which makes it look like a topic about games. Its real job in engineering and AI is wider: it's how a system reasons when it cannot know the answer in advance, which is almost always.
Why uncertainty exists
Three sources, and they stack.
Incomplete information. You never observe everything. L15 made this concrete: a projection keeps a subspace and discards the rest, so by the time an input reaches the model, detail has already been dropped. Whatever lived in the discarded directions is now a source of doubt. The same happens at the sensor: a camera sees pixels, not the object; a thermometer reads one point, not the whole system.
Noise. Measurements wobble. The same temperature read twice gives two slightly different numbers. Noise means even the information you do keep is only approximately right.
The future isn't fixed by the present you can see. Two days with identical readings can end differently, because the outcome depended on something outside the readings. A model conditioned on the observation can't separate those two days; they look the same to it, so the honest output covers both.
flowchart LR
O["observation"]:::obs --> Y["one outcome known for sure"]:::sure
classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef sure fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 16.1. Certain knowledge. The fantasy case: an observation maps to exactly one outcome. A few engineered systems get close to this (a checksum, a lookup in a clean table), but it is the exception.
flowchart LR
O["observation incomplete · noisy"]:::obs --> A["outcome A"]:::poss
O --> B["outcome B"]:::poss
O --> C["outcome C"]:::poss
O --> D["outcome D"]:::poss
classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef poss fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
FIG 16.2. Uncertain knowledge, the normal case. One observation is consistent with several outcomes. The system can't collapse them to one, because the information that would separate them isn't present. Probability is how it tracks all of them at once.
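One way to make "one observation, several outcomes" concrete is to count what followed each observation in a log. The sketch below is illustrative only: the function name `conditional_distribution` and the weather records are invented for this example.

```python
from collections import Counter, defaultdict

def conditional_distribution(records):
    """Map each observation to the relative frequency of the outcomes seen with it."""
    counts = defaultdict(Counter)
    for observation, outcome in records:
        counts[observation][outcome] += 1
    return {
        obs: {outcome: n / sum(c.values()) for outcome, n in c.items()}
        for obs, c in counts.items()
    }

# Invented log: the same reading ("low pressure") preceded different outcomes.
records = [
    ("low pressure", "rain"), ("low pressure", "rain"),
    ("low pressure", "rain"), ("low pressure", "dry"),
    ("high pressure", "dry"), ("high pressure", "dry"),
]
beliefs = conditional_distribution(records)
print(beliefs["low pressure"])
```

The four "low pressure" days look identical to the system, so the honest belief for that reading covers both endings, weighted by how often each occurred.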
Two kinds of uncertainty
It helps to split uncertainty by where it comes from, because the two kinds behave differently when you try to reduce them.
mechanism · reducible vs irreducible uncertainty
Epistemic uncertainty comes from the model not knowing enough yet: too little data, a representation that's too small, a region of the input space it has barely seen. More and better data shrinks it. Aleatoric uncertainty comes from the situation itself: genuine randomness, or causes that were never observed. No amount of extra data removes it, because the outcome truly isn't determined by what's available. A good system drives epistemic uncertainty down and reports the aleatoric floor honestly instead of pretending it's zero.
A coin flip you can't see the physics of is mostly aleatoric: even a perfect model lands at 50/50. A rare disease the model has seen only twice is mostly epistemic: the doubt is in the model, and more cases would sharpen it. Most real problems mix the two, and telling them apart is half of uncertainty estimation.
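The coin case can be sketched numerically, under a simplifying assumption: treat doubt about the coin's heads-rate (the standard error of an estimate from n flips) as the epistemic part, and the spread of a single flip as the aleatoric part. The function names are made up for this illustration.

```python
import math

def estimate_spread(p: float, n: int) -> float:
    """Standard error of the estimated heads-rate after n flips (epistemic part)."""
    return math.sqrt(p * (1 - p) / n)

def outcome_spread(p: float) -> float:
    """Standard deviation of a single flip (aleatoric part); does not depend on n."""
    return math.sqrt(p * (1 - p))

p = 0.5  # assumed fair coin
for n in (10, 1_000, 100_000):
    print(f"n={n:>6}  estimate ±{estimate_spread(p, n):.4f}  single flip ±{outcome_spread(p):.2f}")
```

More flips keep shrinking the first number; the second never moves, which is the floor the paragraph above describes.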
Probability as a measure of belief
A probability is a number between 0 and 1 that says how strongly an outcome is expected. 0 means ruled out, 1 means guaranteed, and the interesting values sit in between. Read it as a dial of belief: 0.9 is "expected, but I've left room to be wrong"; 0.5 is "could go either way".
There are two honest ways to read that number, and they agree in practice. The frequency reading: a probability is the fraction of times the outcome happens over many similar trials. The belief reading: a probability is how confident you are given what you know, updated as evidence arrives. Weather forecasts use the first; a doctor weighing a diagnosis from symptoms uses the second. AI systems use both, depending on whether you're counting outcomes in data or scoring a hypothesis against evidence.
The course leans on the belief reading, because that's what a model is doing: holding a weighted opinion about outcomes it can't see, and revising it as inputs change. Uncertainty is just the flip side of that opinion: the more spread out the belief, the more uncertain the system is.
Outcomes, events, and the distribution
Three words make the rest precise. An outcome is one possible result (it rains tomorrow). An event is a set of outcomes you care about (it rains or snows). A distribution is the full assignment of probability across every outcome, and its values add up to 1 because something has to happen.
The distribution is the real object. A single prediction (the most likely outcome) is a summary of it. The shape carries everything useful: where the weight sits, how spread out it is, whether there are two strong candidates or one. A system that reports only the top outcome has thrown the shape away, which is the same loss-of-information move from L15, applied to its own belief.
FIG 16.3. The distribution is the answer. A real forecast (or a classifier's output) spreads probability across outcomes. The tallest bar is the most likely, not the certain, outcome. A certain prediction would be a single full bar and the rest at zero, which real systems almost never produce. Picking only the tallest bar discards the spread, and the spread is the uncertainty.
flowchart LR
PF["possible futures A · B · C · D"]:::poss --> ASSIGN[["assign weights"]]:::model --> RL["relative likelihoods 0.50 · 0.30 · 0.15 · 0.05 (sum 1)"]:::kept
classDef poss fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
FIG 16.4. A distribution turns a list of possible futures into relative likelihoods that sum to 1. The weights say which futures the system expects more, without ever promising one.
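The three terms (outcome, event, distribution) map onto a few lines of code. This is a minimal sketch using the weights from FIG 16.4; the outcome labels A to D are placeholders and the helper names are invented here.

```python
# Weights taken from FIG 16.4; the outcome labels are placeholders.
distribution = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}

def is_distribution(dist, tol=1e-9):
    """Every weight lies in [0, 1] and the weights sum to 1 (within tolerance)."""
    in_range = all(0.0 <= p <= 1.0 for p in dist.values())
    return in_range and abs(sum(dist.values()) - 1.0) <= tol

def event_probability(dist, event):
    """An event is a set of outcomes; its probability is the sum of their weights."""
    return sum(p for outcome, p in dist.items() if outcome in event)

def most_likely(dist):
    """The single-answer summary: the outcome with the largest weight."""
    return max(dist, key=dist.get)

print(is_distribution(distribution))
print(event_probability(distribution, {"B", "C"}))
print(most_likely(distribution))
```

`most_likely` returns one label and drops the other three weights, which is the summary-versus-shape distinction in miniature.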
Likelihood versus certainty
Two words that get blurred. Likelihood, in everyday use, is how probable an outcome is: a high-likelihood outcome carries a lot of the distribution's weight. Certainty is the extreme case where one outcome has probability 1 and the rest have 0. Almost nothing a real system predicts is certain; the useful skill is ranking outcomes by likelihood and acting on the ranking.
One honesty note for later. Statistics gives "likelihood" a narrower, technical meaning: the probability of the data you actually observed, read as a function of a hypothesis that might have produced it. That sense is the engine behind fitting models to data, and it returns when machine learning lands. For this lesson, the everyday meaning (how probable an outcome is) is enough.
Why predictions are probabilistic
Put the pieces together and the shape of an AI prediction follows. The input is incomplete and noisy. The target isn't fully determined by the input. The honest output is therefore a distribution over outcomes, not a single guaranteed one.
flowchart LR
I["input image · text · sensors"]:::vec --> M[["model"]]:::model --> P["probability distribution over classes or tokens"]:::kept --> AM(["argmax one answer, optional"]):::faint
classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
classDef faint fill:#161a22,stroke:#9aa3b2,color:#9aa3b2,stroke-dasharray:4 3;
FIG 16.5. What a model actually returns. The model maps the input to a probability distribution. Reading off the single highest-probability answer (the argmax) is an optional last step that throws away everything the distribution was telling you about the alternatives and the confidence.
This is literally what the machinery does. A classifier ends in a softmax (Phase 1) that turns raw scores into a distribution over classes. A language model produces, at every step, a distribution over the whole vocabulary, and generation picks or samples from it. The "answer" you see is a choice made on top of a distribution the model computed first.
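A plain-Python sketch of that last step, assuming three classes and made-up raw scores. Real frameworks ship their own softmax; this version only shows the arithmetic.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                        # hypothetical raw scores for three classes
probs = softmax(logits)                         # the distribution the model computed
answer = max(range(len(probs)), key=probs.__getitem__)  # argmax: the optional last step
print(probs, answer)
```

For these scores the distribution is roughly 0.66, 0.24, 0.10. The argmax reports class 0 and says nothing about the 0.24 runner-up.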
Confidence is not certainty
A model's confidence in an answer is the probability it assigned to that answer: 0.92 confidence means the top class got 0.92 of the weight. Confidence is useful and it is not the same as being right.
The check is calibration. A model is well calibrated when its confidence matches its accuracy: of all the times it says 0.70, it's correct about 70% of the time. A calibrated 0.92 is trustworthy. The trouble is that large neural networks are often overconfident: they can say 0.99 and be right only, say, 85% of the time, because nothing in plain training forces the numbers to mean what they claim. High confidence from an uncalibrated model is a number, not a guarantee.
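Calibration can be checked by grouping predictions by stated confidence and comparing each group's confidence with its accuracy. The sketch below uses equal-width bins and a count-weighted gap (often called expected calibration error). The function names, the bin count, and the data are assumptions made for this illustration.

```python
from collections import defaultdict

def reliability_table(confidences, correct, n_bins=10):
    """Group predictions by confidence bin; return (mean confidence, accuracy, count) per bin."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table.append((mean_conf, accuracy, len(rows)))
    return table

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    return sum(n / total * abs(conf - acc)
               for conf, acc, n in reliability_table(confidences, correct, n_bins))

# Made-up data: 20 predictions at 0.99 confidence, 17 of them right (85%).
confs = [0.99] * 20
hits = [True] * 17 + [False] * 3
print(expected_calibration_error(confs, hits))
```

Here the gap is 0.14: the model claims 0.99 and delivers 0.85. A well-calibrated model would score near zero.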
This is the mechanism under a failure you already met. L10 described models producing confidently wrong output that looks like capability. In this language, that's a model placing high probability on a fluent continuation that happens to be false: confident, and wrong, because its distribution was never anchored to truth. Confidence reports the shape of the model's belief, not the state of the world.
How modern AI reasons about uncertainty
Once you read every model output as a distribution, the field's tools line up.
Classification probabilities. The softmax vector is the model's belief over classes. Keep the whole vector and you can threshold, abstain, or pass the runner-up to a human.
Confidence scores. The top probability, used to decide whether to act, ask, or escalate. Only meaningful if the model is calibrated, which is why calibration gets measured separately.
Next-token prediction. The distribution over the vocabulary at each step. Temperature and top-k (the next lesson) are knobs on how sharply or loosely you sample from it.
Uncertainty estimation. Techniques that try to recover the epistemic part: ensembles (train several models, disagreement signals doubt), Monte Carlo dropout, Bayesian layers. The aim is a model that says "I don't know" where it genuinely doesn't, instead of guessing confidently.
Out-of-distribution inputs. When an input is unlike anything in training, the honest answer is high uncertainty. Detecting that case, rather than emitting a confident label anyway, is one of the harder open problems.
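Two of the tools above, keeping the whole vector to abstain and using ensemble disagreement as a doubt signal, can be combined in a small sketch. The thresholds, the disagreement measure (largest per-class gap between members), and the example vectors are all hypothetical choices for illustration, not a standard recipe.

```python
def ensemble_summary(member_probs):
    """Average the members' distributions and measure how much they disagree.

    member_probs: one probability list per ensemble member, all over the same classes.
    Returns (mean distribution, disagreement), where disagreement is the largest
    per-class gap between any two members (0 means full agreement).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n_members for c in range(n_classes)]
    disagreement = max(
        max(m[c] for m in member_probs) - min(m[c] for m in member_probs)
        for c in range(n_classes)
    )
    return mean, disagreement

def decide(member_probs, min_confidence=0.8, max_disagreement=0.2):
    """Return the predicted class index, or None to abstain or escalate."""
    mean, disagreement = ensemble_summary(member_probs)
    top = max(range(len(mean)), key=mean.__getitem__)
    if mean[top] < min_confidence or disagreement > max_disagreement:
        return None
    return top

agree = [[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]]   # hypothetical: members agree
split = [[0.95, 0.05], [0.20, 0.80], [0.90, 0.10]]   # hypothetical: one member dissents
print(decide(agree), decide(split))
```

The first case returns class 0; the second returns `None`, the code-level version of "I don't know".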
Deciding under uncertainty
Knowing the distribution is not the end; a system usually has to act. The pattern is the same everywhere: turn observations into beliefs, then turn beliefs into a decision that manages risk.
flowchart LR
O["observations partial · noisy"]:::vec --> B["beliefs probabilities over states"]:::kept --> D["decision act to manage risk"]:::model
D -.->|new observations| O
classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
FIG 16.6. The loop. Observations form beliefs (a distribution over what's true); the decision acts on those beliefs and the cost of being wrong; acting produces new observations that update the beliefs. This is the same input-representation-decision-feedback loop from L1, now with belief made explicit.
The decision weighs likelihood against cost. A 5% chance of engine failure and a 5% chance of a dropped video frame are the same probability and call for opposite responses, because the cost of being wrong is wildly different. Good decisions under uncertainty are about the distribution and the stakes together.
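The engine-versus-frame contrast reduces to an expected-cost comparison. In the sketch below every cost is an invented unit chosen only to show the effect, and `cheaper_choice` is a hypothetical helper name.

```python
def cheaper_choice(p_failure, cost_if_ignored, cost_of_acting):
    """Compare acting now against ignoring the risk; return the cheaper option."""
    ignore = p_failure * cost_if_ignored   # failure cost is paid only if the failure happens
    act = cost_of_acting                   # mitigation cost is paid for certain
    return ("act", act) if act < ignore else ("ignore", ignore)

# Same 5% probability, very different stakes. All costs are invented units.
engine = cheaper_choice(0.05, cost_if_ignored=1_000_000, cost_of_acting=2_000)
frame = cheaper_choice(0.05, cost_if_ignored=1, cost_of_acting=5)
print(engine[0], frame[0])
```

With these numbers the engine case says act and the frame case says ignore, although the probability passed in is identical.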
Uncertainty is unavoidable in real systems
Every engineering field that has to predict something runs on probability, usually without calling attention to it.
Weather. "70% chance of rain" means that across many days that look like this one, it rains on roughly 70% of them. Many forecasts come from ensembles: run the model many times with slightly perturbed starting conditions and read off how often each outcome appears.
Fault prediction. A board or a turbine reports a probability of failure within a window, not a verdict. Maintenance is scheduled against that probability and the cost of an unplanned outage.
Medical diagnosis. A test has a sensitivity and a specificity; a positive result shifts the probability of disease without settling it. The number that matters (probability of disease given a positive test) depends on how common the disease is, which is why screening rare conditions produces so many false alarms.
Autonomous systems. A self-driving stack holds a probability distribution over where each object is and where it's heading, fuses noisy sensors into that belief, and plans against it. The belief is the product; the steering is downstream.
Recommendation and risk. A recommender estimates the probability you'll engage with an item; a credit or insurance model estimates the probability of default or claim. Both rank by probability and set thresholds against business cost.
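The medical-diagnosis point is Bayes' rule at work, and a short calculation shows why prevalence matters so much. The test characteristics and prevalence values below are hypothetical, picked only to show the effect.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' rule."""
    true_pos = sensitivity * prevalence                # sick and flagged
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but flagged
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 99% sensitivity, 95% specificity.
rare = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.95)
common = posterior_given_positive(prevalence=0.20, sensitivity=0.99, specificity=0.95)
print(f"rare: {rare:.3f}  common: {common:.3f}")
```

With these assumed numbers a positive result means about a 2% chance of disease when the condition affects 1 in 1,000 people, and about 83% when it affects 1 in 5. The test is the same in both cases; only the prior changed.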
Why uncertainty can't be driven to zero
You can shrink uncertainty, and a lot of engineering is exactly that: better sensors, more data, bigger models cut the epistemic part. What you can't do is remove the aleatoric floor. When the outcome genuinely isn't fixed by the information available, no model and no extra data make it certain, because the missing piece was never an information problem in the first place.
So the goal isn't certainty. It's an honest distribution: epistemic uncertainty pushed as low as the data allows, and the irreducible part reported rather than hidden. A model that claims certainty it doesn't have is more dangerous than one that admits the spread, because every decision downstream trusts the number.
Compute spectrum: how much uncertainty machinery you can afford
Tracking uncertainty costs compute, so how much you do scales with the tier.
microcontroller
Usually a single point prediction plus a fixed threshold. A crude confidence (the top probability) is often all there's room for. The aleatoric floor is handled by conservative thresholds, not by modelling.
mobile / edge
Calibrated softmax probabilities and simple abstention ("ask the cloud if confidence is low"). Temperature scaling, a cheap post-hoc calibration step, fits here easily.
workstation
Room for ensembles or Monte Carlo dropout, so epistemic uncertainty can be estimated directly rather than guessed at. Useful where being wrong is expensive.
hyperscale
Large ensembles, Bayesian approaches, and uncertainty-aware training. The cost of a confident mistake at scale justifies spending compute to measure doubt.
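Temperature scaling, mentioned for the mobile / edge tier, is cheap because it is a single division before the softmax. The sketch below shows only the application step; it assumes the temperature has already been fitted on held-out data, which is not shown, and the logits and the value 2.5 are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T. T > 1 flattens the distribution, T < 1 sharpens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    top = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]                             # hypothetical overconfident scores
raw = softmax_with_temperature(logits)               # T = 1: ordinary softmax
softened = softmax_with_temperature(logits, 2.5)     # T assumed fitted elsewhere
print(max(raw), max(softened))
```

The top probability drops from about 0.98 to about 0.75 while the ranking of classes stays the same, so accuracy is untouched and only the stated confidence changes.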
What this lesson does and doesn't do
It doesn't teach you to compute probabilities by counting cases; combinatorics is a tool for later and isn't the point here. It doesn't settle the frequency-versus-belief debate; both readings are used, and they agree where it matters.
What it does do is reset the default. From here on, read every prediction as a distribution, ask whether the confidence is calibrated, and separate the doubt you can reduce from the doubt you can't. That habit is what the rest of the maths on this wall is built to support.
compression · what to carry forward
Probability is a number from 0 to 1 measuring how strongly an outcome is expected; it reads as either long-run frequency or degree of belief, and the two agree in practice.
Uncertainty exists because information is incomplete (projections and sensors discard it), measurements are noisy, and the outcome isn't fixed by what's observed.
Epistemic uncertainty (limited data or model) shrinks with more data; aleatoric uncertainty (inherent randomness or unobserved causes) does not.
A distribution assigns probability across all outcomes and sums to 1; it is the real output, and the single most likely answer is just a summary of it.
Predictions are probabilistic because the honest answer to an underdetermined question is a distribution.
Confidence is the probability a model assigns its answer; it equals trustworthiness only when the model is calibrated, and large networks are often overconfident.
Modern AI runs on distributions: softmax over classes, distributions over tokens, confidence scores, ensembles and Bayesian methods for uncertainty estimation.
The goal is a good decision under uncertainty, weighing likelihood against the cost of being wrong, not the removal of uncertainty.
What you should be able to do now
Explain why a system can't be certain even with a perfect representation of what it can see.
Read a probability as a measure of belief or of long-run frequency, and say when each reading fits.
Distinguish epistemic from aleatoric uncertainty and give an example of each.
Define a distribution and explain why the single most likely outcome is only a summary of it.
Explain why a classifier and a language model both output distributions, and what argmax discards.
Separate confidence from certainty, and say what calibration checks.
State why uncertainty can be shrunk but not eliminated, and why an honest distribution beats a false certainty.
Retrieval practice
Write before you reveal. Trace mechanism; don't summarise.
L16 A colleague says a good model should give one definite answer, and that a probability distribution is just hedging. Push back from this lesson's view.
The distribution is the honest output, and the single answer is a summary of it. The model's input is incomplete and noisy, and the outcome isn't fully determined by what it can see, so more than one outcome is genuinely consistent with the input. Reporting one answer doesn't make the others impossible; it just hides them. The spread carries real, usable information: how confident the model is, whether there's a close second outcome, whether to act or escalate. A decision-maker downstream needs that spread, because a 0.51 top answer and a 0.99 top answer call for different actions even though the label is the same. There's also a hard floor: aleatoric uncertainty means that for many questions a definite answer would be overconfident by construction, since the outcome isn't fixed by the available information. So a distribution isn't hedging; it's the model declining to claim certainty it doesn't have. Collapsing to one answer is the same information-discarding move from L15, applied to the model's own belief, and it's only safe when the cost of being wrong is low.
L16 Separate confidence from certainty with a worked example. What is calibration, and why does overconfidence matter in a real system?
Confidence is the probability the model assigns its chosen answer; certainty would be probability 1. A classifier that outputs 0.92 for "cat" is 92% confident, which is not the same as being right 92% of the time. Calibration is the test that ties the two together: a model is well calibrated when, across all the cases it labels with confidence 0.70, it is actually correct about 70% of the time. A calibrated 0.92 can be trusted as roughly a 92% chance; an uncalibrated one can't. The failure to watch for is overconfidence: large networks often emit 0.99 while being right only, say, 85% of the time, because ordinary training optimises for the right label, not for the probability meaning what it says. Why it matters: any decision rule built on the confidence inherits the error. A medical triage system that treats 0.99 as near-certain will under-refer cases it should have flagged; an autonomous system that trusts an overconfident "road clear" acts on a guarantee that isn't there. So confidence is only as useful as the calibration behind it, which is why calibration gets measured and corrected (for example by temperature scaling) separately from accuracy.
L16 Why is uncertainty unavoidable? Distinguish the two kinds and give a concrete example of each, then say what each one responds to.
Uncertainty is unavoidable because systems act on incomplete, noisy information about outcomes that aren't fully fixed by that information. It splits by source. Epistemic uncertainty is doubt in the model: it hasn't seen enough data, or its representation is too small, or the input lands in a region it barely covers. Example: a diagnostic model that has seen a rare disease only twice is unsure mostly because it lacks examples. Epistemic uncertainty responds to more and better data, a bigger model, or broader coverage; it shrinks. Aleatoric uncertainty is doubt in the situation: genuine randomness, or causes that were never observed, so the outcome truly isn't determined by what's available. Example: predicting a fair coin you can't measure the physics of stays at 50/50 no matter how much data you gather. Aleatoric uncertainty responds to nothing on the data side; it's the irreducible floor. Most real problems mix both, and the practical skill is reducing the epistemic part as far as the data allows while reporting the aleatoric part honestly instead of pretending it's zero.
↩ L10 L10 said models produce confidently wrong output that looks like capability (hallucination). Re-explain that failure using probability, confidence, and calibration.
A language model produces, at each step, a probability distribution over the vocabulary and then samples or picks from it. Hallucination is what happens when that distribution puts high probability on a continuation that is fluent and statistically likely given the training data, but false in the world. The model's confidence (the probability mass on that continuation) is high, so the output sounds assured. The problem is that this confidence was never anchored to truth: training rewarded predicting likely next tokens, not true ones, so the distribution reflects what text usually looks like, not what's correct. In calibration terms, the model is badly miscalibrated on factual claims it hasn't grounded: its stated confidence doesn't match its real accuracy on those claims. That's why L10 called the output "confidently wrong, looking like capability": it's confident because the probability is high, wrong because the probability tracks plausibility rather than fact, and convincing because fluency and confidence read to us like knowledge. The fixes from L10 follow naturally: grounding the distribution in retrieved facts or tool output gives the probabilities something true to track, and measuring calibration tells you when the confidence can be trusted.
↳ ahead The next lessons turn distributions into entropy and information. Predict what entropy should measure about a distribution, and why a uniform distribution should be the most uncertain.
Entropy should measure how spread out a distribution is, that is, how much uncertainty it carries on average, or equivalently how surprised you expect to be when the outcome arrives. A sharply peaked distribution (one outcome near probability 1) carries almost no uncertainty: you can predict the result and you're rarely surprised, so entropy should be near zero. A uniform distribution (every outcome equally likely) is the most uncertain case for a given number of outcomes: nothing is favoured, every result is equally surprising, and there's no way to bet better than chance, so entropy should be at its maximum. So entropy should rise as probability gets more evenly spread and fall as it concentrates, peaking at the uniform distribution and bottoming out at certainty. This is exactly the quantity the next lessons build, and it's why entropy becomes the natural measure of information and the backbone of the cross-entropy loss that trains classifiers and language models: training pushes the model's distribution toward the true one, which is the same as reducing the surprise it experiences on real data.
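The prediction can be checked against the standard Shannon definition of entropy, assuming that is the quantity the next lessons introduce. The three example distributions are chosen for illustration; the middle one reuses the weights from FIG 16.4.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

certain = [1.0, 0.0, 0.0, 0.0]
peaked = [0.50, 0.30, 0.15, 0.05]    # the weights from FIG 16.4
uniform = [0.25, 0.25, 0.25, 0.25]

for name, dist in (("certain", certain), ("peaked", peaked), ("uniform", uniform)):
    print(f"{name:>8}: {entropy(dist):.2f} bits")
```

Certainty gives 0 bits, the FIG 16.4 weights give about 1.65 bits, and the uniform case gives 2 bits, the maximum for four outcomes.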
Next station
The dice rack gives you a distribution, but it doesn't yet say how to read the whole landscape of outcomes or how to draw one from it. That's the next sketch: distributions as the shape of all possibilities, sampling as the draw, and temperature as the knob that reshapes the landscape before you draw. L17 pours balls through the funnel into its bins.