PHASE 2 · THE WHITEBOARD WALL

Distributions, sampling, and possible futures

Lesson 17. Seventh station on the whiteboard wall. ~23 min read + cards + retrieval. Durability tier 1 (bedrock; a distribution is the landscape of possibilities, and sampling is how a system draws one outcome from it).

Memory palace · Whiteboard wall · station 17

The funnel and bins, standing past the dice rack from L16. A clear funnel pours thousands of coloured balls into a row of bins. Some bins fill fast, others barely at all; the pattern they pile into is the distribution. Pulling one ball out is a sample. A valve on the funnel sets how widely the balls scatter, which is temperature. Keep two things: distribution = the landscape of possibilities; sampling = drawing one point from it.

Core idea. A probability tells you how likely one outcome is. A distribution tells you the whole landscape of possible outcomes and their weights. Modern AI reasons with the landscape: a language model computes a distribution over every possible next token, then draws one from it.

Why this lesson exists

L16 gave you the single probability: one number for how likely one outcome is. It also said the real object was the distribution, the weights spread across all outcomes. This lesson takes that object seriously, because it's the thing modern AI actually computes and acts on.

Here's the claim that surprises people. A language model doesn't pick the next word and then tell you. At every step it produces a probability distribution over every token in its vocabulary, tens of thousands of weights at once, and then a separate step draws one token from it. The fluent sentence you read is a sequence of draws from a sequence of landscapes. Understanding that landscape, and what it means to draw from it, is most of what "generative AI" means under the hood.

From one probability to a whole landscape

Start with the contrast. A single probability answers one yes-or-no-ish question: how likely is this one outcome? A distribution answers the whole question at once: for every possible outcome, how much weight does it carry? The weights are all non-negative and they add up to 1, because something has to happen.

flowchart LR
  E["one event<br/>e.g. it rains tomorrow"]:::obs --> P["a probability<br/>0.20"]:::sure
  classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef sure fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 17.1. A single probability. One event, one number. Useful, and narrow: it tells you nothing about the other outcomes or how the weight is spread.

flowchart LR
  PO["all possible outcomes<br/>sun · cloud · rain · storm · snow"]:::obs --> D["a probability distribution<br/>0.30 · 0.25 · 0.20 · 0.08 · 0.05 ... (sum 1)"]:::kept
  classDef obs fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 17.2. A distribution. Every outcome gets a weight, and the weights sum to 1. This is the landscape: the full spread of what could happen and how plausible each outcome is. The most likely outcome is just the tallest part of it.
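
The two rules above (every weight non-negative, all weights summing to 1) are easy to check mechanically. The following is a minimal sketch, not a library routine; the function name `is_distribution` and the weather weights are assumptions chosen for illustration.

```python
import math

def is_distribution(weights, tol=1e-9):
    """Return True if every weight is non-negative and the weights sum to 1."""
    values = list(weights.values())
    non_negative = all(w >= 0 for w in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# Hypothetical weights over five weather outcomes (illustrative numbers only).
weather = {"sun": 0.40, "cloud": 0.25, "rain": 0.20, "storm": 0.10, "snow": 0.05}

# Weights that sum to 1.2: each looks like a probability, but together
# they are not a distribution.
broken = {"sun": 0.70, "rain": 0.50}

print(is_distribution(weather))  # True
print(is_distribution(broken))   # False
```

The tolerance is there because floating-point sums are rarely exactly 1.0 even when the weights are meant to be.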

The shape of that landscape is the information. A sharp peak on one outcome means near-certainty. A broad, flat spread means wide-open uncertainty. Two strong peaks (a bimodal shape) means the situation could go one of two clearly different ways, which is very different from a single fuzzy guess in the middle. None of that survives if you keep only the top outcome.

Why the landscape beats a single number

A distribution carries everything a point answer throws away: the runner-up outcomes, how close they are, how confident the system is, whether the uncertainty is narrow or wide. L15 made the general point that collapsing a rich object to one summary discards information. A distribution collapsed to its single most likely outcome is that same loss, applied to a belief.

This matters because downstream decisions need the shape. A forecast that says "rain, 0.55" and one that says "rain, 0.95" both have rain as the top outcome, and they call for completely different plans. The number that separates them lives in the distribution, not in the label.
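
A small sketch of that loss, using two hypothetical forecasts. The helper name `top_outcome` and the non-rain weights are assumptions; only the 0.55 and 0.95 figures come from the text.

```python
def top_outcome(dist):
    """Collapse a distribution to its single most likely outcome."""
    return max(dist, key=dist.get)

# Two hypothetical forecasts that share the same top label.
hedged = {"rain": 0.55, "cloud": 0.30, "sun": 0.15}
confident = {"rain": 0.95, "cloud": 0.04, "sun": 0.01}

for name, dist in [("hedged", hedged), ("confident", confident)]:
    label = top_outcome(dist)
    # The label is identical for both; only the weight tells them apart.
    print(name, label, dist[label])
```

Both forecasts collapse to "rain". Anything that only receives the label cannot tell a coin-flip-plus-a-bit from near-certainty.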

Outcome space and the random variable

Two pieces of vocabulary make the rest precise. The outcome space is the complete set of outcomes the distribution is defined over: the six faces of a die, every word in a model's vocabulary, every image of a fixed size. A random variable is a quantity whose value is uncertain, described by a distribution over that space. "Tomorrow's high temperature" and "the next token" are random variables: not yet fixed, but with some values more probable than others.

The size of the outcome space shapes the problem. A coin has two outcomes. A language model has its whole vocabulary, often 50,000 or more, as the outcome space at every single step. An image generator's outcome space is astronomically large, which is part of why generating a convincing image is harder than sorting one into a category.
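
To put rough numbers on "astronomically large", here is a back-of-envelope sketch. It assumes a tiny 32×32 image with three colour channels and 256 levels per channel; those dimensions are an illustrative assumption, not something the lesson specifies. It counts decimal digits with a logarithm rather than building the huge integer.

```python
import math

def decimal_digits(choices_per_slot, slots):
    """Number of decimal digits in choices_per_slot ** slots."""
    return math.floor(slots * math.log10(choices_per_slot)) + 1

# One slot each: a coin flip and one next-token step (50,000-token vocabulary).
print(decimal_digits(2, 1))        # a 1-digit count of outcomes
print(decimal_digits(50_000, 1))   # a 5-digit count of outcomes

# Assumed tiny image: 32 x 32 pixels, 3 channels, 256 levels per channel.
image_slots = 32 * 32 * 3
print(decimal_digits(256, image_slots))  # thousands of digits
```

Even this thumbnail-sized image has an outcome count thousands of digits long, while one next-token step has a five-digit count.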

Distributions describe possible futures

Read the landscape forward in time and it becomes a map of possible futures. The present state doesn't fix a single next state; it makes some next states likely and others rare. The distribution is how a system holds all of those futures at once, each weighted by how plausible it is.

flowchart LR
  S["current state<br/>what we observe now"]:::vec --> F1["future A<br/>likely"]:::kept
  S --> F2["future B<br/>plausible"]:::kept
  S --> F3["future C<br/>unlikely"]:::poss
  S --> F4["future D<br/>rare"]:::poss
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
  classDef poss fill:#161a22,stroke:#9aa3b2,color:#9aa3b2;

FIG 17.3. Many possible futures. One present state branches into many futures with different weights. The distribution is the full set of branches and their probabilities, not a single chosen branch.

FIG 17.4. The funnel and the bins. Each ball is one sample. Drop thousands and they pile into a shape: tall bins are likely outcomes, short bins are rare ones. The pile's shape is the distribution. The highlighted ball is a single draw following its own path. The funnel's valve sets how widely the balls scatter, which is exactly what temperature does to a model's distribution before it samples.

Sampling: drawing one outcome from the landscape

Sampling is the act of drawing one outcome from a distribution so that each outcome turns up with probability equal to its weight. An outcome with weight 0.30 is drawn about 30% of the time; an outcome with weight 0.01 turns up about once in a hundred draws. One draw gives you one outcome. The distribution is the landscape; a sample is one point picked from it.

flowchart LR
  D["distribution<br/>weights over outcomes"]:::kept --> SMP[["sample<br/>(draw, weighted by probability)"]]:::model --> O["one outcome<br/>this draw"]:::vec
  classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;
  classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;

FIG 17.5. Sampling. The distribution goes in; one outcome comes out, chosen with probability equal to its weight. Run the draw many times and the outcomes reproduce the distribution; run it once and you get a single, partly-random result.

Two things follow, and keeping them apart is the point of this lesson. The distribution is fixed once the model has computed it. Which outcome a single draw returns is not fixed; it's random, weighted by the distribution. So the same distribution produces different outcomes on different draws, while still respecting the weights over many draws.
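
Both halves of that claim can be seen in a few lines. This is a minimal sketch using Python's standard `random.Random.choices`; the weather weights, the seed, and the helper names are assumptions for illustration.

```python
import random
from collections import Counter

def sample(dist, rng):
    """Draw one outcome with probability equal to its weight."""
    outcomes = list(dist)
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

def empirical_frequencies(dist, n_draws, rng):
    """Draw many times and report how often each outcome turned up."""
    counts = Counter(sample(dist, rng) for _ in range(n_draws))
    return {outcome: counts[outcome] / n_draws for outcome in dist}

# Hypothetical fixed distribution; it never changes between draws.
weather = {"sun": 0.40, "cloud": 0.25, "rain": 0.20, "storm": 0.10, "snow": 0.05}
rng = random.Random(17)  # seeded only so the sketch is repeatable

# Single draws: each is one outcome, and they need not agree with each other.
print([sample(weather, rng) for _ in range(5)])

# Many draws: the observed frequencies approach the weights.
freqs = empirical_frequencies(weather, 100_000, rng)
for outcome, weight in weather.items():
    print(outcome, weight, round(freqs[outcome], 3))
```

The distribution is the same object on every call; only the draw varies.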

Why generative AI samples, and why the same prompt varies

Now the headline use. At each step, a language model maps the prompt-so-far to a distribution over its whole vocabulary. Decoding is the step that turns that distribution into one token, and it has a choice: take the single highest-weight token every time (greedy decoding), or sample from the distribution.

flowchart LR
  PR["prompt<br/>tokens so far"]:::vec --> M[["model"]]:::model --> D["distribution over tokens<br/>a weight on every vocab entry"]:::kept --> SMP[["sample / decode"]]:::model --> T["one token<br/>appended, then repeat"]:::vec
  classDef vec fill:#1d2230,stroke:#38bdf8,color:#e6e8ee;
  classDef model fill:#1d2230,stroke:#f59e0b,color:#e6e8ee;
  classDef kept fill:#1d2230,stroke:#4ade80,color:#e6e8ee;

FIG 17.6. Next-token prediction. The model's output is the distribution over tokens. Decoding draws one token from it, appends it, and the loop runs again. The model computes the landscape; decoding chooses the point.

This is why the same prompt can produce different answers. If decoding samples, two runs draw different tokens from the same landscape, and small differences early compound into different sentences. Use greedy decoding (always take the top token) and the output becomes repeatable, at the cost of variety. The model didn't change between runs; the draw did.
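
The decode loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table that conditions only on the last token, whereas a real language model conditions on the whole prompt-so-far and covers its full vocabulary. The point is only the contrast between the two decoding choices.

```python
import random

# Hypothetical toy "model": last token -> distribution over the next token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(prompt, pick, max_steps=10):
    """Repeat: get a distribution, pick one token, append it."""
    tokens = list(prompt)
    for _ in range(max_steps):
        dist = TOY_MODEL[tokens[-1]]   # the model computes the landscape
        token = pick(dist)             # decoding chooses the point
        if token == "<end>":
            break
        tokens.append(token)
    return tokens

def greedy(dist):
    """Greedy decoding: always take the highest-weight token."""
    return max(dist, key=dist.get)

def make_sampler(rng):
    """Sampling: draw a token with probability equal to its weight."""
    def pick(dist):
        return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return pick

# Same prompt, 200 runs each way; collect the distinct outputs.
greedy_outputs = {tuple(generate(["the"], greedy)) for _ in range(200)}
sampler = make_sampler(random.Random(0))
sampled_outputs = {tuple(generate(["the"], sampler)) for _ in range(200)}

print(greedy_outputs)           # one distinct sentence
print(sorted(sampled_outputs))  # several distinct sentences
```

The lookup table is identical in both cases. Only the `pick` function differs, which is the sense in which the draw changed and the model did not.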

The knob that tunes this is temperature. Before the distribution is formed, the model's raw scores are divided by the temperature, then passed through the softmax. A temperature of 1 leaves the scores unchanged. A low temperature sharpens the distribution toward its top outcome, so decoding is more deterministic and more repetitive. A high temperature flattens it toward uniform, so decoding roams more widely and produces more varied, riskier output. As the temperature approaches zero, the distribution collapses onto its single most likely outcome and sampling becomes greedy decoding; because dividing by zero is undefined, a setting of exactly zero is typically handled as plain greedy decoding. In the funnel picture, temperature is the valve: it sets how widely the balls scatter before they land.
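
A minimal sketch of that recipe: divide the raw scores by the temperature, then apply the softmax. The function name and the three example scores are assumptions for illustration. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide raw scores by the temperature, then apply softmax."""
    if temperature <= 0:
        # Division by zero is undefined; callers should use greedy decoding instead.
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three tokens.
logits = [2.0, 1.0, 0.0]

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 nearly all the weight sits on the first token; at 10.0 the three weights are close to equal. The scores themselves never change, only how they are scaled before the softmax.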

Why randomness can be useful

Deliberate randomness buys things a deterministic system can't get. Variety: a creative-writing assistant that samples gives different drafts instead of the same one forever. Coverage: sampling several answers and keeping the best (used in code generation and reasoning) explores more of the space than a single greedy pass. Exploration: an agent that always takes its current best guess never discovers a better option, which is why reinforcement learning (L6) injects randomness on purpose. Estimation: drawing many samples is how engineers approximate quantities that are hard to work out directly; this is the Monte Carlo method.
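
The estimation use is the easiest to show in code. This sketch estimates the probability that two fair dice total at least 10, a case chosen because the exact answer (6 of 36 combinations) is available for comparison. The helper names and the seed are assumptions for illustration.

```python
import random

def estimate_probability(event, draw, n_draws, rng):
    """Monte Carlo estimate: fraction of random draws where the event holds."""
    hits = sum(1 for _ in range(n_draws) if event(draw(rng)))
    return hits / n_draws

def roll_two_dice(rng):
    """One sample: the total of two fair six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(17)  # seeded only so the sketch is repeatable
estimate = estimate_probability(lambda total: total >= 10, roll_two_dice, 100_000, rng)

# Exact value for comparison: (4,6) (5,5) (6,4) (5,6) (6,5) (6,6) out of 36.
exact = 6 / 36

print(round(estimate, 4), round(exact, 4))
```

For dice the exact count is easy. The same loop works unchanged when it is not, which is where the method earns its keep.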

One caution, because it's a common confusion. The intelligence is in the shape of the distribution the model learned, not in the coin flip that draws from it. Sampling from a well-trained distribution gives good, varied output. Sampling from a flat or badly-trained one gives noise. Randomness is a tool applied to a learned landscape; on its own it produces nothing useful.

Distributions across machine learning

Once you read outputs as distributions, the field lines up.

Classification. A classifier's softmax output is a distribution over classes. Keeping the whole vector lets you threshold, abstain, or report the runner-up, instead of committing to the top label.

Generative models. Generation is sampling from a learned distribution over outputs. Text models sample tokens; image models sample images. Diffusion models (treated later) start from random noise and denoise it step by step, so the random seed is what makes two images from the same prompt differ.

Uncertainty estimation. A distribution that's spread out is the model saying it isn't sure. Ensembles produce a distribution of answers whose disagreement measures doubt. The spread is the signal.

Probabilistic models. Whole families of models (mixture models, Bayesian networks, latent-variable models) are built directly as distributions you can sample from and update. Reasoning over them is reasoning over distributions.
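
As one concrete case from the list above, here is a sketch of what keeping the whole classifier vector allows: abstaining when the top weight is too low, and reporting the runner-up either way. The function name `decide`, the 0.7 threshold, and the class probabilities are all hypothetical.

```python
def decide(class_probs, min_confidence=0.7):
    """Use the full distribution: commit, or abstain and report the runner-up."""
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_p), (runner_up, _) = ranked[0], ranked[1]
    # Abstain when the top weight is below the (assumed) confidence threshold.
    decision = top_label if top_p >= min_confidence else "abstain"
    return {"decision": decision, "top": top_label, "runner_up": runner_up}

# Hypothetical softmax outputs for two inputs.
clear_case = {"cat": 0.90, "dog": 0.08, "fox": 0.02}
close_call = {"cat": 0.48, "dog": 0.45, "fox": 0.07}

print(decide(clear_case))
print(decide(close_call))
```

Both inputs have "cat" as the top label. Only the full vector shows that the second one is nearly a tie with "dog".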

Engineers reason about distributions, not certainties

Outside AI, the same habit runs through every field that has to plan against an uncertain future.

Weather. A forecast comes from an ensemble: run the model many times from slightly perturbed starting conditions and read off the distribution of outcomes. "70% chance of rain" is a summary of that distribution.

Demand forecasting. A supply chain plans against a distribution of possible demand, not a single number, so it can size inventory for the busy case without drowning in stock for the quiet one.

Anomaly detection. Build a distribution of normal behaviour; an anomaly is an outcome that sits in a low-probability region of it. The rarer the outcome under the model, the louder the alarm.

Recommendation. A recommender holds a distribution over what you might want next and ranks items by their weight, rather than betting everything on one guess.

Risk and reliability. Finance models a distribution of possible losses and reads risk off its tail. Reliability engineering models a distribution of failure times (the Weibull distribution is the workhorse) and schedules maintenance against it. In both, the tail of the distribution, the rare bad outcome, is the part that matters most.
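
The reliability case can be sketched by sampling failure times and reading off the early tail. This uses the standard library's `random.weibullvariate(alpha, beta)`, where `alpha` is the scale and `beta` the shape. The scale of 1000 hours, the shape of 1.5, and the helper name are illustrative assumptions, not figures from any real component.

```python
import random

def empirical_quantile(samples, q):
    """Value below which roughly a fraction q of the samples fall."""
    ordered = sorted(samples)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

rng = random.Random(17)  # seeded only so the sketch is repeatable

# Assumed failure-time model: Weibull with scale 1000 hours and shape 1.5.
SCALE_HOURS = 1000.0
SHAPE = 1.5
lifetimes = [rng.weibullvariate(SCALE_HOURS, SHAPE) for _ in range(20_000)]

median_life = empirical_quantile(lifetimes, 0.50)
early_failure = empirical_quantile(lifetimes, 0.05)  # the tail that matters

print(round(median_life), round(early_failure))
```

A maintenance schedule built on the median would miss the units that fail early. Planning against the 5% point means planning against the tail.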

A prediction is a distribution

Pull the thread together. When a system predicts, the honest output is a distribution over what could happen. A single predicted value (the most likely outcome, the average, the chosen token) is a summary of that distribution: useful for acting, but lossy. The systems that handle uncertainty well are the ones that keep the distribution around long enough to use its shape, and only collapse it to a single choice at the last moment, when an action is actually required.

Why this prepares you for entropy

You now have a way to represent uncertainty: the distribution. The obvious next question writes itself. If a distribution represents uncertainty, how much uncertainty does it contain? A sharp peak is almost no uncertainty; you can predict the outcome and rarely be surprised. A flat spread is a lot; every outcome is roughly as likely as any other and you can't predict well. That single quantity, how much uncertainty a distribution carries, is entropy, and it's the next sketch on the wall.

Compute spectrum: the cost of sampling

Computing a distribution and drawing from it costs compute, and how much you spend scales with the tier.

Microcontroller. Often greedy decoding or a single point prediction; sampling machinery is a luxury. The distribution may be computed but collapsed to the top outcome immediately to save cycles.

Mobile / edge. Temperature and top-k sampling (drawing only from the k highest-weight tokens) are cheap once the distribution exists, so on-device text generation can still feel varied. The cost is dominated by producing the distribution, not by the draw.

Workstation. Room to sample many outputs and keep the best, or to run small ensembles for an uncertainty estimate. Drawing several full generations is the main extra cost.

Hyperscale. Large-scale sampling (many candidates per query, big ensembles, diffusion with many denoising steps) is routine, because the value of better or more diverse output justifies the repeated draws.
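
Top-k sampling, mentioned for the mobile / edge tier, is a small amount of work on top of an existing distribution. A minimal sketch, assuming the common formulation: keep the k highest-weight tokens, renormalise so they sum to 1, then draw. The token weights and function names are hypothetical.

```python
import random

def top_k_filter(dist, k):
    """Keep the k highest-weight tokens and renormalise them to sum to 1."""
    kept = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(weight for _, weight in kept)
    return {token: weight / total for token, weight in kept}

def sample_top_k(dist, k, rng):
    """Draw one token from the renormalised top-k distribution."""
    filtered = top_k_filter(dist, k)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]

# Hypothetical next-token distribution.
next_token = {"mat": 0.50, "sofa": 0.30, "roof": 0.15, "moon": 0.05}

filtered = top_k_filter(next_token, 2)
print(filtered)  # only "mat" and "sofa" remain, rescaled to sum to 1

rng = random.Random(0)
draws = {sample_top_k(next_token, 2, rng) for _ in range(1000)}
print(sorted(draws))  # never "roof" or "moon"
```

The filter is a sort and a sum over weights that already exist, which is why the draw is cheap compared with producing the distribution in the first place.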

What this lesson does and doesn't do

It doesn't catalogue named distributions or their formulas; the Gaussian, the categorical, and friends are tools you'll reach for when a specific problem needs them, and their shapes matter more than their equations here. It doesn't cover how a model learns the right distribution; that's training, several phases away.

What it does do is fix the mental model: a model's output is a landscape, sampling is one draw from it, and temperature reshapes the landscape before you draw. Carry that, and next-token prediction, image generation, and uncertainty estimation all read as the same move.

compression · what to carry forward

A probability is the weight on one outcome; a distribution is the weights on all outcomes at once, summing to 1. The distribution is the real object.
The shape of a distribution (peaked, flat, bimodal) carries the spread, the alternatives, and the confidence that a single number throws away.
The outcome space is the full set of possible outcomes; a random variable is an uncertain quantity described by a distribution over that space.
Sampling draws one outcome with probability equal to its weight. One draw is one outcome; many draws reproduce the distribution.
A language model computes a distribution over its whole vocabulary at each step; decoding draws one token (greedy takes the top, sampling draws by weight).
The same prompt can give different answers because decoding samples; temperature reshapes the distribution first (low = peaked and repeatable, high = flat and varied, zero = greedy).
Randomness is useful for variety, coverage, exploration, and estimation, but the intelligence is in the learned distribution, not in the draw.
A prediction is a distribution; a single predicted value is a lossy summary of it. Entropy, next, measures how much uncertainty the distribution holds.

What you should be able to do now

State the difference between a single probability and a distribution, and why the distribution carries more.
Define the outcome space and a random variable, and identify them for a coin, a die, and an LLM.
Explain what sampling is and why one draw differs from the distribution it came from.
Describe what a language model actually outputs at each step and what decoding then does.
Explain why the same prompt can give different answers, and what temperature changes.
Give a reason randomness is useful and say why randomness is not the same as intelligence.
Explain why a distribution sets up the entropy question that comes next.

Retrieval practice

Write before you reveal. Trace mechanism; don't summarise.

L17 A colleague says "an LLM predicts the next word." Make that precise using this lesson. What does the model actually produce, and where does the single word come from?

The model doesn't produce a word; it produces a probability distribution over its entire vocabulary, a weight on every token, that sums to 1. That distribution is the actual output of the forward pass. A separate decoding step then turns the distribution into one token: greedy decoding takes the single highest-weight token, while sampling draws a token at random with probability equal to its weight. The visible word is one draw (or one argmax) from the landscape, not the thing the model computed. This is why the same prompt can produce different words on different runs when sampling is used: the distribution is the same, but the draw isn't. It's also why temperature changes the output without changing the model: temperature divides the raw scores before the softmax, sharpening the distribution at low values (more repeatable) or flattening it at high values (more varied), and at temperature zero the draw collapses to greedy decoding. So "predicts the next word" is shorthand for "computes a distribution over all possible next tokens, then a decoding step selects one."

L17 Distinguish three things that get blurred: a single probability, a distribution, and sampling. Why does keeping them separate matter?

A single probability is one number: the weight on one outcome (rain has weight 0.20). A distribution is the whole set of weights across every outcome in the outcome space, all non-negative and summing to 1; it's the landscape, and the single probability is one point on it. Sampling is an action, not a number: drawing one outcome from the distribution so that each outcome appears with probability equal to its weight. Keeping them separate matters because they sit at different stages. The distribution is what the model computes and holds, its full belief about what could happen. Sampling is what turns that belief into one concrete output when the system has to act. Conflating the distribution with a single probability makes you think the model only knows about one outcome, when it actually weighs all of them. Conflating the distribution with sampling makes you think the model "decided" one answer, when really it computed a landscape and a separate, partly-random step picked a point. Most confusion about generative AI ("why did it say something different this time", "is it making things up") clears up once you see that the model produces a distribution and decoding draws from it.

L17 Give two reasons deliberate randomness is useful in an AI system, then explain why randomness is not the same as intelligence.

Useful reasons (any two): variety, so a generator gives different drafts rather than the same output every time; coverage, sampling several candidates and keeping the best explores more of the space than one greedy pass, which helps in code and reasoning tasks; exploration, an agent that always takes its current best guess never finds a better option, so reinforcement learning injects randomness on purpose; estimation, drawing many samples (Monte Carlo) approximates quantities that are hard to compute directly. Why randomness isn't intelligence: the intelligence lives in the shape of the distribution the model learned, which encodes what's plausible given the input. Sampling is just a draw from that shape. Draw from a well-trained distribution and you get good, varied output; draw from a flat or badly-trained distribution and you get noise. The randomness contributes diversity, not competence. So randomness is a tool applied to a learned landscape; the landscape is where the knowledge is, and a fair coin with no distribution behind it produces nothing useful. Treating the draw itself as the smart part gets the mechanism exactly backwards.

↩ L8 L8 said a token is a discrete unit drawn from a fixed vocabulary. Connect that to this lesson's outcome space and next-token distribution.

The fixed vocabulary from L8 is exactly the outcome space for next-token prediction. At each step, the random variable is "the next token," and the model assigns a probability to every token in that vocabulary, producing a distribution over the whole outcome space that sums to 1. So the vocabulary size sets the size of the outcome space: a 50,000-token vocabulary means a 50,000-way distribution computed at every step. Decoding then samples or argmaxes one token from it, and that token is appended and fed back in, so the next step's distribution is conditioned on it. This also shows why tokenisation (L8) shapes the whole probabilistic picture: the choice of what counts as a token decides what outcomes the model is even allowed to put weight on. A rare word split into three subword tokens is three draws from three distributions, not one; a common word that's a single token is one draw. The geometry from earlier lessons and the probability from this one meet here: the model maps a sequence of token embeddings to a distribution over the same token vocabulary, step after step.

↳ ahead The next lesson measures how much uncertainty a distribution holds (entropy). A model's next-token distribution can be peaked or flat, and temperature reshapes it. Predict how entropy changes as you raise the temperature, and what that means for the output.

Raising the temperature flattens the distribution toward uniform, which spreads the weight more evenly across tokens. A more even spread means more uncertainty about which token comes next, so entropy goes up. Lowering the temperature sharpens the distribution toward its top token, concentrating the weight, so there's less uncertainty and entropy goes down; at temperature zero the distribution is effectively a single spike and the entropy of the choice is about zero, because the outcome is determined. In output terms: high temperature, high entropy means varied, surprising, sometimes incoherent text, because the model is drawing from a wide range of plausible-and-less-plausible tokens. Low temperature, low entropy means safe, repetitive, predictable text, because it keeps picking near the top. So temperature is, quite literally, a dial on the entropy of the distribution you sample from, which is why the next lesson's measure of uncertainty connects straight back to the knob you met here. Entropy will make "how much uncertainty" precise, and you'll see it is highest for the uniform distribution and zero for a certain one.

Next station

You can now represent uncertainty as a distribution and draw from it. The question that opens the next sketch is the obvious one: if a distribution represents uncertainty, how much uncertainty does it actually contain? Answering that with a single number, and seeing why that number is the natural measure of information, is what comes next. L18 switches on the fog machine over the network of paths.
