PHASE 1 · FOUNDATIONS OF INTELLIGENCE

04 / 78

Emergence, thresholds, and capability formation

Lesson 4. Phase 1: Foundations of intelligence. ~24 min read + cards + retrieval. Durability tier 1 (bedrock).

Memory palace · Bench · station 4

The dust suspended in sunlight above the bench. Hidden structure becoming visible when the light catches it; gradual accumulation crossing a visibility threshold; complex behaviour built from many tiny interactions. Revisited at station 74.

Core idea. Emergence is what happens when continuous change inside a scaling system pushes some externally measured capability across a threshold, making behaviour look sudden even though the underlying mechanism never was.

Why this lesson exists

A neural net trained to predict the next token can do 3-digit addition. Smaller nets in the same family cannot. The capability arrived somewhere along the scaling curve. Looked at from outside, the model "learned arithmetic" at a particular size.

Two reactions are common. The first is mystical: the model crossed a hidden boundary and acquired something the smaller siblings lacked. The second is dismissive: it's an evaluation trick and the whole effect is benchmark theatre.

Both skip the mechanism. The mechanism is what this lesson installs.

Two distinct things bundled under one word

Call a behaviour emergent when it is absent at smaller scale, present at larger scale, with a visibly sharp transition between the two. The sharpness lives on whatever axis you happen to be measuring. The question is what's happening inside.

The first thing is real mechanism change at scale. The model's internal state crosses a point where some algorithm becomes implementable in the weights, or where optimisation settles into a different basin of the loss surface, or where a representation gets sharp enough to support a downstream subskill. The capability appears because something inside the system genuinely changed.

The second is measurement-shaped sharpness. The underlying capability rose smoothly, but the benchmark scoring is thresholded (a 4-digit answer is either right or wrong, no partial credit), so the score stays at zero until the partial capability is good enough to clear the bar, then jumps. Same internal trajectory, different visible curve.

Both happen. Both are real. Treating them as the same thing is the source of most confusion in this area.
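
The second case can be written down in one line. This is a sketch under an assumption the lesson does not state: each of the n answer tokens is correct independently with the same probability p.

```latex
\[
\text{ExactMatch}(p, n) = p^{\,n},
\qquad
\log \text{ExactMatch}(p, n) = n \log p .
\]
```

With n = 4, p = 0.5 gives about 0.06 and p = 0.9 gives about 0.66. A steady rise in p shows up as a long flat stretch followed by a steep climb on the exact-match axis, while the log of the same score is linear in log p.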

A picture from physics: percolation

Drop conducting beads at random into an insulating substrate. At low density, the beads form isolated clusters and current cannot get across. Past a critical density, a connected path threads the substrate end-to-end and the substrate conducts. The transition is sharp on the conductance axis. The density that produces it is continuous, and the mechanism (beads touching beads) is fully understood.
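
The bead picture can be tried out numerically. The sketch below uses site percolation on a square grid as a stand-in for beads in a substrate; the grid size, trial count, seed, and function names are all illustrative assumptions, not anything from the lesson.

```python
import random
from collections import deque

def spans(grid):
    """Return True if occupied cells connect the top row to the bottom row."""
    n = len(grid)
    queue = deque((0, c) for c in range(n) if grid[0][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True
        # Beads touch only through shared edges (4-neighbourhood).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def spanning_fraction(density, size=20, trials=50, seed=0):
    """Fraction of random grids at this bead density that conduct end-to-end."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        grid = [[rng.random() < density for _ in range(size)] for _ in range(size)]
        hits += spans(grid)
    return hits / trials

if __name__ == "__main__":
    for density in (0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8):
        print(f"density={density:.2f}  spanning fraction={spanning_fraction(density):.2f}")
```

Density is swept in even steps, yet the spanning fraction should sit near zero at the low end and near one at the high end, with the changeover concentrated in a narrow band. On a grid this small the band is blurred; it narrows as the grid grows.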

A scaling neural net has something of this flavour. As parameters, data, and optimisation steps increase, the set of representations the network can form expands. At some scale, the representation needed for a particular subskill fits within the available capacity. Below that scale, the optimiser cannot fit the subskill no matter how long it runs; above it, the optimiser can find it. Reachability is binary at the algorithm level. The capacity that produces reachability is continuous.

The analogy is a framing device and nothing more. It does not predict where a given threshold sits for a given subskill. The point is that "continuous input, discontinuous observable" is a known engineering shape, not something specific to AI. Physicists call the general pattern a phase transition.

Hardware example: branch predictors

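A branch predictor with 1 bit of history cannot catch every pattern of period 4. On a pattern such as taken, taken, not-taken, not-taken it sits at roughly 50% accuracy, because the last outcome alone does not say where in the cycle the branch is. Add history until every position in the pattern has a distinct recent past and the predictor goes from stuck to nearly perfect on that pattern class in a single step. How many bits that takes depends on the pattern (two for the example above, three for taken, taken, taken, not-taken), but it is never more than the period. The history grew linearly; the observable performance jumped at a specific point.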

Continuous resource change, threshold-shaped capability appearance, fully understood mechanism. This is a threshold effect running on silicon, in a budget of a few hundred bits, and it's been shipping in CPUs for decades.
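
A toy simulation makes the jump visible. This is a minimal sketch, assuming a table of 2-bit saturating counters indexed by the last few branch outcomes; it is not a model of any particular CPU, and the function name and parameters are illustrative.

```python
def steady_state_accuracy(pattern, history_bits, warmup=64, measured=400):
    """Accuracy of a toy history-indexed predictor on a repeating branch pattern.

    pattern: string of 'T' (taken) / 'N' (not taken), repeated forever.
    history_bits: how many previous outcomes index the counter table.
    """
    table = [2] * (1 << history_bits)   # 2-bit counters, start "weakly taken"
    mask = (1 << history_bits) - 1
    history = 0                         # recent outcomes packed as bits
    correct = 0
    for i in range(warmup + measured):
        taken = pattern[i % len(pattern)] == "T"
        predicted_taken = table[history] >= 2
        if i >= warmup and predicted_taken == taken:
            correct += 1
        # Train the counter, saturating at 0 and 3.
        if taken:
            table[history] = min(3, table[history] + 1)
        else:
            table[history] = max(0, table[history] - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / measured

if __name__ == "__main__":
    for pattern in ("TTNN", "TTTN"):
        for bits in range(1, 5):
            acc = steady_state_accuracy(pattern, bits)
            print(f"pattern={pattern}  history_bits={bits}  accuracy={acc:.2f}")
```

In this toy model the accuracy is flat until the history is long enough to tell the positions in the pattern apart, then steps straight to 1.0. The exact bit count where the step lands depends on the pattern: two bits for TTNN, three for TTTN.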

Hardware example: caching hierarchies

A cache slightly too small for a workload's working set thrashes. Hit rate is low. Add a few percent of capacity, crossing the working set size, and hit rate climbs from low to near-complete in a narrow band. Capacity changed smoothly. Observed performance changed sharply.
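
The cliff is easy to reproduce. The sketch below assumes an LRU replacement policy and a workload that loops over its working set in order, which is the worst case for LRU; real workloads have softer edges. The function name and parameters are illustrative.

```python
from collections import OrderedDict

def lru_hit_rate(capacity, working_set, passes=50):
    """Hit rate of an LRU cache on a workload that loops over `working_set` keys."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for key in range(working_set):
            accesses += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)          # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict least recently used
    return hits / accesses

if __name__ == "__main__":
    for capacity in (90, 95, 98, 99, 100, 101, 110):
        rate = lru_hit_rate(capacity, working_set=100)
        print(f"capacity={capacity}  hit rate={rate:.2f}")
```

With a working set of 100 keys, a capacity of 99 gives a hit rate of 0.0 because every key is evicted just before it is needed again. A capacity of 100 misses only on the first pass.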

Same shape as the predictor example. Same shape as the percolation example. Same shape, it turns out, as much of what gets reported as "emergent" in scaling neural nets.

The LLM case

In-context learning is a worked example. A small transformer effectively cannot use prompt examples to perform tasks it wasn't directly trained on. A large transformer can. Somewhere along the scaling curve, the network develops attention patterns that route information from earlier in the context to influence later predictions in task-specific ways. The capacity to encode that routing in the weights, and the optimisation signal that pushes the weights toward it, both grow with scale. Below a certain combined budget, the routing isn't representable in the available capacity or isn't reachable from the optimiser's starting point. Above it, both conditions clear. The behaviour appears.

Chain-of-thought prompting has the same shape. Small models cannot meaningfully use the extra inference-time compute a reasoning chain offers, because they don't have the internal abstractions needed to chain steps reliably. Larger models do. The capability is gated by what the representations can support, not by anything in the prompt itself.

Optimisation basins

Picture the loss surface as terrain. A smaller model's parameter space supports a smaller set of distinct minima the optimiser might end up in. A larger model supports more, and qualitatively different, basins. Some of those basins correspond to algorithmic strategies smaller models had no room for. Scaling does not only make existing minima deeper. It adds basins that weren't reachable before.

The right question becomes not "did the model learn X" but "did the optimiser, given this scale and this data, fall into the optimisation basin that implements X." Sometimes the basin exists at the new scale and the optimiser misses it. Sometimes the basin only exists past a particular scale. Sometimes the basin is reachable and used, but the visible capability stays hidden behind a thresholded measurement. All three look the same from outside if you're squinting at a benchmark score.
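
The "basin exists but the optimiser misses it" case can be shown on a one-dimensional toy. The loss function below is a made-up double well, not a neural network loss, and the learning rate and step count are arbitrary choices.

```python
def loss(x):
    """Hypothetical double-well loss: a deeper basin on the left, a shallower one on the right."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    """Derivative of `loss` with respect to x."""
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, lr=0.01, steps=5000):
    """Plain gradient descent from starting point x; returns the final position."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

if __name__ == "__main__":
    for start in (-1.5, -0.5, 0.0, 0.2, 0.5, 1.5):
        end = descend(start)
        print(f"start={start:+.2f}  end={end:+.3f}  loss={loss(end):+.3f}")
```

The procedure is identical on every run, but the starting point decides which basin it ends in. Runs that start to the right of the hump near x = 0.08 settle in the shallower basin even though the deeper one exists. Reading only the final loss would not reveal which of the three situations in the paragraph above applies.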

RL strategy formation

AlphaGo's smaller policy networks played positionally reasonable Go without surprising moves. The full AlphaGo system that faced Lee Sedol in 2016 played moves human masters had not produced in tournament play, including move 37 of game 2. AlphaGo Zero and later AlphaZero, with larger networks and richer self-play data, went on to play further moves outside established human practice. The strategy space the larger policy networks could represent included plans the smaller networks couldn't. The optimisation procedure (self-play, MCTS-guided policy improvement) settled into different basins as capacity grew.

Neither the policy nor the discovery was mystical. The capacity to represent the strategy and the gradient signal to find it showed up together. The system did not "decide" to invent move 37. The optimiser fell into a basin that smaller networks did not contain.

Measurement effects

The Schaeffer et al. result (2023) is the cleanest worked example of measurement-shaped emergence. Several "emergent" capabilities reported on large LLMs softened or vanished when the benchmark scoring was switched from thresholded metrics (binary correctness on full answers) to continuous ones (such as token edit distance or Brier score). The capability had been rising the whole time. The binary metric couldn't see it until enough was there to clear the all-or-nothing bar.

This does not erase all emergence claims. Some capabilities still show non-linear curves on continuous metrics. It does mean that any claim of the form "capability X appeared suddenly at scale Y" should be cross-checked against a continuous metric before being treated as load-bearing. Apparent suddenness from continuous underlying change is the default explanation. Real mechanism change is the case you have to demonstrate.
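
The cross-check can be sketched with synthetic numbers. Everything below is assumed for illustration: the logistic capability curve, its midpoint and slope, the 8-token answer length, and the independence of per-token errors. No real model was evaluated.

```python
import math

def per_token_accuracy(log10_params, midpoint=9.0, slope=1.5):
    """Hypothetical smooth capability curve: a logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-slope * (log10_params - midpoint)))

def exact_match(p, answer_tokens):
    """All-or-nothing score, assuming independent per-token errors."""
    return p ** answer_tokens

def mean_token_log_likelihood(p):
    """Continuous score: log-probability per answer token, if the model puts mass p on it."""
    return math.log(p)

if __name__ == "__main__":
    for log10_params in range(6, 13):
        p = per_token_accuracy(log10_params)
        print(f"1e{log10_params} params  "
              f"exact-match={exact_match(p, 8):.3f}  "
              f"log-lik/token={mean_token_log_likelihood(p):.2f}")
```

Both columns come from the same smooth curve. The exact-match column stays near zero for the first several scales and then climbs, while the log-likelihood column improves at every step. If a reported capability shows the first shape but not the second, the sharpness is in the metric.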

Hardware gates which emergences are observable

Scaling depends on memory bandwidth between chips, interconnect across racks, and training stability over weeks of wall-clock time. A capability that would appear at 70B parameters is academic if you cannot afford to train it. A capability hidden behind 10 trillion training tokens is unreachable if the data pipeline can't sustain the throughput.

The set of emergences economically reachable in any given year is bounded by the chips available, the bandwidth across them, and the stability of training runs at the relevant scale. The hardware does not generate the emergence. It gates which emergences are observable.
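
A back-of-envelope calculation shows how quickly the gate closes. The sketch uses the common rule of thumb that training a dense transformer costs roughly 6 floating-point operations per parameter per token. The chip count, per-chip throughput, and utilisation in the example are hypothetical, and pairing 70B parameters with 10 trillion tokens in one run is an illustration, not a claim from the lesson.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

def training_days(params, tokens, chips, flops_per_chip, utilisation):
    """Wall-clock days for one run, given sustained throughput across all chips."""
    sustained = chips * flops_per_chip * utilisation
    seconds = training_flops(params, tokens) / sustained
    return seconds / 86_400

if __name__ == "__main__":
    # Hypothetical cluster: 4096 chips at 1e15 peak FLOP/s each, 40% utilisation.
    days = training_days(70e9, 10e12, chips=4096, flops_per_chip=1e15, utilisation=0.4)
    print(f"total FLOPs: {training_flops(70e9, 10e12):.2e}")
    print(f"estimated wall-clock days: {days:.1f}")
```

Halving the chip count or the utilisation doubles the wall-clock time, which is where training stability over weeks starts to matter.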

This is the systems-engineering frame on the topic. Capability formation in modern AI is not a story about software alone. It's a story about what specific arrangements of silicon, memory, and interconnect happen to make tractable in a given year.

Internal smooth, external sharp

Figure 4.1 separates the two views. The top plot puts the smooth internal curve and the sharp external curve on the same axes so the gap between them is visible. The bottom plot shows the optimisation-basin picture: scale doesn't only sharpen existing minima, it expands the set of basins the optimiser can reach.

FIG 4.1. Top: a smooth internal capability curve (green) and a thresholded external benchmark curve (red) on the same scale axis. Both end high; the visible shape differs. Bottom: a loss landscape with four basins of increasing depth. Progressively larger scales reach progressively more basins. B4 is only reachable past a scale threshold and implements a strategy smaller models could not.

The L1, L2, L3 view

In L1's system loop, scaling enriches the representation: more parameters, more positions in the embedding space, more capacity for the internal state to encode regularities. In L2's terms, a larger system can encode shorter descriptions of more complex regularities, so prediction improves where smaller systems were forced to memorise. In L3's terms, the abstractions that survive distribution shift become reachable at scales they weren't reachable at before. Emergence is what those three together look like from outside when the visible measurement is a thresholded benchmark on a compositional subskill.

What this lesson does not say

It does not say that all emergent claims are real. It does not say that all apparent emergence is artefact. It does not say the model is "waking up." It does not say consciousness sits anywhere in the picture.

The mechanism, where it's been chased down, has always been a combination of representation capacity, optimisation dynamics, and the shape of the measurement. The phenomenon is real. The explanation is mechanistic. The open questions (which capabilities are sharp by mechanism versus sharp by measurement, which optimisation basins are reachable at which scales, how compositional subskills combine into emergent behaviours) are interpretability questions. They are not mystical ones.

The takeaway

Complex systems can develop qualitatively new capabilities through scaling and optimisation pressure without violating mechanistic explanation. Continuous resource change can produce threshold-shaped capability appearance. The threshold can sit in the algorithm becoming representable, in the basin becoming reachable, or in the metric becoming sensitive enough. The honest engineering question is always: which of those, and how would I check.

The blank page (L2) taught you that prediction is the engine. The folded map (L3) taught you that the engine only matters if what it builds inside survives the journey out of the training distribution. The dust caught in sunlight teaches you that what survives the journey can change in kind when there is enough of it, and that the change is mechanistic even when it looks magical.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L4 A research team reports that a large language model gains the ability to perform 4-digit multiplication "suddenly" at a particular scale, while smaller models in the same family score zero on the benchmark. Outline 3 mechanistically distinct explanations for what could be happening, and one experiment you would run to distinguish them.

Three candidates. (1) Real representation capacity threshold: 4-digit multiplication needs the model to internally encode carry operations across digit positions in a way that demands a certain depth and width of attention plus MLP capacity. Below the scale where that algorithm becomes implementable in the weights, the model cannot do it at all. Above the scale, the algorithm fits, and gradient descent finds it. (2) Optimisation reachability: the basin in the loss surface that implements multiplication exists at smaller scales too, but the optimiser doesn't reach it given the available training compute. At larger scale the basin attracts a wider neighbourhood of initialisations, so the same training procedure now finds it. (3) Measurement-shaped sharpness: the model's underlying ability to predict correct answer tokens has been rising smoothly; the benchmark scores binary correctness on the entire answer string, which requires every digit right at once. If digit errors are independent, per-digit accuracy p on an n-digit answer gives exact-match accuracy p^n. On a 4-digit string, per-digit accuracy of 0.5 scores only about 0.06 and 0.7 about 0.24, while 0.9 scores 0.66, 0.95 scores 0.81 and 0.99 scores 0.96. The product of two 4-digit numbers runs to 7 or 8 digits, which makes the curve steeper still. The benchmark curve looks sharp; the underlying curve is smooth. Discriminator: switch the metric to per-token log-likelihood of the correct answer string and re-evaluate every checkpoint along the scaling curve. If the curve smooths out, the effect was measurement-shaped (case 3). If the curve still shows a sharp break, the change is in the model itself (case 1 or 2), and the next probe is interpretability: look at which attention heads activate on multiplication prompts before and after the break to localise where the new algorithm sits.

L4 Three claims, each common in the wild. (a) "Large LLMs are emergent at chain-of-thought arithmetic." (b) "Mixture-of-experts models suddenly route domain-specific tokens to specific experts past a certain training step." (c) "Vision models trained at scale suddenly generalise to occluded object categories they were never trained on." For each, name the most plausible mechanism, the most plausible benchmark-artefact explanation, and a check that would discriminate between the two.

(a) Mechanism: chain-of-thought benefits from internal attention pathways that can hold and update intermediate reasoning state across steps; smaller models lack the depth or width to maintain coherent intermediate state. Artefact: chain-of-thought outputs are scored by final-answer correctness; a model that gets 60% of intermediate steps right will still fail most binary tests (a five-step chain at 60% per step comes out fully correct about 8% of the time, if steps are independent). Discriminator: evaluate intermediate-step accuracy directly with per-step log-likelihood or human grading of reasoning steps. (b) Mechanism: the routing network's gating decisions reach a stable specialisation regime once enough gradient updates differentiate the experts; below that, gating is noisy. Artefact: routing entropy is measured per-batch, and small differences in averaging can produce visible step changes. Discriminator: plot per-expert input distributions on held-out data across many checkpoints with consistent batching; a smooth decline in cross-expert overlap is the smoking gun for gradual specialisation, while a sharp jump suggests a real bifurcation in gating dynamics. (c) Mechanism: representations of occluded versions of seen categories share enough features with the unoccluded training examples that abstraction transfer kicks in only when the representation has enough capacity to factor visible and occluded versions together. Artefact: the occlusion distribution in the test set may differ between the evaluation runs for different model sizes, or the metric may be top-1 accuracy on an all-or-nothing test. Discriminator: evaluate on multiple occlusion levels and report per-occlusion accuracy curves, not aggregated top-1.

↳ L5 (Forward interleave to L5, learning paradigms.) Across the four learning paradigms (supervised, unsupervised, self-supervised, reinforcement), which one would you expect to show the cleanest, most-easily-measured emergent capability transitions, and which the messiest? Sketch your reasoning in terms of what each paradigm's optimisation signal can shape, and what kinds of basins it can reach.

Cleanest: supervised. The optimisation signal is directly tied to a measurable target (label or token), and emergent capability transitions in supervised models can usually be traced to a specific subskill the loss function rewards more sharply at scale. The basins reached are the ones the labelled data shapes; the visible transitions track an externally defined error metric closely. Messiest: reinforcement. The optimisation signal is sparse, delayed, and depends on the model's own exploration. Emergent strategy formation in RL (AlphaZero-style discovery, in-context tool use) can appear sharply, but the basin that produced it is downstream of the policy's whole exploration history, which is hard to factor cleanly. RL emergence is hard to attribute because the optimiser's path through the loss landscape was self-determined; with supervised learning, the path was data-determined. Self-supervised sits in between: the signal is dense (per-token loss) but the basins reached are not narrowly tied to any one downstream capability, which is why emergence in LLMs is often surprising. Unsupervised (clustering, density estimation) shows emergence less dramatically because the objective doesn't ask for any particular subskill. The whole comparison points forward to L5: which learning signal you choose determines what gradient information the optimiser actually receives, which determines which basins are reachable, which determines what can emerge.

Next station

Lesson 5 sits at the toolbox on the bench (station 5) and opens the drawers. Supervised, unsupervised, self-supervised, reinforcement: 4 fundamentally different shapes of learning signal, each one biasing which basins on the loss landscape the optimiser can reach at all.
