A neural net trained to predict the next token can do 3-digit addition. Smaller nets in the same family cannot. The capability arrived somewhere along the scaling curve. Looked at from outside, the model "learned arithmetic" at a particular size.
Two reactions are common. The first is mystical: the model crossed a hidden boundary and acquired something the smaller siblings lacked. The second is dismissive: it's an evaluation trick and the whole effect is benchmark theatre.
Both skip the mechanism. The mechanism is what this lesson installs.
Call a behaviour emergent when it is absent at smaller scale, present at larger scale, with a visibly sharp transition between the two. The sharpness lives on whatever axis you happen to be measuring. The question is what's happening inside.
Two different things can produce that sharpness. The first is a real change of mechanism at scale. The model's internal state crosses a point where some algorithm becomes implementable in the weights, or where optimisation settles into a different basin of the loss surface, or where a representation gets sharp enough to support a downstream subskill. The capability appears because something inside the system genuinely changed.
The second is measurement-shaped sharpness. The underlying capability rose smoothly, but the benchmark scoring is thresholded (a 4-digit answer is either right or wrong, no partial credit), so the score stays at zero until the partial capability is good enough to clear the bar, then jumps. Same internal trajectory, different visible curve.
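A minimal sketch of that arithmetic, assuming per-token errors are independent (a simplification made here for illustration, not a claim about any real model). The function name `exact_match_rate` and the accuracy values in the sweep are hypothetical. If each token is right with probability p, an all-or-nothing score on an n-token answer is p to the power n.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is right.

    Assumes independent per-token errors, which is an illustrative
    simplification rather than a property of real models.
    """
    return per_token_accuracy ** answer_length


def sweep(accuracies, answer_length):
    """Pair each smooth per-token accuracy with its thresholded score."""
    return [(p, exact_match_rate(p, answer_length)) for p in accuracies]


if __name__ == "__main__":
    # Hypothetical smooth improvement in the underlying capability.
    for p, score in sweep([0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99], answer_length=10):
        print(f"per-token {p:.2f} -> exact-match {score:.3f}")
```

With a 10-token answer, the per-token column climbs steadily from 0.50 to 0.99 while the exact-match column stays near zero for most of the sweep and rises steeply only at the end. Nothing in the underlying trajectory is discontinuous; the exponent does all the work.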
Both happen. Both are real. Treating them as the same thing is the source of most confusion in this area.
Drop conducting beads at random into an insulating substrate. At low density, the beads form isolated clusters and current cannot get across. Past a critical density, a connected path threads the substrate end-to-end and the substrate conducts. The transition is sharp on the conductance axis. The density that produces it is continuous, and the mechanism (beads touching beads) is fully understood.
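The bead picture can be sketched as site percolation on a square grid, which is a simplifying assumption: beads occupy grid cells independently with probability `density`, and "conducts" means an occupied 4-connected path joins the top row to the bottom row. The grid size, trial count, and function names below are illustrative choices, not part of the original description.

```python
import random
from collections import deque


def spans(grid):
    """True if occupied cells form a 4-connected path from top row to bottom row."""
    n = len(grid)
    seen = [[False] * n for _ in range(n)]
    queue = deque()
    for c in range(n):
        if grid[0][c]:
            seen[0][c] = True
            queue.append((0, c))
    # Breadth-first search downward from every occupied cell in the top row.
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and not seen[nr][nc]:
                seen[nr][nc] = True
                queue.append((nr, nc))
    return False


def spanning_fraction(density, size=30, trials=100, seed=0):
    """Fraction of random grids at this bead density that conduct end-to-end."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        grid = [[rng.random() < density for _ in range(size)] for _ in range(size)]
        hits += spans(grid)
    return hits / trials


if __name__ == "__main__":
    for density in (0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8):
        print(f"density {density:.2f} -> conducts in {spanning_fraction(density):.0%} of grids")
```

Sweeping the density in even steps, the conducting fraction stays near zero, then climbs to near one over a narrow band around the square-lattice site-percolation threshold (about 0.59). Larger grids make the band narrower.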
A scaling neural net has something of this flavour. As parameters, data, and optimisation steps increase, the network's representational capacity expands. At some scale, the representation needed for a particular subskill fits within that capacity. Below that scale, the optimiser cannot fit the subskill no matter how long it runs; above it, the optimiser can find it. Reachability is binary at the algorithm level. The capacity that produces reachability is continuous.
The analogy is a framing device and nothing more. It does not predict where a given threshold sits for a given subskill, and it shouldn't be pushed past that role. The point is that "continuous input, discontinuous observable" is a known engineering shape, not something specific to AI. Physicists call the general pattern a phase transition.
A branch predictor with 1 bit of history cannot capture a pattern of period 4 such as taken, taken, taken, not-taken. It does no better than always guessing the common direction, and it mispredicts the odd branch out on every pass. Two bits of history, still no better. Give the predictor enough history to tell every position in the pattern apart (three bits for this pattern) and it goes from stuck to nearly perfect on that pattern class in a single step. The history grew linearly; the observable performance jumped at a specific point.
Continuous resource change, threshold-shaped capability appearance, fully understood mechanism. This is a threshold effect running on silicon, in a budget of a few hundred bits, and it's been shipping in CPUs for decades.
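The predictor threshold can be reproduced in a few lines. This is a sketch under stated assumptions: the predictor is modelled as a table of 2-bit saturating counters indexed by the last `history_bits` outcomes, and the branch follows the period-4 pattern taken, taken, taken, not-taken. Real predictors are more elaborate; the function name and parameters are illustrative.

```python
def predictor_accuracy(pattern, history_bits, passes=200, warmup_passes=20):
    """Steady-state accuracy of a history-indexed table of 2-bit counters.

    Each table entry is a saturating counter in 0..3; a value >= 2 means
    "predict taken". Accuracy is measured only after the warm-up passes.
    """
    counters = [2] * (1 << history_bits)
    mask = (1 << history_bits) - 1
    history = 0
    correct = total = 0
    for p in range(passes):
        for taken in pattern:
            prediction = counters[history] >= 2
            if p >= warmup_passes:
                total += 1
                correct += (prediction == taken)
            # Train the counter selected by the current history.
            if taken:
                counters[history] = min(3, counters[history] + 1)
            else:
                counters[history] = max(0, counters[history] - 1)
            # Shift the actual outcome into the history register.
            history = ((history << 1) | int(taken)) & mask
    return correct / total


PATTERN = [True, True, True, False]  # period 4

if __name__ == "__main__":
    for bits in range(5):
        print(f"{bits} history bits -> accuracy {predictor_accuracy(PATTERN, bits):.2f}")
```

For this pattern, 0, 1, and 2 bits of history all sit at 0.75, the same as always predicting taken, because the history before the not-taken branch looks identical to the history before a taken one. At 3 bits every position in the pattern has a distinct history and accuracy goes to 1.00.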
A cache slightly too small for a workload's working set thrashes. Hit rate is low. Add a few percent of capacity, crossing the working set size, and hit rate climbs from low to near-complete in a narrow band. Capacity changed smoothly. Observed performance changed sharply.
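A sketch of the cache cliff, assuming an LRU replacement policy and a workload that scans its working set cyclically. That combination is the sharpest possible case and is chosen here for illustration; workloads with more locality show a softer curve. The function name and sizes are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(capacity, working_set, passes=50):
    """Hit rate of an LRU cache under a cyclic scan over `working_set` keys."""
    cache = OrderedDict()  # insertion order doubles as recency order
    hits = total = 0
    for _ in range(passes):
        for key in range(working_set):
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)         # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


if __name__ == "__main__":
    for capacity in (90, 95, 99, 100, 105):
        rate = lru_hit_rate(capacity, working_set=100)
        print(f"capacity {capacity} -> hit rate {rate:.2f}")
```

With a working set of 100 keys, every capacity below 100 gives a hit rate of zero, because each key is evicted just before it is needed again. At capacity 100 only the first pass misses, so the hit rate jumps to 0.98 over 50 passes.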
Same shape as the predictor example. Same shape as the percolation example. Same shape, it turns out, as much of what gets reported as "emergent" in scaling neural nets.
In-context learning is a worked example. A small transformer effectively cannot use prompt examples to perform tasks it wasn't directly trained on. A large transformer can. Somewhere along the scaling curve, the network develops attention patterns that route information from earlier in the context to influence later predictions in task-specific ways. The capacity to encode that routing in the weights, and the optimisation signal that pushes the weights toward it, both grow with scale. Below a certain combined budget, the routing isn't representable in the available capacity or isn't reachable from the optimiser's starting point. Above it, both conditions clear. The behaviour appears.
Chain-of-thought prompting has the same shape. Small models cannot meaningfully use the extra inference-time compute a reasoning chain offers, because they don't have the internal abstractions needed to chain steps reliably. Larger models do. The capability is gated by what the representations can support, not by anything in the prompt itself.
Picture the loss surface as terrain. A smaller model's parameter space supports a smaller set of distinct minima the optimiser might end up in. A larger model supports more, and qualitatively different, basins. Some of those basins correspond to algorithmic strategies smaller models had no room for. Scaling does not only make existing minima deeper. It adds basins that weren't reachable before.
The right question becomes not "did the model learn X" but "did the optimiser, given this scale and this data, fall into the optimisation basin that implements X." Sometimes the basin exists at the new scale and the optimiser misses it. Sometimes the basin only exists past a particular scale. Sometimes the basin is reachable and used, but the visible capability stays hidden behind a thresholded measurement. All three look the same from outside if you're squinting at a benchmark score.
AlphaGo's smaller policy networks played positionally reasonable Go without surprising moves. The version of AlphaGo that faced Lee Sedol in 2016 played moves human masters had not produced in tournament play, including move 37 of game 2, and AlphaGo Zero and later AlphaZero, with larger networks and richer self-play data, continued the pattern. The strategy space the larger policy networks could represent included plans the smaller networks couldn't. The optimisation procedure (self-play, MCTS-guided policy improvement) settled into different basins as capacity grew.
Neither the policy nor the discovery was mystical. The capacity to represent the strategy and the gradient signal to find it showed up together. The system did not "decide" to invent move 37. The optimiser fell into a basin that smaller networks did not contain.
The Schaeffer et al. result (2023) is the cleanest worked example of measurement-shaped emergence. Several "emergent" capabilities reported on large LLMs softened or vanished when the benchmark scoring was switched from thresholded metrics (binary correctness on full answers) to continuous ones that give partial credit (such as token edit distance or Brier score). The capability had been rising the whole time. The binary metric couldn't see it until enough was there to clear the all-or-nothing bar.
This does not erase all emergence claims. Some capabilities still show non-linear curves on continuous metrics. It does mean that any claim of the form "capability X appeared suddenly at scale Y" should be cross-checked against a continuous metric before being treated as load-bearing. Apparent suddenness from continuous underlying change is the default explanation. Real mechanism change is the case you have to demonstrate.
Scaling depends on memory bandwidth between chips, interconnect across racks, and training stability over weeks of wall-clock time. A capability that would appear at 70B parameters is academic if you cannot afford to train it. A capability hidden behind 10 trillion training tokens is unreachable if the data pipeline can't sustain the throughput.
The set of emergences economically reachable in any given year is bounded by the chips available, the bandwidth across them, and the stability of training runs at the relevant scale. The hardware does not generate the emergence. It gates which emergences are observable.
This is the systems-engineering frame on the topic. Capability formation in modern AI is not a story about software alone. It's a story about what specific arrangements of silicon, memory, and interconnect happen to make tractable in a given year.
Figure 4.1 separates the two views. The top plot puts the smooth internal curve and the sharp external curve on the same axes so the gap between them is visible. The bottom plot shows the optimisation-basin picture: scale doesn't only sharpen existing minima, it expands the set of basins the optimiser can reach.
In L1's system loop, scaling enriches the representation: more parameters, more positions in the embedding space, more capacity for the internal state to encode regularities. In L2's terms, a larger system can encode shorter descriptions of more complex regularities, so prediction improves where smaller systems were forced to memorise. In L3's terms, the abstractions that survive distribution shift become reachable at scales they weren't reachable at before. Emergence is what those three together look like from outside when the visible measurement is a thresholded benchmark on a compositional subskill.
This account does not say that all emergence claims are real. It does not say that all apparent emergence is artefact. It does not say the model is "waking up." It does not say consciousness sits anywhere in the picture.
The mechanism, where it's been chased down, has always been a combination of representation capacity, optimisation dynamics, and the shape of the measurement. The phenomenon is real. The explanation is mechanistic. The open questions (which capabilities are sharp by mechanism versus sharp by measurement, which optimisation basins are reachable at which scales, how compositional subskills combine into emergent behaviours) are interpretability questions. They are not mystical ones.
Complex systems can develop qualitatively new capabilities through scaling and optimisation pressure without violating mechanistic explanation. Continuous resource change can produce threshold-shaped capability appearance. The threshold can sit in the algorithm becoming representable, in the basin becoming reachable, or in the metric becoming sensitive enough. The honest engineering question is always: which of those, and how would I check.
The blank page (L2) taught you that prediction is the engine. The folded map (L3) taught you that the engine only matters if what it builds inside survives the journey out of the training distribution. The dust caught in sunlight teaches you that what survives the journey can change in kind when there is enough of it, and that the change is mechanistic even when it looks magical.
Lesson 5 sits at the toolbox on the bench (station 5) and opens the drawers. Supervised, unsupervised, self-supervised, reinforcement: four fundamentally different shapes of learning signal, each one biasing which basins on the loss landscape the optimiser can reach at all.