You've finished Phase 1. You have the systems vocabulary: prediction, compression, generalisation, emergence, learning signals, sequential decision-making, representation, tokenisation, embedding geometry. The honest thing to do before opening Phase 2's maths is to draw, with that vocabulary, the actual perimeter of what modern AI can and cannot do.
Public discourse around AI rarely does this. The dominant frames are "this changes everything" and "this is statistical autocomplete." Both make the same error: they treat capability as a single number. Real capability is a surface in many dimensions, and the surface is jagged. Tasks that feel human and difficult are sometimes trivial for modern systems; tasks that feel obviously easy are sometimes catastrophically hard.
This lesson is a map of the surface, drawn mechanistically.
Three broad regions, all real and simultaneously true of the same model.
Strong region. Tasks where pattern completion against a vast training corpus, supported by good representations, is most of what's needed. Code completion against well-known APIs. Translation between high-resource languages. OCR. Summarisation of in-distribution documents. Semantic search across a corpus the embeddings have absorbed. Multimodal captioning of common imagery. The training signal aligned closely with the deployment task; the optimiser was rewarded for producing exactly what's now being asked for.
Soft region. Tasks where the answer requires recombining training-data patterns in ways that are plausible but not directly memorised. Most LLM "creative writing", routine analysis, drafting of standard professional documents, explaining well-known technical concepts. Output is fluent, coherent, and usually correct on familiar topics; confidently wrong on details outside the training distribution.
Weak region. Tasks that require sustained planning, long-horizon coherence, novel composition of subskills, grounding in the physical world, or accurate computation the training data didn't supply directly. Robotic manipulation in unstructured environments. Multi-step mathematical proofs the model has never seen. Agentic loops requiring 50 self-consistent steps. Reliable arithmetic at lengths the tokeniser doesn't favour.
The boundary between these regions doesn't map onto human difficulty. "Translate this paragraph from Japanese to English" is easy for a modern LLM and hard for most humans. "Reverse the letters in 'strawberry'" is hard for a modern LLM (because of tokenisation, L8) and trivial for any human.
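The letter-reversal case can be sketched in a few lines. This is a minimal illustration, not a real tokeniser: the three-way split of "strawberry" below is a hypothetical segmentation chosen for the example, and actual subword vocabularies will split the word differently.

```python
# Character-level view: what a human (or plain Python) operates on.
word = "strawberry"
char_reversed = word[::-1]

# Token-level view: a HYPOTHETICAL subword split, assumed for illustration.
# A real tokeniser's vocabulary would produce its own pieces.
tokens = ["str", "aw", "berry"]

# Reversing the units the model actually sees does not reverse the letters.
token_reversed = "".join(reversed(tokens))

print(char_reversed)   # yrrebwarts
print(token_reversed)  # berryawstr
```

The individual letters are never directly present in the model's input; they have to be recovered from whatever the token embeddings happen to encode about spelling.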
A useful frame from classical machine learning. Interpolation is producing outputs for inputs within the convex hull of training data; the model has seen things like this and is filling in. Extrapolation is producing outputs for inputs outside that hull; the model has to extend rather than fill.
Modern AI is excellent at interpolation, much weaker at extrapolation. This is mechanically clean: the optimiser was rewarded for matching the training distribution. Inputs near the centre get fluent, accurate responses. Inputs at the edge get plausible-but-shakier responses. Inputs beyond it get confident extrapolation that may or may not be right.
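A one-dimensional toy makes the distinction concrete. This is a sketch under simplifying assumptions: the "model" is an ordinary least-squares line, the true function is y = x², and the training inputs cover only the interval [0, 1], which in one dimension is the convex hull. None of these choices come from a real system.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Training data: eleven points on [0, 1] drawn from the true function x**2.
train_x = [i / 10 for i in range(11)]
train_y = [x * x for x in train_x]
slope, intercept = fit_line(train_x, train_y)

def predict(x):
    return slope * x + intercept

# Inside the training interval the fit is a tolerable approximation.
inside_error = abs(predict(0.5) - 0.5 ** 2)
# Outside it, the same model answers just as readily and is badly wrong.
outside_error = abs(predict(3.0) - 3.0 ** 2)

print(f"error at x=0.5: {inside_error:.2f}")
print(f"error at x=3.0: {outside_error:.2f}")
```

Nothing in `predict` signals that x = 3.0 lies outside the data it was fitted on, which is the one-dimensional version of confident extrapolation.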
The training distribution is enormous for frontier LLMs (most of the public internet, much of the published scientific literature, large code corpora) so the interpolation region is vast. But it has edges. Novel mathematical proofs sit outside it. Novel engineering designs sit outside it. The exact tail of a long-tail medical condition sits outside it. The system extrapolating into those regions doesn't know it has crossed the boundary.
The most-cited failure mode and the most-misunderstood. Hallucination isn't the model lying. It's the predictable behaviour of a next-token optimiser that has no truth-grounding mechanism.
The training objective rewards the model for producing tokens that are statistically likely given the context. It does not reward the model for producing tokens that are true. Most of the time these align (true statements are usually statistically likely in a well-trained model) but not always. When a query lands outside the model's reliable interpolation region, the most-likely-next-tokens often form a plausible but fictional answer. Hallucinated legal citations, fabricated bibliographies, invented function names in code, confident wrong arithmetic, confidently misremembered facts. All of it is the same mechanism: prediction without grounding.
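The mechanism shows up even in a model far too small to be useful. The sketch below is a toy bigram counter over an invented four-sentence corpus, standing in (very loosely) for a next-token predictor; the corpus and the function name `most_likely_next` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; every sentence is made up for this example.
corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of spain is madrid",
    "the capital of france is paris",
]

# Count which word follows each single-word context (a bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def most_likely_next(word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

# Prompt: "the capital of atlantis is ...". The country never appeared in
# training, but the final context word "is" did, so the model still answers.
answer = most_likely_next("is")
print(answer)  # paris
```

The objective was satisfied: "paris" is the statistically likeliest continuation. Nothing in the counting procedure checks whether the completed sentence is true.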
The model exposes no calibrated uncertainty signal. From outside, a well-founded confident answer and a hallucination look the same. The internal distributions over next tokens carry some uncertainty information, but it doesn't reliably surface in the generated text.
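One way to see the gap between internal uncertainty and visible output is to compare the entropy of two next-token distributions that decode to the same token. The logit values below are invented for illustration; a real model would produce a distribution over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means the distribution is more spread out."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits over a four-token vocabulary.
confident = softmax([10.0, 0.0, 0.0, 0.0])   # one token dominates
uncertain = softmax([1.0, 1.0, 1.0, 1.0])    # all tokens equally likely

# Greedy decoding picks the highest-probability token (first index on ties).
greedy_confident = confident.index(max(confident))
greedy_uncertain = uncertain.index(max(uncertain))

print(greedy_confident, round(entropy_bits(confident), 3))
print(greedy_uncertain, round(entropy_bits(uncertain), 3))
```

Both cases emit token 0, yet the second distribution is maximally uncertain at 2 bits. The difference exists inside the model and disappears once a single token is written out.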
Mitigation comes from grounding the prediction in external facts. Retrieval-augmented generation (RAG, L59) is the dominant pattern: retrieve relevant documents at query time, condition generation on them, ask the model to cite. This doesn't eliminate hallucination but reduces it sharply on factual queries. Tool use (let the model call a calculator, a search index, a database) reduces it further on tasks where ground truth is available externally.
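The retrieve-then-condition pattern can be sketched without any model at all. The example below is a deliberately crude stand-in: it ranks documents by bag-of-words cosine similarity rather than learned embeddings, and the document ids, texts, and prompt wording are all hypothetical. The final prompt would be handed to an LLM, which is not shown.

```python
import math
from collections import Counter

# Hypothetical two-document corpus, invented for illustration.
documents = {
    "doc-a": "the kv cache stores attention keys and values for every generated token",
    "doc-b": "domain randomisation varies simulator parameters during robot training",
}

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Return the ids of the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(docs[d])), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=1):
    """Condition generation on retrieved text and ask for citations by id."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in retrieve(query, docs, k))
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"{context}\n"
        f"Question: {query}"
    )

prompt = build_prompt("what does the kv cache store", documents)
print(prompt)
```

The facts now travel in the prompt rather than in the weights, so correcting an answer means editing `documents`, not retraining anything.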
Modern LLMs reason. The reasoning is pattern-based, not symbolic, and the distinction matters.
Pattern-based reasoning means the model has seen many examples of structured arguments, mathematical proofs, code derivations, and so on, and learned to produce similarly-structured outputs. For tasks whose structure resembles training data, the resulting reasoning is often correct. For tasks that require genuinely novel composition of subskills, it stumbles.
Chain-of-thought prompting (asking the model to "think step by step") improves performance because it gives the model more inference-time compute and more tokens to work with. Each intermediate token gives the model another chance to refine its internal state. Test-time compute scaling, where the model is given large inference budgets to explore many reasoning paths, has produced the so-called reasoning models from 2024 onward. These are still pattern-based reasoners with more time to think.
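A rough feel for why sampling many reasoning paths helps comes from a majority-vote calculation. The sketch assumes something real systems do not guarantee: each sampled answer is independently correct with the same probability p, and there are only two possible outcomes (right or wrong). The value p = 0.6 is illustrative.

```python
from math import comb

def majority_correct(p, n):
    """P(more than half of n independent samples are correct); use odd n.

    Assumes each sample is right with probability p, independently,
    which is a strong simplification of real sampled reasoning paths.
    """
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for n in (1, 3, 5, 15):
    print(n, round(majority_correct(0.6, n), 4))
```

Under these assumptions accuracy rises with the number of samples, but only because p is above one half. If the single-sample reasoner is wrong more often than right, voting makes the wrong answer more stable, not less.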
The brittle part is compositionality. A model that can solve problem A and problem B reliably will often fail on a problem requiring composition of A and B in a novel way, especially if that composition wasn't well-represented in training data. This is downstream of representations: compositional structure that the embeddings encoded transfers; compositional structure they didn't encode has to be reconstructed inside the layers each time.
The hardest region of the capability map. Modern AI is far weaker at controlling a physical robot than at producing fluent text.
The reasons trace back through the phase. Training data is sparse (every robot interaction is expensive in wall-clock time, L6). State representation has to carry partial-observability noise, sensor uncertainty, and actuation latency. The reward signal is delayed and shaped by physical consequences the model can't directly observe. The optimiser has to learn through interaction, with all the sample-inefficiency RL brings.
Simulation closes some of the gap (train at 1000× wall-clock speed in software, transfer to real) but sim-to-real is its own distribution-shift problem. Domain randomisation, real-world fine-tuning, and stronger inductive bias in state representations all help. Modern robot manipulation has improved sharply since 2020. Reliable grasping of unfamiliar objects in cluttered environments remains an open problem in 2026.
The general principle: embodied interaction with the physical world is harder than text prediction because the world doesn't come with labels, the data pipeline doesn't scale to internet sizes, and small errors compound across trajectories. Grounding is the bottleneck.
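The compounding claim is simple arithmetic. As a sketch, assume each step of a trajectory succeeds independently with probability p and that a single failure is unrecoverable; real systems can sometimes correct themselves, so treat this as a pessimistic bound. The step counts and probabilities are illustrative.

```python
def trajectory_success(per_step_success, steps):
    """P(every step succeeds), assuming independent steps and no recovery."""
    return per_step_success ** steps

for p in (0.99, 0.95):
    for steps in (10, 20, 50):
        print(f"p={p}, steps={steps}: {trajectory_success(p, steps):.3f}")
```

A step that works 95 times in 100 yields a 50-step trajectory that completes less than one time in ten, which is why per-step reliability that looks excellent in isolation is not enough for long horizons.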
The capability profile shifts substantially when the model is paired with retrieval and tool use.
RAG systems consistently outperform standalone LLMs on factual question-answering, technical reference lookup, and any task where the answer is locatable in a curated corpus. The model's role becomes synthesis and presentation; the facts come from retrieved context. Factuality improves; hallucination on retrieved-domain queries drops; the maintenance model shifts from "retrain the LLM" to "update the index".
Tool use extends this. A model that can call a calculator does arithmetic at machine precision. A model that can query a SQL database answers factual queries about that database with database-precision. A model that can invoke a search engine acquires current information past its training cutoff. The capability surface of the system is much larger than the capability surface of the underlying model.
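A calculator tool can be sketched as a small, restricted expression evaluator plus a dispatch table. The tool-call format (`{"tool": ..., "input": ...}`) and the names `TOOLS` and `run_tool_call` are hypothetical; real tool-calling interfaces differ by provider, and the model that emits the call is not shown.

```python
import ast
import operator

# Only these arithmetic operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculator(expression):
    """Evaluate a plain arithmetic expression without using eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

# Hypothetical registry: tool name -> Python callable.
TOOLS = {"calculator": calculator}

def run_tool_call(call):
    """Dispatch a structured tool call (assumed format) to its implementation."""
    return TOOLS[call["tool"]](call["input"])

# In a real system the model would emit this structure instead of guessing digits.
result = run_tool_call({"tool": "calculator", "input": "123456789 * 987654321"})
print(result)
```

The model's job shrinks to deciding that a calculation is needed and writing the expression; the digits come from exact integer arithmetic rather than from next-token prediction.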
Production AI systems are usually hybrids: an LLM doing language synthesis, with retrieval, tool use, structured constraints, and classical machine-learning components surrounding it. Treating the LLM as the whole system is a framing mistake.
Capability is gated by what compute the system has at inference time.
The context window is the most visible constraint. In standard self-attention, compute grows roughly quadratically with token count and KV-cache memory grows linearly, so past a certain length further extension becomes uneconomic. Long-context degradation ("lost in the middle") is a known effect: even within the formal context limit, information placed in the middle of a long context is retrieved less reliably than information near the start or end.
Inference latency caps interactive use. A model that takes 30 seconds to produce a response can do more reasoning per query than a model constrained to 100 ms but loses interactivity. The trade-off is per-deployment.
Test-time compute scaling has become a major lever in 2024-2026. Spending more compute at inference (more chain-of-thought tokens, more sampled solutions, more iterative refinement) directly improves capability on hard problems. The cost is wall-clock latency and dollars per query.
Memory bottlenecks limit what models can be served on what hardware. A 70B-parameter model with a 32K context window won't fit on a single consumer GPU; quantisation, smaller models, and aggressive KV-cache compression are the production responses. Hardware does not just constrain training; it shapes which capability profiles are economically reachable in deployment.
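The 70B example can be checked with back-of-envelope arithmetic. The sketch below counts only weights and KV cache, ignoring activations and runtime overhead. The architecture numbers (80 layers, 8 key/value heads, head dimension 128) and the 24 GiB consumer-GPU figure are assumptions chosen for illustration, not values taken from this lesson.

```python
GIB = 1024 ** 3

def weight_bytes(n_params, bits_per_param):
    """Memory needed to hold the weights at a given precision."""
    return n_params * bits_per_param // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    """Keys and values (hence the factor 2) for one sequence, all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

N_PARAMS = 70_000_000_000
weights_fp16 = weight_bytes(N_PARAMS, 16)
weights_int4 = weight_bytes(N_PARAMS, 4)   # 4-bit quantisation

# ASSUMED architecture for illustration; real 70B models vary.
kv_32k = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                        seq_len=32 * 1024, bytes_per_value=2)

CONSUMER_GPU = 24 * GIB  # assumed card size

print(f"fp16 weights: {weights_fp16 / GIB:.1f} GiB")
print(f"int4 weights: {weights_int4 / GIB:.1f} GiB")
print(f"KV cache at 32K tokens: {kv_32k / GIB:.1f} GiB")
print(f"fits at fp16: {weights_fp16 + kv_32k <= CONSUMER_GPU}")
```

Under these assumptions the fp16 weights alone are several times the size of the card, and even 4-bit weights exceed it before any cache is allocated, which is why the production responses combine quantisation with smaller models and cache compression rather than relying on one of them.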
A short catalogue of how systems fail in practice. Hallucinated citations in research outputs. Confident wrong arithmetic on multi-digit problems. Plausible-but-fictional function signatures in code generation. Brittle agent planning that loses coherence after 10-20 steps. Multimodal misclassification of out-of-distribution images. Prompt-injection vulnerabilities where adversarial input changes downstream behaviour. Overconfident output on out-of-distribution medical or legal queries.
None of these are mysterious. Each maps cleanly to mechanism: optimisation against a particular objective, with a particular representation, on particular hardware, encountering inputs the training distribution did not cover.
Figure 10.1 puts the strong, soft, and weak regions on one chart. The jagged perimeter is the central claim. The contour overlays show how adding retrieval (RAG) or test-time compute reshapes specific axes without making the whole surface uniform.
In L1's terms, capability is what the system loop produces when run end to end. In L2's terms, modern AI excels where prediction and compression of the training distribution generalise to deployment. In L3's terms, capability sits inside the training distribution and the inductive biases that bridge to it. In L4's terms, capability emerges from scale only when the right representation makes the algorithm reachable. In L5's terms, capability depends entirely on what learning signal shaped the system. In L6's terms, long-horizon and embodied capability is hard because the signal is sparse and delayed. In L7's terms, capability is downstream of representation quality. In L8's terms, tokenisation quietly determines what's expressible and where the system stumbles. In L9's terms, retrieval is geometry, and that geometry is where reliable factuality now lives.
AI is neither magic nor fake. It is a set of optimisation systems whose strengths and weaknesses are downstream of objective, representation, architecture, training data, and hardware. The capability surface is jagged. The boundary between "trivial for this system" and "fails confidently" doesn't follow human intuition.
Once you can think mechanistically about why a system succeeds or fails, the field stops being surprising in either direction.
This closes Phase 1. The bench has 10 stations: an intelligence system, prediction-as-compression, generalisation, emergence, learning signals, sequential decision-making, representation, tokens, embeddings, and an honest capability map. The next phase opens the whiteboard wall and writes down the maths.
The bench is complete. Lesson 11 walks across the workshop to the whiteboard wall (station 11) and writes vectors, where Phase 1's geometric intuition becomes Phase 2's mathematical apparatus.