PHASE 1 · FOUNDATIONS OF INTELLIGENCE

10 / 78

The current AI capability perimeter

Lesson 10. Phase 1: Foundations of intelligence. ~26 min read + cards + retrieval. Durability tier 3 (the perimeter shifts year over year; the mechanism behind it does not).

Memory palace · Bench · station 10

The dust cover. Lift it and you see the actual outline of the machine. Distortions and hype removed; the real operational perimeter underneath.

Core idea. Modern AI systems are highly capable but deeply uneven optimisation systems whose strengths and weaknesses emerge from their training objectives, representations, architectures, and hardware constraints.

Why this lesson exists

You've finished Phase 1. You have the systems vocabulary: prediction, compression, generalisation, emergence, learning signals, sequential decision-making, representation, tokenisation, embedding geometry. The honest thing to do before opening Phase 2's maths is to draw, with that vocabulary, the actual perimeter of what modern AI can and cannot do.

Public discourse around AI rarely does this. The dominant frames are "this changes everything" and "this is statistical autocomplete." Both make the same error: they treat capability as a single number. Real capability is a surface in many dimensions, and the surface is jagged. Tasks that feel human and difficult are sometimes trivial for modern systems; tasks that feel obviously easy are sometimes catastrophically hard.

This lesson is a map of the surface, drawn mechanistically.

Capability is jagged

Three broad regions, all real and simultaneously true of the same model.

Strong region. Tasks where pattern completion against a vast training corpus, supported by good representations, is most of what's needed. Code completion against well-known APIs. Translation between high-resource languages. OCR. Summarisation of in-distribution documents. Semantic search across a corpus the embeddings have absorbed. Multimodal captioning of common imagery. The training signal aligned closely with the deployment task; the optimiser was rewarded for producing exactly what's now being asked for.

Soft region. Tasks where the answer requires recombining training-data patterns in ways that are plausible but not directly memorised. Most LLM "creative writing", routine analysis, drafting of standard professional documents, explaining well-known technical concepts. Output is fluent, coherent, and usually correct on familiar topics; confidently wrong on details outside the training distribution.

Weak region. Tasks that require sustained planning, long-horizon coherence, novel composition of subskills, grounding in the physical world, or accurate computation the training data didn't supply directly. Robotic manipulation in unstructured environments. Multi-step mathematical proofs the model has never seen. Agentic loops requiring 50 self-consistent steps. Reliable arithmetic at lengths the tokeniser doesn't favour.

The boundary between these regions doesn't map onto human difficulty. "Translate this paragraph from Japanese to English" is easy for a modern LLM and hard for most humans. "Reverse the letters in 'strawberry'" is hard for a modern LLM (because of tokenisation, L8) and trivial for any human.
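
The 'strawberry' asymmetry is easier to see with a toy tokeniser. The sketch below is illustrative only: `TOY_VOCAB`, its IDs, and the greedy longest-match rule are invented for this example, and real tokenisers learn their subwords from data. The point is that the model receives opaque integer IDs, while ordinary code with character access finds the task trivial.

```python
# Toy illustration only: TOY_VOCAB and its IDs are invented for this sketch.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103, "s": 1, "t": 2, "r": 3,
             "a": 4, "w": 5, "b": 6, "e": 7, "y": 8}


def tokenise(word: str) -> list[int]:
    """Greedy longest-match segmentation into toy subword IDs."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, then shrink.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids


word = "strawberry"
print(tokenise(word))   # what the model receives: three integer IDs, no letters
print(word[::-1])       # character-level view: reversal is one slice
```

Nothing in the three IDs says which letters they contain or in what order; that knowledge has to be inferred from training statistics.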

Interpolation vs extrapolation

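A useful frame from classical machine learning. Interpolation is producing outputs for inputs that fall within the region the training data covers (formally, its convex hull); the model has seen things like this and is filling in. Extrapolation is producing outputs for inputs outside that region; the model has to extend rather than fill. For high-dimensional inputs the frame is used loosely: "inside" means close to training examples in the model's representation space, not strictly inside a geometric hull.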

Modern AI is excellent at interpolation, much weaker at extrapolation. This is mechanically clean: the optimiser was rewarded for matching the training distribution. Inputs near the centre get fluent, accurate responses. Inputs at the edge get plausible-but-shakier responses. Inputs beyond it get confident extrapolation that may or may not be right.

The training distribution is enormous for frontier LLMs (most of the public internet, much of the published scientific literature, large code corpora) so the interpolation region is vast. But it has edges. Novel mathematical proofs sit outside it. Novel engineering designs sit outside it. The exact tail of a long-tail medical condition sits outside it. The system extrapolating into those regions doesn't know it has crossed the boundary.
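
A one-dimensional toy makes the edge visible. This is a minimal sketch, not a neural network: it assumes a 3-nearest-neighbour regressor (a stand-in chosen for simplicity) trained on sin(x) over [0, 2π]. The function names and test ranges are hypothetical.

```python
import math


def knn_predict(train, x, k=3):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_abs_error(train, xs):
    return sum(abs(knn_predict(train, x) - math.sin(x)) for x in xs) / len(xs)


# Training inputs cover [0, 2*pi] densely; the target is sin(x).
STEP = 2 * math.pi / 199
train = [(i * STEP, math.sin(i * STEP)) for i in range(200)]

inside = [0.5 + 0.5 * i for i in range(10)]    # 0.5 .. 5.0, within the covered range
outside = [7.0 + 0.5 * i for i in range(10)]   # 7.0 .. 11.5, beyond it

print(f"inside  error: {mean_abs_error(train, inside):.4f}")
print(f"outside error: {mean_abs_error(train, outside):.4f}")
```

Inside the covered range the error is small. Outside it the predictor returns the same value for every input, because the same edge points are always nearest, and it gives no signal that anything has changed.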

Hallucination

The most-cited failure mode and the most-misunderstood. Hallucination isn't the model lying. It's the predictable behaviour of a next-token optimiser that has no truth-grounding mechanism.

The training objective rewards the model for producing tokens that are statistically likely given the context. It does not reward the model for producing tokens that are true. Most of the time these align (true statements are usually statistically likely in a well-trained model) but not always. When a query lands outside the model's reliable interpolation region, the most-likely-next-tokens often form a plausible but fictional answer. Hallucinated legal citations, fabricated bibliographies, invented function names in code, confident wrong arithmetic, confidently misremembered facts. All of it is the same mechanism: prediction without grounding.
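
The gap between "likely" and "true" shows up even in a tiny count-based predictor. The sketch below assumes an invented five-sentence corpus and a fictional country, "freedonia", that never appears in it. The back-off rule is a simplification for illustration, not how an LLM works internally.

```python
from collections import Counter, defaultdict

# Invented miniature corpus; "freedonia" never appears in it.
CORPUS = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of spain is madrid",
    "the largest city of france is paris",
    "the most visited city of france is paris",
]

# Count which word follows each (previous, current) pair and each single word.
by_pair = defaultdict(Counter)
by_word = defaultdict(Counter)
for sentence in CORPUS:
    words = sentence.split()
    for prev, cur, nxt in zip(words, words[1:], words[2:]):
        by_pair[(prev, cur)][nxt] += 1
    for cur, nxt in zip(words, words[1:]):
        by_word[cur][nxt] += 1


def predict_next(prompt: str) -> str:
    """Return the most likely next word, backing off to a shorter context if needed."""
    words = prompt.split()
    pair = tuple(words[-2:])
    counts = by_pair[pair] if by_pair.get(pair) else by_word[words[-1]]
    return counts.most_common(1)[0][0]


print(predict_next("the capital of italy is"))      # context seen in training
print(predict_next("the capital of freedonia is"))  # unseen context, still answers
```

For the unseen country the predictor falls back to whatever most often follows "is" and answers with the same fluency. Nothing in the objective distinguishes the grounded answer from the invented one.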

The model exposes no calibrated uncertainty signal in its output. A well-grounded answer and a hallucination read the same from outside. The internal distributions over next tokens carry some uncertainty information, but it doesn't reliably surface in the generated text.
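
A small numerical sketch of that last point, assuming two hypothetical sets of logits over four candidate tokens. Greedy decoding emits the same token in both cases. The entropy of the distribution, which is where the uncertainty lives, never reaches the text.

```python
import math


def softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits at a single decoding step.
confident = [9.0, 1.0, 0.5, 0.2]
unsure = [2.1, 2.0, 1.9, 1.8]

for name, logits in [("confident", confident), ("unsure", unsure)]:
    probs = softmax(logits)
    chosen = probs.index(max(probs))         # greedy decoding keeps only the argmax
    print(name, "emits token", chosen, f"entropy = {entropy_bits(probs):.2f} bits")
```

Both cases emit token 0. One distribution is sharply peaked and the other is close to uniform, but the reader of the output sees no difference.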

Mitigation comes from grounding the prediction in external facts. Retrieval-augmented generation (RAG, L59) is the dominant pattern: retrieve relevant documents at query time, condition generation on them, ask the model to cite. This doesn't eliminate hallucination but reduces it sharply on factual queries. Tool use (let the model call a calculator, a search index, a database) reduces it further on tasks where ground truth is available externally.

Reasoning

Modern LLMs reason. The reasoning is pattern-based, not symbolic, and the distinction matters.

Pattern-based reasoning means the model has seen many examples of structured arguments, mathematical proofs, code derivations, and so on, and learned to produce similarly-structured outputs. For tasks whose structure resembles training data, the resulting reasoning is often correct. For tasks that require genuinely novel composition of subskills, it stumbles.

Chain-of-thought prompting (asking the model to "think step by step") improves performance because it gives the model more inference-time compute and more tokens to work with. Each generated token is fed back in as context, so intermediate results are written down where later steps can attend to them. Test-time compute scaling, where the model is given large inference budgets to explore many reasoning paths, has produced the so-called reasoning models from 2024 onward. These are still pattern-based reasoners with more time to think.
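
One simple form of test-time compute is sampling several answers and taking a majority vote. The sketch below is a back-of-envelope model under strong assumptions: samples are independent, each is correct with the same probability `p`, and all wrong samples agree with each other (a pessimistic case). `majority_correct` is a hypothetical helper, not a library function.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct); use odd n."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


for p in (0.6, 0.4):
    for n in (1, 5, 25):
        print(f"p={p} samples={n:2d} -> {majority_correct(p, n):.3f}")
```

Under these assumptions, extra samples help when the model is right more often than not and hurt when it is not. More compute amplifies the patterns the model already has; it does not supply new ones.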

The brittle part is compositionality. A model that can solve problem A and problem B reliably will often fail on a problem requiring composition of A and B in a novel way, especially if that composition wasn't well-represented in training data. This is downstream of representations: compositional structure that the embeddings encoded transfers; compositional structure they didn't encode has to be reconstructed inside the layers each time.

Robotics and grounding

The hardest region of the capability map. Modern AI is far weaker at controlling a physical robot than at producing fluent text.

The reasons trace back through the phase. Training data is sparse (every robot interaction is expensive in wall-clock time, L6). State representation has to carry partial-observability noise, sensor uncertainty, and actuation latency. The reward signal is delayed and shaped by physical consequences the model can't directly observe. The optimiser has to learn through interaction, with all the sample-inefficiency RL brings.

Simulation closes some of the gap (train at 1000× wall-clock speed in software, transfer to real) but sim-to-real is its own distribution-shift problem. Domain randomisation, real-world fine-tuning, and stronger inductive bias in state representations all help. Modern robot manipulation has improved sharply since 2020. Reliable grasping of unfamiliar objects in cluttered environments remains an open problem in 2026.

The general principle: embodied interaction with the physical world is harder than text prediction because the world doesn't come with labels, the data pipeline doesn't scale to internet sizes, and small errors compound across trajectories. Grounding is the bottleneck.
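
The compounding claim is plain arithmetic. A minimal sketch, assuming every step succeeds independently with the same probability (real trajectories are messier, and `trajectory_success` is a hypothetical name):

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


for per_step in (0.99, 0.95):
    for steps in (10, 20, 50):
        rate = trajectory_success(per_step, steps)
        print(f"per-step {per_step:.2f}, {steps:2d} steps -> {rate:.2f}")
```

A per-step reliability that sounds high still leaves long trajectories fragile. The same arithmetic applies to long agentic loops in text.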

Retrieval and external memory

The capability profile shifts substantially when the model is paired with retrieval and tool use.

RAG systems consistently outperform standalone LLMs on factual question-answering, technical reference lookup, and any task where the answer is locatable in a curated corpus. The model's role becomes synthesis and presentation; the facts come from retrieved context. Factuality improves; hallucination on retrieved-domain queries drops; the maintenance model shifts from "retrain the LLM" to "update the index".
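
The retrieval half of that pattern can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-document corpus (`DOCS`) and a stand-in "embedding" made of raw word counts. A real system would use a learned embedding model and an approximate nearest-neighbour index.

```python
import math
from collections import Counter

# Hypothetical corpus; the ids and contents are invented for this sketch.
DOCS = {
    "adr-012": "we chose postgres over mysql for the billing service",
    "adr-031": "the search service uses an approximate nearest neighbour index",
    "runbook-7": "restart the billing service after rotating database credentials",
}


def embed(text: str) -> Counter:
    """Stand-in embedding: raw word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda doc_id: cosine(q, embed(DOCS[doc_id])),
                    reverse=True)
    return ranked[:k]


print(retrieve("why did we choose postgres for billing"))
```

Adding or correcting a fact means editing `DOCS`, not retraining anything, which is the maintenance shift described above.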

Tool use extends this. A model that can call a calculator does arithmetic at machine precision. A model that can query a SQL database answers factual queries about that database with database-precision. A model that can invoke a search engine acquires current information past its training cutoff. The capability surface of the system is much larger than the capability surface of the underlying model.
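
A minimal sketch of the calculator case. The `CALC:` prefix is an invented convention for this example, not any real model's tool-calling format, and the evaluator deliberately supports only the four basic operators.

```python
import ast
import operator

# Operators the hypothetical calculator tool is willing to evaluate.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str):
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def answer(model_output: str) -> str:
    """Route 'CALC: <expr>' requests to the tool; pass other text through."""
    if model_output.startswith("CALC:"):
        return str(calculator(model_output[len("CALC:"):].strip()))
    return model_output


print(answer("CALC: 123456789 * 987654321"))
```

The model only has to decide that a calculation is needed and write the expression. The digits come from the interpreter, so they are exact regardless of how the tokeniser splits the numbers.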

Production AI systems are usually hybrids: an LLM doing language synthesis, with retrieval, tool use, structured constraints, and classical machine-learning components surrounding it. Treating the LLM as the whole system is a frame mistake.

Hardware and systems constraints

Capability is gated by what compute the system has at inference time.

The context window is the most visible constraint. In standard attention, compute grows roughly quadratically with token count and KV-cache memory grows linearly, so past a certain length further extension is uneconomic. Long-context degradation ("lost in the middle") is a known effect: even within the formal context limit, information placed in the middle of a long context is retrieved less reliably than information near the start or the end.
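
The cost of extending context can be estimated with two formulas. The sketch below assumes standard attention with a full KV cache, 16-bit values, and a hypothetical model shape (`LAYERS`, `KV_HEADS`, `HEAD_DIM`) chosen only to keep the arithmetic readable.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes to cache keys and values for one sequence (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def causal_attention_pairs(tokens: int) -> int:
    """Query-key pairs scored per head per layer under causal masking."""
    return tokens * (tokens + 1) // 2


# Hypothetical model shape, not a description of any specific model.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 1024**3
    print(f"{tokens:>7} tokens: KV cache {gib:5.1f} GiB, "
          f"{causal_attention_pairs(tokens):.2e} score pairs per head per layer")
```

Doubling the context doubles the cache and roughly quadruples the number of attention scores. This is why long context is priced as a premium feature.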

Inference latency caps interactive use. A model that takes 30 seconds to produce a response can do more reasoning per query than a model constrained to 100 ms but loses interactivity. The trade-off is per-deployment.

Test-time compute scaling has become a major lever in 2024-2026. Spending more compute at inference (more chain-of-thought tokens, more sampled solutions, more iterative refinement) directly improves capability on hard problems. The cost is wall-clock latency and dollars per query.

Memory bottlenecks limit which models can be served on which hardware. A 70B-parameter model with a 32K context window won't fit on a single consumer GPU; quantisation, smaller models, and aggressive KV-cache compression are the production responses. Hardware does not just constrain training; it shapes which capability profiles are economically reachable in deployment.
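
The weight-memory side of that claim is a one-line calculation. A minimal sketch, assuming a 24 GiB consumer card (`CONSUMER_GPU_GIB` is an assumption for illustration) and counting weights only, with no KV cache or activations:

```python
CONSUMER_GPU_GIB = 24  # assumption: memory of a high-end consumer card


def weight_gib(params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, ignoring KV cache and activations."""
    return params * bits_per_param / 8 / 1024**3


def fits(params: float, bits_per_param: int,
         budget_gib: float = CONSUMER_GPU_GIB) -> bool:
    return weight_gib(params, bits_per_param) <= budget_gib


for params, label in ((70e9, "70B"), (7e9, "7B")):
    for bits in (16, 8, 4):
        print(f"{label} at {bits:2d}-bit: {weight_gib(params, bits):6.1f} GiB, "
              f"fits in {CONSUMER_GPU_GIB} GiB: {fits(params, bits)}")
```

Under these assumptions, 4-bit quantisation alone does not bring a 70B model under the budget. This is why smaller models and cache compression appear alongside quantisation in the list of production responses.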

Failure modes

A short catalogue of how systems fail in practice. Hallucinated citations in research outputs. Confident wrong arithmetic on multi-digit problems. Plausible-but-fictional function signatures in code generation. Brittle agent planning that loses coherence after 10-20 steps. Multimodal misclassification of out-of-distribution images. Prompt-injection vulnerabilities where adversarial input changes downstream behaviour. Overconfident output on out-of-distribution medical or legal queries.

None of these are mysterious. Each maps cleanly to mechanism: optimisation against a particular objective, with a particular representation, on particular hardware, encountering inputs the training distribution did not cover.

The capability map

Figure 10.1 puts the strong, soft, and weak regions on one chart. The jagged perimeter is the central claim. The contour overlays show how adding retrieval (RAG) or test-time compute reshapes specific axes without making the whole surface uniform.

FIG 10.1. Capability radar across 10 axes. Centre = low; edge = high. The base capability polygon (amber) is irregular: strong on translation, OCR, code generation, retrieval; medium on factual recall and arithmetic; weak on planning, long-horizon coherence, grounding, robotics. Green arrows show how retrieval-augmented generation (RAG) pushes the factual-recall axis outward. Blue arrows show how test-time compute pushes planning and arithmetic outward, at the cost of latency and dollars per query. Confidence-vs-correctness colour codes mark where the model is reliable, uncertain, or likely-wrong. Hardware constraints (right panel) bound the entire surface.

The L1 to L9 view

In L1's terms, capability is what the system loop produces when run end to end. In L2's terms, modern AI excels where prediction and compression of the training distribution generalise to deployment. In L3's terms, capability sits inside the training distribution and the inductive biases that bridge to it. In L4's terms, capability emerges from scale only when the right representation makes the algorithm reachable. In L5's terms, capability depends entirely on what learning signal shaped the system. In L6's terms, long-horizon and embodied capability is hard because the signal is sparse and delayed. In L7's terms, capability is downstream of representation quality. In L8's terms, tokenisation quietly determines what's expressible and where the system stumbles. In L9's terms, retrieval is geometry, and that geometry is where reliable factuality now lives.

The takeaway

AI is neither magic nor fake. It is a set of optimisation systems whose strengths and weaknesses are downstream of objective, representation, architecture, training data, and hardware. The capability surface is jagged. The boundary between "trivial for this system" and "fails confidently" doesn't follow human intuition.

Once you can think mechanistically about why a system succeeds or fails, the field stops being surprising in either direction.

This closes Phase 1. The bench has 10 stations: an intelligence system, prediction-as-compression, generalisation, emergence, learning signals, sequential decision-making, representation, tokens, embeddings, and an honest capability map. The next phase opens the whiteboard wall and writes down the maths.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L10 Take 5 tasks: (a) translating a paragraph of journalistic English into idiomatic Japanese, (b) answering "how many R's are in 'strawberry'?", (c) writing a Python function that calls a popular HTTP library, (d) planning a multi-step robotic task to assemble an IKEA chair from photos and instructions, (e) summarising the key findings of an attached 50-page technical report. For each, predict whether a modern frontier LLM (without RAG, without tools) would succeed or fail, and explain the mechanistic reason rooted in the phase's vocabulary.

(a) Translation. Succeeds reliably. Translation between high-resource languages is heavily represented in training data (paired corpora, news, books), so the prediction objective directly rewarded producing fluent Japanese given English context. The task is interpolation within the training distribution. Embedding geometry across languages is aligned by training on parallel data. Modern LLMs do this at near-professional quality.

(b) Letter-counting in "strawberry". Fails or is unreliable. Tokenisation hides individual characters: "strawberry" is typically split into a few subword tokens, and the model has no direct access to the constituent letters. The task requires character-level operations the architecture isn't built for. The model reasons about spelling from training-data statistics, which produces inconsistent results.

(c) Python HTTP function. Succeeds reliably for popular libraries (requests, httpx) in canonical usage patterns. Code completion is well-represented in training corpora (GitHub, Stack Overflow, documentation), and idiomatic usage is exactly what the prediction objective rewarded. Fails when the library is obscure, when the task requires novel composition, or when the specific API has changed since training.

(d) Multi-step robotic IKEA assembly. Fails. The task requires (i) grounded physical interpretation of photos and instructions, (ii) long-horizon planning across many steps, (iii) embodied execution with continuous sensorimotor feedback. None align with the LLM's training objective. Even with strong vision-language understanding for the photos, translating that into a reliable physical assembly policy is current frontier robotics, not text-prediction territory.

(e) 50-page summarisation. Partial. Depends on the document fitting within the context window and the relevant content being well-distributed (not lost-in-the-middle). Summarisation aligns well with the training objective. Quality degrades for technical content outside the training distribution, for documents that exceed context limits, or where reading-comprehension precision matters more than fluent paraphrase.

The general principle: capability tracks the intersection of training-objective alignment, representation coverage, and hardware budget at inference.

L10 A team is asked to build an internal knowledge-base assistant for their engineering organisation: search across design documents, answer technical questions, retrieve relevant past decisions. The naive approach is "just plug in a frontier LLM and let it answer". Why is this naive, and what's the production answer? Explain mechanistically, with reference to hallucination, retrieval geometry, and confidence calibration.

The naive approach fails for several reasons that map directly to phase-1 mechanism. (1) The internal documents weren't in the LLM's training corpus, so any factual question about them lives outside the model's interpolation region. The model will produce confident-sounding answers extrapolating from general knowledge plus pattern completion against question structure; these will often be plausible but wrong. (2) Hallucination is most severe in this regime: out-of-distribution factual queries with no grounding. The model has no truth-grounding mechanism; the training objective rewards likely tokens, not correct ones. (3) Confidence calibration in the generated text is poor; the LLM produces equally fluent answers whether it actually knows the content or is fabricating. Engineers reading the output cannot easily distinguish.

The production answer is a hybrid system. (a) Embed the internal corpus using a quality embedding model (L9). (b) On each user query, embed the query and retrieve the top-K most relevant documents via approximate nearest-neighbour search. (c) Pass the retrieved documents into the LLM's context window with a prompt that says "answer based on these documents; cite the source; if the documents don't contain the answer, say so". (d) The LLM now does what it's good at (synthesis, language) on grounded inputs, and the factual content comes from retrieval. (e) Surface citations to the user so they can verify.

Mechanistically this works because retrieval restores the interpolation property at the system level: the LLM is now generating from a context that contains the answer, which is what the prediction objective rewarded. Hallucination drops. Confidence-vs-correctness becomes much easier to assess because the answer is anchored to specific retrieved passages. This pattern (retrieval plus LLM plus citation) is the production standard for internal knowledge systems and is downstream of L9 (embedding geometry), L10 (the capability map), and L59 (full RAG treatment).
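
Steps (c) and (e) can be sketched without any model at all. The example below is a minimal illustration: the prompt wording, the bracketed citation convention, and the sample passages are all assumptions made for this sketch. `build_prompt` and `cited_sources` are hypothetical helpers.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_id, text) pairs."""
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "Answer using only the documents below. Cite the source id in brackets. "
        "If the documents do not contain the answer, say so.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def cited_sources(answer: str, passages: list[tuple[str, str]]) -> list[str]:
    """Return the retrieved source ids that the answer actually cites."""
    return [source_id for source_id, _ in passages if f"[{source_id}]" in answer]


# Hypothetical retrieved passages and model answers.
passages = [
    ("adr-012", "We chose Postgres over MySQL for the billing service."),
    ("runbook-7", "Restart the billing service after rotating credentials."),
]
print(build_prompt("Which database does billing use?", passages))
print(cited_sources("Billing uses Postgres [adr-012].", passages))
print(cited_sources("Billing probably uses MySQL.", passages))
```

An answer that cites nothing retrieved is a cheap signal to flag for review. This is one practical way the anchoring to passages makes confidence-vs-correctness easier to assess.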

↳ L11 (Forward interleave to Phase 2 / L11, vectors.) Phase 1 closes here with the honest capability map. Phase 2 opens at the whiteboard wall with the maths. Lesson 11 is vectors. From everything you've seen across Phase 1 (especially embeddings, geometry, attention-as-similarity, and the meta-claim that the model computes through geometry), what's the natural next question about vectors that L11 needs to answer, and why does it matter for understanding modern architectures? Why is this the right place to open Phase 2 rather than starting with, say, calculus or probability?

The natural next questions about vectors are: what exactly is a vector, what operations on vectors carry meaning, how does direction differ from magnitude, what does "distance" actually mean in a high-dimensional space, and how do projections work. These matter because every downstream piece of modern AI architecture is built on top of vector operations. Attention computes similarities between vectors. MLPs project vectors into different spaces. Embedding tables produce vectors. Loss functions measure distances between predicted and target vectors. The whole architecture is, at the implementation level, large blocks of matrix multiplications (matmuls). So vectors are the substrate that everything else lives on. Calculus matters too (gradients, the chain rule, backprop) but it's downstream: before you can differentiate anything you need to know what is being differentiated, and that is the loss as a function of vectors and matrices. Probability matters (softmax, sampling, entropy) but it's a tool applied to vectors and to distributions over them. Starting with vectors is the right move because every other piece of phase-2 maths attaches to them. By the end of phase 2 you'll have the minimum maths to read the rest of the course without bluffing, and it will all hang off the geometric intuition that L9 already started: nearby vectors mean similar things, directions encode operations, projections move between spaces. The compass, the mirror, the toolbox, and the bench all stop being metaphors and start being maths.
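
As a preview of those operations, here is a minimal pure-Python sketch of dot product, magnitude, cosine similarity, and projection. The function names and example vectors are chosen for illustration; L11's own notation may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (length) of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Similarity of direction, ignoring magnitude."""
    return dot(a, b) / (norm(a) * norm(b))


def project(a, b):
    """Component of a that lies along the direction of b."""
    scale = dot(a, b) / dot(b, b)
    return [scale * x for x in b]


a, b = [3.0, 4.0], [6.0, 8.0]          # same direction, different magnitude
print(norm(a), norm(b), cosine(a, b))
print(project([3.0, 4.0], [1.0, 0.0]))  # keep only the part along the first axis
```

The two example vectors differ in length but point the same way, so their cosine similarity is at its maximum. That is the direction-versus-magnitude distinction the questions above point at.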

Next station

The bench is complete. Lesson 11 walks across the workshop to the whiteboard wall (station 11) and writes vectors, where Phase 1's geometric intuition becomes Phase 2's mathematical apparatus.
