PHASE 1 · FOUNDATIONS OF INTELLIGENCE
01 / 78

What is an intelligence system?

Lesson 1. Phase 1: Foundations of intelligence. ~20 min read + cards + retrieval. Durability tier 1 (bedrock).

💡
Memory palace · Bench · station 1
The reading lamp. The lamp turns on, you sit down, you start thinking before you build.
Core idea. An intelligence system is a computational machine that maps input to output through a loop of representation, optimisation, and feedback, and most of modern AI is variations on that single skeleton.

Why this lesson exists

Public conversation treats AI as one thing. It isn't. A search ranker, a speech recogniser, a chess engine, a robot policy, and a transformer all get called "AI" but they share almost nothing about their internals, costs, or failure modes.

If you want to reason about AI as an engineer, you need a frame that separates them along the axes that matter: what they take in, what they represent inside, what signal they optimise against, what they hand back. Without that frame, every new headline lands as either magic or noise. This lesson installs the frame.

The shape of the field

AI systems rank billions of videos on YouTube, translate speech across 100+ languages in real time, write production code, drive cars in narrow operational domains, fold proteins from amino acid sequences, recommend most of what you watch and buy, and generate convincing imitations of human writing. These are radically different machines that share a label and almost nothing else. Public conversation lumps them all under "AI", which makes the field seem mystical when it's just a category too coarse for engineering thinking.

A "system," in the sense this course uses, is a box that does something with its inputs. A non-AI computer system runs a program someone wrote: the behaviour is the programmer's. Logic gates do exactly what the schematic says.

An intelligence system has a programmed skeleton, but most of its behaviour lives in a set of internal parameters that were tuned by data or experience, not by a person typing rules. The programmer wrote the loop. The data wrote what runs inside it.

The loop

The loop is consistent across the family. Five stages:

  1. Input. Pixels, words, audio, sensor readings, click logs, board states. Whatever the system gets to see.
  2. Representation. The system builds an internal encoding of that input. A vector, a tensor, a graph, a tree. Much of the interesting behaviour emerges here.
  3. Optimisation. The system has an objective. Predict the next token. Maximise reward. Minimise reconstruction error. Match a label. Internal parameters get nudged so the objective improves.
  4. Output. A prediction, a ranking, a generated artefact, an action.
  5. Feedback. The output meets the world, generates new data, and the loop closes.

Recommendation systems do this. Classifiers do this. Reinforcement-learning agents do this. Transformers do this. Robots do this.

They share the skeleton. They differ in what fills each box.
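The five stages above can be sketched in a few lines. This is a minimal illustration, assuming a toy linear model fit by gradient descent on squared error; the names `represent`, `predict`, `run_loop` and the data are hypothetical and not part of the course material. Note that in code the output is produced before the optimisation step, because the nudge depends on the feedback.

```python
# A toy instance of the five-stage loop: a tiny linear model learning y = 2x + 1.
# All names and the data here are illustrative assumptions.

def represent(x):
    """Stage 2: turn a raw input into an internal encoding (x plus a bias term)."""
    return [x, 1.0]

def predict(params, features):
    """Stage 4: produce an output from the representation and current parameters."""
    return sum(p * f for p, f in zip(params, features))

def run_loop(data, steps=2000, lr=0.05):
    params = [0.0, 0.0]                      # the behaviour lives here, tuned by data
    for step in range(steps):
        x, target = data[step % len(data)]   # Stage 1: input
        features = represent(x)              # Stage 2: representation
        output = predict(params, features)   # Stage 4: output
        error = output - target              # Stage 5: feedback on that output
        # Stage 3: optimisation. Nudge each parameter down the gradient of
        # the squared error 0.5 * error**2.
        params = [p - lr * error * f for p, f in zip(params, features)]
    return params

data = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
weight, bias = run_loop(data)
print(round(weight, 2), round(bias, 2))      # should land close to 2 and 1
```

The programmer wrote `run_loop`. The values that end up in `params` came from the data.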

Same skeleton, different contents

Figure 1.1 shows the loop for 4 different intelligence systems side by side. The column headers (input, representation, optimisation, output, feedback) are the same across all rows. What differs is the cell contents.

SYSTEM | INPUT | REPRESENTATION | OPTIMISATION | OUTPUT | FEEDBACK
YouTube recommender | user history vectors | user + item embeddings (~200d) | watch-time / click-through | ranked list | user actions on the ranking
Image classifier | pixel array (H × W × 3) | stacked conv features → logits | cross-entropy vs. label | class probability | label correctness
AlphaZero | board state | value net + search tree | self-play win rate | move | game result
Transformer LLM | tokens | layered attention context vectors | next-token prediction loss | probability over vocabulary | sampled tokens fed back as input
FIG 1.1. Four intelligence systems, same skeleton. The column headers (input, representation, optimisation, output, feedback) line up across all rows; what differs is the cell contents. Robotics is a fifth example with the same shape: joint angles + camera frames in, state vector representation, expected-return optimisation, torque commands out, reward signal back. The diagram makes one claim: these are the same loop with different contents.

A YouTube recommender takes user history as input, represents users and items as embeddings in maybe 200 dimensions, optimises some watch-time or click-through proxy, outputs a ranked list, and gets feedback from what the user actually does. An image classifier takes pixel arrays, builds stacked convolutional features, optimises cross-entropy against labels, outputs a class probability, and gets feedback from label correctness. AlphaZero takes a board state, represents it through a value network plus a search tree, optimises against self-play win rate, outputs a move, and gets feedback from the game result. A transformer LLM takes tokens, builds context representations through layered attention, optimises next-token prediction, outputs a probability distribution over the vocabulary, and at generation time feeds its sampled tokens back in as input. A robot policy takes joint angles and camera frames, represents them as a state vector, optimises expected return, outputs a torque command, and gets feedback from a reward signal.

Same skeleton. Different contents.
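One way to hold the comparison in your head is as a single record type with five fields, filled in differently per system. The sketch below is an assumption about how you might write that down; `LoopSpec`, `SYSTEMS`, and `STAGES` are hypothetical names, and the cell contents are copied from Figure 1.1 and its caption.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class LoopSpec:
    """What fills each of the five boxes for one system (hypothetical type)."""
    input: str
    representation: str
    optimisation: str
    output: str
    feedback: str

SYSTEMS = {
    "YouTube recommender": LoopSpec(
        input="user history vectors",
        representation="user + item embeddings (~200d)",
        optimisation="watch-time / click-through proxy",
        output="ranked list",
        feedback="user actions on the ranking",
    ),
    "Image classifier": LoopSpec(
        input="pixel array (H × W × 3)",
        representation="stacked conv features → logits",
        optimisation="cross-entropy vs. label",
        output="class probability",
        feedback="label correctness",
    ),
    "AlphaZero": LoopSpec(
        input="board state",
        representation="value net + search tree",
        optimisation="self-play win rate",
        output="move",
        feedback="game result",
    ),
    "Transformer LLM": LoopSpec(
        input="tokens",
        representation="layered attention context vectors",
        optimisation="next-token prediction loss",
        output="probability over vocabulary",
        feedback="sampled tokens fed back as input",
    ),
    "Robot policy": LoopSpec(
        input="joint angles + camera frames",
        representation="state vector",
        optimisation="expected return",
        output="torque commands",
        feedback="reward signal",
    ),
}

# Every system exposes the same five boxes, in the same order.
STAGES = [f.name for f in fields(LoopSpec)]

for name, spec in SYSTEMS.items():
    print(name)
    for stage in STAGES:
        print(f"  {stage:<15}{getattr(spec, stage)}")
```

The type is the skeleton. The instances are the contents.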

From programmed rules to learned representations

The shift that produced modern AI is the move from programmed rules to learned representations. A 1990s expert system encoded rules a human wrote (if temperature > 80 and pressure rising then alert). Adding a rule meant a human typing it. The systems didn't scale because the combinatorics defeated their authors. Humans can write 50,000 rules for tax law. Humans cannot write the rules for "what's the next likely token in any English text," and they certainly cannot maintain them as language drifts.

So the field built systems that find their own rules by running data through an objective function and a gradient descent loop. The parameter tensor encodes the rules. Nobody wrote them by hand. You could inspect the tensor values in principle, but you wouldn't know what they encode without further work; that's the field of interpretability, much later in the course.
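The contrast can be made concrete with a toy. The sketch below assumes a simplified, temperature-only version of the alert rule quoted above, a handful of made-up labelled readings, and a one-feature logistic regression trained by gradient descent. None of it comes from a real expert system; `rule_alert`, `learned_alert`, and `readings` are hypothetical.

```python
import math

def rule_alert(temperature):
    """Programmed rule: a human typed the threshold."""
    return temperature > 80

# Hypothetical labelled readings: (temperature, was an alert raised?)
readings = [(60, 0), (65, 0), (70, 0), (75, 0), (85, 1), (90, 1), (95, 1), (100, 1)]

# Standardise the feature so gradient descent is well conditioned.
temps = [t for t, _ in readings]
mean = sum(temps) / len(temps)
std = math.sqrt(sum((t - mean) ** 2 for t in temps) / len(temps))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(readings, steps=500, lr=0.5):
    """Fit a weight and bias by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(readings)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for t, label in readings:
            x = (t - mean) / std
            err = sigmoid(w * x + b) - label   # d(log-loss)/d(logit)
            grad_w += err * x / n
            grad_b += err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train(readings)

def learned_alert(temperature):
    """Learned rule: the threshold is implied by w and b, which nobody typed."""
    return w * (temperature - mean) / std + b > 0

# The temperature at which the learned model switches from "no alert" to "alert".
learned_threshold = mean - b * std / w
print(round(learned_threshold, 1))
```

On this data the learned model agrees with the typed rule on every reading, but the threshold is stored as two numbers fitted from examples. With one feature you can still read the rule off the parameters. With billions of parameters you cannot, which is the gap interpretability tries to close.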

Hardware, from lesson 1

Hardware enters the story immediately, not as a postscript. A modern intelligence system's behaviour lives in its parameter tensors. Training one involves trillions of multiplications on those tensors, repeatedly. The silicon that can do that quickly was developed for video game graphics. Modern AI exists because, by accident of history, graphics processors turned out to be orders of magnitude faster than CPUs for the specific arithmetic modern AI depends on.

Concrete numbers. A GPT-3 sized model has ~175 billion parameters. At fp16 (2 bytes per parameter), that's ~350 GB just to store the weights. A single H100 GPU has 80 GB of VRAM. So a model that size needs at least 5 H100s in parallel just to hold its weights, and the original training run cost roughly $5M in compute (2020 numbers). Every number in that paragraph is a hardware constraint shaping what's possible to build.
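The arithmetic is worth doing once by hand. A minimal sketch, assuming decimal gigabytes and counting weights only; training also needs memory for gradients, optimiser state, and activations, so real requirements are higher. The variable names are illustrative.

```python
import math

params = 175e9          # ~175 billion parameters (GPT-3 scale)
bytes_per_param = 2     # fp16 stores each parameter in 2 bytes
gpu_memory_gb = 80      # one H100, per the figure quoted in the text

weights_gb = params * bytes_per_param / 1e9           # decimal gigabytes
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)   # round up: no fractional GPUs

print(f"weights: {weights_gb:.0f} GB")                 # weights: 350 GB
print(f"GPUs just to hold the weights: {gpus_needed}") # GPUs just to hold the weights: 5
```

Change `bytes_per_param` to 4 (fp32) or 1 (8-bit) and the GPU count moves with it. Precision is a hardware decision too.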

That constraint pulls on the architecture, not just the cost. The shapes that fit GPUs (lots of matmul, parallelisable, predictable memory access) are the shapes that get built. Shapes that don't fit, like long sequential reasoning chains, get less attention because they don't ride the same hardware wave. You'll see this constraint pulling the field's shape throughout the course.
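The difference between a GPU-friendly shape and an unfriendly one comes down to data dependencies. The sketch below is an illustration under simple assumptions, not a benchmark, and it runs nothing in parallel. It only shows that the output rows of a matrix multiply can be computed in any order, while a recurrent chain cannot. `matmul_in_order` and `recurrent_chain` are hypothetical helpers.

```python
import math
import random

def matmul_in_order(A, B, row_order):
    """Compute A @ B one output row at a time, visiting rows in `row_order`."""
    cols = list(zip(*B))
    out = [None] * len(A)
    for i in row_order:
        # Row i depends only on A[i] and B, never on another output row.
        out[i] = [sum(a * b for a, b in zip(A[i], col)) for col in cols]
    return out

def recurrent_chain(xs, w=0.5):
    """h_t = tanh(w * h_{t-1} + x_t): step t cannot start until step t-1 is done."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)
        states.append(h)
    return states

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0], [0, 1]]

rows = list(range(len(A)))
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# Independent rows: any visiting order gives the same product, so the rows
# could be handed to different cores at once.
same = matmul_in_order(A, B, rows) == matmul_in_order(A, B, shuffled)
print(same)

# Dependent steps: each state feeds the next, so the work is inherently serial.
print(len(recurrent_chain([1.0, 2.0, 3.0])))
```

Independence between pieces of work is what lets thousands of GPU cores stay busy. A chain of dependent steps leaves most of them idle.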

The long arc

Today's models sit in the seam where compute, memory, and energy budgets permit them. As all three change, the shape of what's buildable changes too. If photonic compute scales out of labs, matmul might get an order of magnitude cheaper, and architectures we've ruled out today become viable. If neuromorphic chips deliver sparse low-power inference, robotics and edge AI step into territory that battery limits currently hold them out of. None of that is guaranteed. But the field's history is a sequence of hardware shifts redrawing the map roughly every decade.

The right response to this is wonder. The mechanism (parameters tuned by gradients on matmul-friendly silicon) produced behaviour nobody predicted at design time. The wrong response is mystification. The behaviour came from a cause we can name: optimisation against an objective, on hardware that made the objective tractable to optimise. The mechanism is the explanation; the result is the wonder.

Modern AI accumulated, slowly. Parameter-rich models in the 1980s; cheap graphics hardware through the 1990s and 2000s; internet-scale datasets in the 2000s and 2010s; the transformer in 2017; the post-training stack that turned raw pretrained models into useful systems from 2018 onward. Each step answered a constraint the previous step ran into. Nothing about the field appeared from nowhere.

Hold the frame

Input → representation → optimisation → output → feedback, with parameters tuned by data, on hardware that makes the tuning possible. Every later lesson fills in the boxes for a specific kind of system: different representations, different optimisation loops, different hardware constraints, different failure modes, same skeleton.

The outline of the whole field is now in front of you. The rest of the course adds resolution.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L1 Pick 3 different intelligence systems (e.g., a recommender, a classifier, an RL agent). For each, describe what specifically fills the input, representation, optimisation, output, and feedback boxes. Be specific about what the optimisation signal actually is in each case.
A YouTube recommender: input is user-history vectors, representation is learned embeddings for users and items in ~200 dimensions, optimisation signal is a watch-time or click-through proxy, output is a ranked list, feedback is what the user does with the ranking. An image classifier: input is pixel arrays, representation is stacked convolutional features collapsed into a logits vector, optimisation signal is cross-entropy against ground-truth labels, output is a class probability, feedback is label correctness. AlphaZero: input is a board state, representation is a hybrid of value-network features and search-tree statistics, optimisation signal is win or loss from self-play (propagated through the moves that produced it), output is a move, feedback is the eventual game result. All 3 share the same loop. What fills the boxes is what makes them different systems.
L1 Why is it accurate to say modern AI "is downstream of hardware"? Give one concrete example where a hardware constraint shaped which architectural family won.
Modern intelligence systems live in parameter tensors trained by trillions of arithmetic operations on those tensors. The hardware that can do those operations cheaply is GPUs (originally designed for graphics) and their AI-specialised successors. That fact selects for architectures expressible as large matrix multiplies, which is exactly what transformers are. Computations that don't parallelise into matmul form (e.g., very long serial reasoning chains) get less attention because they don't ride the same hardware wave. Concrete example: transformers displaced recurrent networks for sequence modelling after 2017 partly because self-attention processes every position in a sequence at once as large matrix multiplies, whereas a recurrent network has to step through the sequence one position at a time. The architecture choice was downstream of what the silicon does best.
↳ L2 (Forward interleave, since L1 has no earlier lesson to pull back to.) An intelligence system has to "find regularities" in its input data to do anything useful. In your own words, why might it be the case that good prediction and good compression are closely related?
To predict the next thing in a sequence, you need to know which things are likely and which aren't, given the context. Knowing that is the same as knowing which things you could encode in fewer bits, because the likely things deserve shorter codes and the unlikely things get longer ones. A system that can predict well has built an internal model of what's regular and what's surprising in its input, which is exactly what a compressor needs. The link is not coincidence; it's the same problem dressed two ways. L2 picks this up and runs with it.
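The answer above can be checked with a few lines of arithmetic. A minimal sketch, assuming the ideal (Shannon) code length of -log2 p bits per symbol and a toy string. The "model" here is just symbol frequencies fitted to the same text it is scored on, which is a simplification: a real compressor would also have to pay for sending the model. All names are illustrative.

```python
import math

def code_length_bits(sequence, prob):
    """Ideal code length: each symbol costs -log2 p(symbol) bits."""
    return sum(-math.log2(prob(s)) for s in sequence)

text = "aaaaaaab" * 4            # a toy, highly regular input
alphabet = sorted(set(text))

def uniform(symbol):
    """No prediction at all: every symbol is equally likely."""
    return 1.0 / len(alphabet)

counts = {s: text.count(s) for s in alphabet}

def frequency_model(symbol):
    """A crude predictor: likely symbols get high probability."""
    return counts[symbol] / len(text)

bits_uniform = code_length_bits(text, uniform)
bits_model = code_length_bits(text, frequency_model)

print(f"no prediction:   {bits_uniform:.1f} bits")
print(f"with prediction: {bits_model:.1f} bits")
```

The better predictor assigns higher probability to what actually occurs, so the same text costs fewer bits. Better prediction and shorter codes are the same quantity measured two ways.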

Next station

Lesson 2: Pattern, prediction, compression. The next station along the bench is the blank page. The connection underneath every intelligence system's representation work is that predicting well and compressing well are the same problem.