Public conversation treats AI as one thing. It isn't. A search ranker, a speech recogniser, a chess engine, a robot policy, and a transformer all get called "AI" but they share almost nothing about their internals, costs, or failure modes.
If you want to reason about AI as an engineer, you need a frame that separates them along the axes that matter: what they take in, what they represent inside, what signal they optimise against, what they hand back. Without that frame, every new headline lands as either magic or noise. This lesson installs the frame.
AI systems rank billions of videos on YouTube, translate speech across 100+ languages in real time, write production code, drive cars in narrow operational domains, fold proteins from amino acid sequences, recommend most of what you watch and buy, and generate convincing imitations of human writing. These are radically different machines that share a label and little else. Lumping them under one word makes the field seem mystical when the word is simply too coarse a category for engineering thinking.
A "system," in the sense this course uses, is a box that does something with its inputs. A non-AI computer system runs a program someone wrote: the behaviour is the programmer's. Logic gates do exactly what the schematic says.
An intelligence system has a programmed skeleton, but most of its behaviour lives in a set of internal parameters that were tuned by data or experience, not by a person typing rules. The programmer wrote the loop. The data wrote what runs inside it.
The loop is consistent across the family. Five stages: input, representation, optimisation, output, feedback.
Recommendation systems do this. Classifiers do this. Reinforcement-learning agents do this. Transformers do this. Robots do this.
They share the skeleton. They differ in what fills each box.
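A minimal sketch of the skeleton, assuming the simplest possible contents for each box: a single parameter `w`, an identity representation, and squared error as the objective. The function name `run_loop` and the toy data are hypothetical and chosen only for illustration. Note that the stages run in a slightly different order in code than in the list: the optimisation step can only happen once feedback has arrived.

```python
import random

def run_loop(examples, steps=200, lr=0.05, seed=0):
    """Fit y ~ w * x using the five-stage loop; returns the learned w."""
    rng = random.Random(seed)
    w = 0.0  # the one tunable parameter: the programmer never sets its final value
    for _ in range(steps):
        x, y = rng.choice(examples)   # input: one observation
        feature = x                   # representation: identity in this toy case
        pred = w * feature            # output: the system's answer
        error = pred - y              # feedback: compare the answer with what happened
        grad = 2 * error * feature    # optimisation: gradient of the squared error
        w -= lr * grad                # nudge the parameter against the gradient
    return w

# Toy data generated by y = 3x. Nothing in run_loop mentions the number 3.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = run_loop(examples)
print(f"learned w = {w:.6f}")
```

The programmer wrote the loop; the value that `w` settles on came from the data. Swapping in a different representation, objective, or output changes what fills each box without changing the shape of the loop.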
Figure 1.1 shows the loop for 4 different intelligence systems side by side. The column headers (input, representation, optimisation, output, feedback) are the same across all rows. What differs is the cell contents.
A YouTube recommender takes user history as input, represents users and items as embeddings in maybe 200 dimensions, optimises some watch-time or click-through proxy, outputs a ranked list, and gets feedback from what the user actually does. An image classifier takes pixel arrays, builds stacked convolutional features, optimises cross-entropy against labels, outputs class probabilities, and gets feedback from label correctness. AlphaZero takes a board state, represents it through a policy-and-value network plus a search tree, optimises against self-play win rate, outputs a move, and gets feedback from the game result. A transformer LLM takes tokens, builds context representations through layered attention, optimises next-token prediction, outputs a probability distribution over the vocabulary, and gets feedback from the token that actually came next in the training text. A robot policy takes joint angles and camera frames, represents them as a state vector, optimises expected return, outputs a torque command, and gets feedback from the reward that follows.
Same skeleton. Different contents.
The shift that produced modern AI is the move from programmed rules to learned representations. A 1990s expert system encoded rules a human wrote (if temperature > 80 and pressure rising then alert). Adding a rule meant a human typing it. The systems didn't scale because the combinatorics defeated their authors. Humans can write 50,000 rules for tax law. Humans cannot write the rules for "what's the next likely token in any English text," and they certainly cannot maintain them as language drifts.
So the field built systems that find their own rules by running data through an objective function and a gradient descent loop. The parameter tensor encodes the rules. Nobody wrote them by hand. You could inspect the tensor values in principle, but you wouldn't know what they encode without further work; that's the field of interpretability, much later in the course.
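To make the contrast concrete, here is a hedged sketch that puts a hand-written rule next to a learned one. It simplifies the expert-system example to a single feature (temperature only, no pressure), uses toy data labelled by the hand-written rule itself, and fits a one-feature logistic regression by plain gradient descent. The names `hand_written_rule`, `learned_rule`, and the constants `scale` and `lr` are illustrative assumptions, not anything prescribed by the course.

```python
import math

def hand_written_rule(temp):
    # Expert-system style: a person typed the threshold.
    return temp > 80

# Toy data labelled by the rule above (a real system would learn from logs).
temps = list(range(60, 102, 2))
labels = [1.0 if hand_written_rule(t) else 0.0 for t in temps]

# Routine preprocessing: centre on the data mean and rescale.
mean = sum(temps) / len(temps)
scale = 10.0  # assumed constant, chosen only to keep the inputs small
xs = [(t - mean) / scale for t in temps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two parameters, tuned by gradient descent on the cross-entropy objective.
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, labels):
        p = sigmoid(w * x + b)      # predicted probability of "alert"
        grad_w += (p - y) * x       # cross-entropy gradient with respect to w
        grad_b += (p - y)           # cross-entropy gradient with respect to b
    w -= lr * grad_w / len(xs)
    b -= lr * grad_b / len(xs)

def learned_rule(temp):
    return sigmoid(w * (temp - mean) / scale + b) > 0.5

accuracy = sum(learned_rule(t) == hand_written_rule(t) for t in temps) / len(temps)
learned_threshold = mean - scale * b / w  # temperature at which p crosses 0.5
print(f"accuracy on toy data: {accuracy:.2f}")
print(f"learned threshold: {learned_threshold:.1f}")
```

The pair `(w, b)` now encodes the rule, but nothing in the two numbers says "80 degrees" on its face. Recovering the threshold took a small calculation, which is the two-parameter version of the interpretability problem.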
Hardware enters the story immediately, not as a postscript. A modern intelligence system's behaviour lives in its parameter tensors. Training one involves trillions of multiplications on those tensors, repeatedly. The silicon that can do that quickly was developed for video game graphics. Modern AI exists because, by accident of history, graphics processors turned out to be orders of magnitude faster than CPUs for the specific arithmetic modern AI depends on.
Concrete numbers. A GPT-3 sized model has ~175 billion parameters. At fp16 (2 bytes per parameter), that's ~350 GB just to store the weights. A single H100 GPU has 80 GB of VRAM. So a model that size needs at least 5 H100s in parallel just to hold its weights, and the original training run cost roughly $5M in compute (2020 numbers). Every number in this paragraph is a hardware constraint shaping what's possible to build.
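The arithmetic behind those figures, as a short sketch. It assumes decimal gigabytes (10^9 bytes) and counts weights only; the variable names are illustrative.

```python
import math

params = 175 * 10**9       # ~175 billion parameters
bytes_per_param = 2        # fp16 stores each parameter in 2 bytes
gpu_memory_gb = 80         # VRAM on a single H100

# Memory needed for the weights alone, in decimal gigabytes.
weights_gb = params * bytes_per_param / 10**9

# Smallest number of GPUs whose combined memory holds the weights.
min_gpus = math.ceil(weights_gb / gpu_memory_gb)

print(f"weights: {weights_gb:.0f} GB, minimum GPUs: {min_gpus}")
```

This is a floor, not a deployment plan. Activations need memory at inference time, and training also has to store gradients and optimiser state, so real configurations use more devices than the minimum computed here.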
That constraint pulls on the architecture, not just the cost. The shapes that fit GPUs (lots of matmul, parallelisable, predictable memory access) are the shapes that get built. Shapes that don't fit, like long sequential reasoning chains, get less attention because they don't ride the same hardware wave. You'll see this constraint pulling the field's shape throughout the course.
Today's models sit in the seam where compute, memory, and energy budgets permit them. As all three change, the shape of what's buildable changes too. If photonic compute scales out of labs, matmul might get an order of magnitude cheaper, and architectures we've ruled out today become viable. If neuromorphic chips deliver sparse low-power inference, robotics and edge AI step into territory that battery limits currently hold them out of. None of that is guaranteed. But the field's history is a sequence of hardware shifts redrawing the map roughly every decade.
The right response to this is wonder. The mechanism (parameters tuned by gradients on matmul-friendly silicon) produced behaviour nobody predicted at design time. The wrong response is mystification. The behaviour came from a cause we can name: optimisation against an objective, on hardware that made the objective tractable to optimise. The mechanism is the explanation; the result is the wonder.
Modern AI accumulated, slowly. Parameter-rich models in the 1980s; cheap graphics hardware through the 1990s and 2000s; internet-scale datasets in the 2000s and 2010s; the transformer in 2017; the post-training stack that turned raw pretrained models into useful systems from 2018 onward. Each step answered a constraint the previous step ran into. Nothing about the field appeared from nowhere.
Input → representation → optimisation → output → feedback, with parameters tuned by data, on hardware that makes the tuning possible. Every later lesson fills in the boxes for a specific kind of system: different representations, different optimisation loops, different hardware constraints, different failure modes, same skeleton.
The outline of the whole field is now in front of you. The rest of the course adds resolution.
Lesson 2: Pattern, prediction, compression. The next station along the bench is the blank page. The connection underneath every intelligence system's representation work is that predicting well and compressing well are the same problem.