Most of the lessons so far framed AI as static prediction: input X, output Y, measure error, update. That frame is enough for image classification, language modelling, and recommender baselines. It breaks the moment a system's outputs change the inputs it will see next.
A robot's policy decides whether to step left or right; the next camera frame depends on that decision. A trading agent's order moves the price; the next observed price depends on the order it placed. A dialogue system's reply shapes the user's next message. In every case the system is no longer a passive function of the world; it is a closed-loop participant in it.
Static prediction can't optimise that directly. The training data the optimiser would need (here's a state, here's the right action) doesn't exist in advance, because the right action depends on the future the action itself produces. That's the engineering corner reinforcement learning was built to handle.
Five terms do the load-bearing work in RL: state, action, reward, environment, policy. A sixth, trajectory, is what they produce when run through time.
A state is whatever the agent needs to know to make its next decision. For a chess agent, the board position. For a robot, joint angles, sensor readings, and recent history. For a dialogue agent, the conversation so far plus internal context. State quality determines policy quality; if the representation omits information that matters, no amount of learning recovers it.
An action is a choice the agent makes given a state (move pawn to e4, turn motor by 0.3 radians, generate token "the"). A reward is a scalar the environment emits in response, usually sparsely. A Go agent's reward is win/lose at the end of a game (one bit). An environment is whatever produces the next state and reward given the current state and action: a simulator, a game, a real robot in a lab, a market, a human in the loop.
A policy is a function from state to action (or to a distribution over actions). It is the compressed behavioural strategy the agent has learned. Train an RL system long enough and what you ship is the policy.
A trajectory is the sequence of state-action-reward tuples produced when a policy interacts with an environment for some number of steps. Trajectories are what the optimiser actually sees and learns from.
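The six terms above can be made concrete in a few lines. What follows is a minimal sketch, not a reference implementation: the five-cell corridor, the names `env_step`, `random_policy`, and `rollout`, and the reward of 1.0 at the goal are all assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Tuple

GOAL = 4  # hypothetical corridor with cells 0..4; the goal is the last cell

def env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Environment: maps (state, action) to (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)  # walls clamp the position
    done = next_state == GOAL
    reward = 1.0 if done else 0.0  # sparse reward: only on reaching the goal
    return next_state, reward, done

def random_policy(state: int, rng: random.Random) -> int:
    """Policy: maps a state to an action (-1 = left, +1 = right)."""
    return rng.choice([-1, 1])

def rollout(policy: Callable[[int, random.Random], int],
            max_steps: int = 50, seed: int = 0) -> List[Tuple[int, int, float]]:
    """Run the policy in the environment and record (state, action, reward) tuples."""
    rng = random.Random(seed)
    state = 0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

trajectory = rollout(random_policy)
```

Swapping `random_policy` for a different function changes which trajectory comes out without touching the environment, which is the separation the definitions describe.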
In static prediction, training examples are independently drawn. In RL, training examples come from trajectories the current policy generated. The data distribution depends on the policy. Change the policy, get different data.
This is the closed loop that makes RL its own beast. In L1's system loop, RL is the case where the input the system sees on step t+1 is partially produced by the output it produced on step t. The feedback is no longer a thumbs-up on a static answer; it is a thumbs-up on a sequence, evaluated only at the end (or at sparse intermediate points).
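A small experiment makes the dependence of data on policy visible. This is a sketch under assumed conditions: a ten-cell line, and a hypothetical policy parameterised only by its probability of stepping right. The function `visited_states` is an illustrative name, not part of any library.

```python
import random
from collections import Counter

def visited_states(p_right: float, steps: int = 1000, seed: int = 0) -> Counter:
    """Count how often each cell 0..9 is visited by a policy that steps right with prob p_right."""
    rng = random.Random(seed)
    state = 0
    counts = Counter()
    for _ in range(steps):
        counts[state] += 1
        action = 1 if rng.random() < p_right else -1
        state = min(max(state + action, 0), 9)  # walls clamp the position
    return counts

# Same environment, two policies, two very different sets of training states.
mostly_left = visited_states(p_right=0.2)
mostly_right = visited_states(p_right=0.8)
```

The environment is identical in both runs. Only the policy changed, and with it the states the optimiser would get to learn from.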
When a Go game ends after 200 moves with a win, which moves were responsible? The first one? The last twenty? Some specific tactical sequence in the middle? The reward signal (one bit at the end) gives no per-move breakdown. The optimiser has to spread credit and blame across the trajectory.
This is the credit-assignment problem from L5, treated head-on. It is what makes RL harder than supervised learning at a fundamental level. In supervised learning, the gradient for each example tells you exactly which weights to nudge. In RL, you get a delayed scalar and have to attribute it backwards across hundreds of decisions.
The algorithmic machinery of RL (value functions, advantage estimates, policy gradients with discount factors) exists almost entirely to solve credit assignment. We formalise those later. Here just hold the intuition: the further apart in time the action and the reward, the harder the credit signal is to attribute. Long-horizon problems are not just slower; they are qualitatively different.
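One of the simplest pieces of that machinery is the discounted return, which pushes a late reward backwards through the trajectory with geometrically shrinking weight. The sketch below assumes a discount factor `gamma` of 0.99 and a 200-move game with a single terminal reward; both numbers are illustrative, and discounting is only one heuristic for spreading credit, not a full solution to it.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step t, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 200-move game whose only reward is 1.0 on the final move.
rewards = [0.0] * 199 + [1.0]
returns = discounted_returns(rewards, gamma=0.99)
```

Here the final move receives the full return of 1.0, while the first move receives 0.99 raised to the power 199, roughly 0.135. Every move gets some credit, but the signal reaching early decisions is faint and says nothing about which of them actually mattered.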
A supervised learner is shown what to learn from. An RL agent has to discover what to learn from by trying things. This creates a trade-off with no closed-form answer.
Exploit your current best policy and you collect reliable reward but never find a better one. Explore widely and you collect bad reward while you search. An agent that explores too much never converges. One that exploits too early settles into a poor local optimum.
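The failure of pure exploitation shows up even in the smallest setting. The sketch below uses epsilon-greedy on a two-armed bandit, a technique brought in here purely for illustration and not one the lesson prescribes; the arm probabilities, the step count, and the name `run_bandit` are assumptions.

```python
import random
from typing import List

def run_bandit(epsilon: float, true_means: List[float],
               steps: int = 5000, seed: int = 0) -> float:
    """Epsilon-greedy on a Bernoulli bandit; returns the average reward per step."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore: pick any arm
        else:
            # exploit: pick the arm with the best estimate (ties go to the lowest index)
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / steps

arms = [0.2, 0.8]                         # hypothetical: the second arm is far better
greedy = run_bandit(0.0, arms)            # never explores
some_exploration = run_bandit(0.1, arms)  # explores 10% of the time
pure_exploration = run_bandit(1.0, arms)  # never exploits
```

With `epsilon` at zero the agent never tries the second arm at all, so it settles on the worse one. With `epsilon` at one it finds the better arm but never commits to it. A small amount of exploration pays a modest ongoing cost and ends up ahead of both.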
The trade-off matters because exploration is rarely free. A robot exploring random motor commands can damage itself. A self-driving system exploring random steering can crash. A recommender exploring random recommendations annoys users and bleeds revenue. A dialogue system exploring random replies looks broken. In every real deployment, exploration carries a cost the algorithm has to weigh against the value of what it might find.
This is why simulation matters. In a simulator, exploration is almost free; the only cost is compute. The system runs millions of exploratory trajectories, finds what works, and only then acts in the real world. Modern robotic policies are often trained mostly in simulation and fine-tuned on a small amount of real-world data because real-world exploration is the bottleneck.
The horizon is how far into the future the consequences of the agent's actions reach. A short-horizon problem has reward arriving within a few steps; a long-horizon one has reward arriving after hundreds or thousands.
Immediate-reward problems (per-frame click prediction in a recommender, single-turn dialogue scoring) are essentially supervised problems wearing a thin RL costume. Each action gets its own reward; credit assignment is trivial. These are easy.
Long-horizon problems (Go, dialogue strategy across many turns, robotic task completion, warehouse routing, industrial control) are where RL becomes genuinely difficult. The agent has to take actions whose value depends on consequences hundreds of steps away. Credit assignment is hard, exploration is expensive, gradient variance grows. Modern reasoning models, which produce long chains of textual deliberation, are tackling a horizon-length problem at the level of intermediate reasoning steps. L70 treats this directly.
Each piece of an RL system creates its own hardware load. The policy and value networks are GPU workloads, identical in shape to deep learning elsewhere. The environment is the hard part. A physics simulator is usually CPU-bound. A game engine is CPU plus GPU but tuned for graphics, not for ML throughput. A real robot runs at wall-clock speed, which is hopelessly slow for training.
The systems trick is to run thousands of environment instances in parallel, batch the experience they produce, and keep the GPU fed at full throughput. Half of modern RL infrastructure is plumbing for this. Simulator throughput became its own engineering problem: a simulator that does 1000 environment steps per second per CPU core is far more useful than one that does 100. Compiler optimisation for physics code, JIT-compiled environments, GPU-accelerated rasterisation: all were forced by RL's appetite for environment interactions.
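The batching pattern can be sketched without any real parallelism. The class below is a hypothetical stand-in: `BatchedCorridor`, its five-cell layout, and the auto-reset behaviour are assumptions for illustration, and the environments are stepped in a plain Python loop where a production system would use worker processes or a GPU simulator.

```python
import random
from typing import List, Tuple

class BatchedCorridor:
    """Hypothetical batched environment: N independent corridors stepped in lockstep."""

    def __init__(self, num_envs: int, length: int = 5):
        self.length = length
        self.states = [0] * num_envs

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], List[bool]]:
        """Apply one action per environment; return observations, rewards, done flags."""
        rewards, dones = [], []
        for i, action in enumerate(actions):
            s = min(max(self.states[i] + action, 0), self.length - 1)
            done = s == self.length - 1
            rewards.append(1.0 if done else 0.0)
            dones.append(done)
            self.states[i] = 0 if done else s  # finished episodes restart at cell 0
        return list(self.states), rewards, dones

rng = random.Random(0)
envs = BatchedCorridor(num_envs=1024)
batch = []
for _ in range(8):
    obs = list(envs.states)
    # Stand-in for a single batched policy call on the accelerator.
    actions = [rng.choice([-1, 1]) for _ in obs]
    next_obs, rewards, dones = envs.step(actions)
    batch.extend(zip(obs, actions, rewards, next_obs, dones))
```

Eight lockstep steps across 1024 environments yield 8192 transitions for one learner update. The design point is that the policy is called once per step on the whole batch of observations rather than once per environment.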
Robotics adds another layer. Real-world data collection is bounded by physical time; an hour of robot operation is an hour. So robotics RL leans heavily on simulation for the bulk of training and treats real-world data as the expensive fine-tuning step. Domain randomisation (varying simulator textures, lighting, physics parameters) is the trick that lets sim-trained policies transfer.
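Domain randomisation amounts to drawing a fresh simulator configuration for each training episode. The sketch below assumes three parameters and made-up ranges; `SimParams` and `sample_sim_params` are hypothetical names, and a real setup would pass the sampled values to whatever simulator is in use.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float         # surface friction coefficient
    mass_kg: float          # mass of the manipulated object
    light_intensity: float  # relative scene brightness

def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomised simulator configuration (all ranges are illustrative)."""
    return SimParams(
        friction=rng.uniform(0.5, 1.5),
        mass_kg=rng.uniform(0.8, 1.2),
        light_intensity=rng.uniform(0.3, 1.0),
    )

rng = random.Random(0)
# One configuration per training episode, so the policy never sees a single fixed world.
episode_params = [sample_sim_params(rng) for _ in range(1000)]
```

The intent is that a policy trained across this spread of conditions treats the real world as one more draw from the same range rather than as something it has never seen.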
Inference latency closes the loop. A vehicle's control policy might have 20-50 ms per decision. A trading policy might have microseconds. The trained policy has to be small and fast at inference, often distilled from a larger network trained offline.
Reinforcement learning from human feedback (RLHF) is RL applied to language models. It is worth treating carefully because it sometimes gets framed as something different.
It isn't different. The agent is the language model. Each generated response is a trajectory of tokens. The reward is a learned scalar from a separate reward model trained on human comparisons of model outputs. The policy is updated to produce responses that score higher on the reward model. What changed is the source of the reward, not the mechanics of RL.
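A reward model trained on human comparisons needs a loss that turns "response A was preferred over response B" into a gradient. One common choice, assumed here rather than taken from the lesson, is the pairwise logistic (Bradley-Terry style) loss on the difference of the two scalar scores. The function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        # log(1 + exp(-margin)); exp of a non-positive number cannot overflow
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten as -margin + log(1 + exp(margin)) for negative margins
    return -margin + math.log1p(math.exp(margin))

agrees_with_rater = pairwise_preference_loss(2.0, 0.0)     # model ranks the pair as the human did
undecided = pairwise_preference_loss(0.0, 0.0)             # model scores both responses equally
disagrees_with_rater = pairwise_preference_loss(0.0, 2.0)  # model ranks the pair the wrong way
```

The loss is small when the reward model scores the human-preferred response higher and grows as it ranks the pair the other way. Whatever systematic preferences the raters had are baked into the scores this objective produces.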
The standard RL failure modes follow directly. Reward hacking: the policy finds outputs that score well on the reward model without being what humans would prefer. Sycophancy: the policy learns that agreement reliably increases the score, because the human raters who trained the reward model tended to prefer agreement, even when the user was wrong. In both cases the optimiser is doing exactly what the signal asked; the problem is that the reward is a flawed proxy for what you actually wanted. The post-2022 alignment literature is largely about partially solving that proxy problem. L52 returns to RLHF after the architecture context is in place.
Figure 6.1 shows the same RL system from three angles. The top panel is the short-horizon loop, agent and environment trading state and action one step at a time. The middle panel widens out to the branching trajectories a policy could choose between, with the explore-vs-exploit choice made visible. The bottom panel turns the trajectory linear again and shows credit assignment: a single reward at the end, attributed back across the decisions that led there.
In L1's loop, RL is the case where the next input is partially produced by the last output. In L2's terms, the policy is a compression of "how to act in this environment". In L3's terms, generalisation in RL means a policy that transfers to environments sharing dynamics with its training one. In L4's terms, emergent strategies appear at scale in RL just as in LLMs; the AlphaGo move-37 example was exactly that. In L5's terms, RL is the regime where the optimiser does not get to choose its training data; the data is whatever the current policy collects.
Static prediction maps an input to an output. Sequential RL maps a trajectory through an environment toward an objective, with the agent's own actions shaping which trajectories are available. The optimisation target is no longer a single label or token; it is the cumulative reward along a path the agent itself drew.
The maze on the wall shows the start cell and the goal cell. The policy is whatever connects them, learned from the agent walking the maze many times and noticing which paths reached the goal.
Lesson 7 sits at the mirror on the bench (station 7) and looks directly at the thing every system in this course depends on: representation, the internal form of state, prediction, and policy that the optimiser actually works against.