PHASE 1 · FOUNDATIONS OF INTELLIGENCE
06 / 78

Reinforcement learning and sequential decision-making

Lesson 6. Phase 1: Foundations of intelligence. ~25 min read + cards + retrieval. Durability tier 1 (bedrock).

🧭
Memory palace · Bench · station 6
The maze on the wall. A printed gridworld. You see the start cell and the goal cell. Trajectories, branching futures, delayed consequences, and a policy as the compressed navigation strategy through time.
Core idea. Reinforcement learning is the optimisation of behaviour through interaction over time, where actions change future states and rewards arrive only after trajectories unfold.

Why this lesson exists

Most of the lessons so far framed AI as static prediction: input X, output Y, measure error, update. That frame is enough for image classification, language modelling, and recommender baselines. It breaks the moment a system's outputs change the inputs it will see next.

A robot's policy decides whether to step left or right; the next camera frame depends on that decision. A trading agent's order moves the price; the next observed price depends on the order it placed. A dialogue system's reply shapes the user's next message. In every case the system is no longer a passive function of the world; it is a closed-loop participant in it.

Static prediction can't optimise that directly. The training data the optimiser would need (here's a state, here's the right action) doesn't exist in advance, because the right action depends on the future the action itself produces. That's the engineering corner reinforcement learning was built to handle.

The core actors

Five terms do the load-bearing work in RL: state, action, reward, environment, policy. A sixth, trajectory, is what they produce when run through time.

A state is whatever the agent needs to know to make its next decision. For a chess agent, the board position. For a robot, joint angles, sensor readings, and recent history. For a dialogue agent, the conversation so far plus internal context. State quality determines policy quality; if the representation omits information that matters, no amount of learning recovers it.

An action is a choice the agent makes given a state (move pawn to e4, turn motor by 0.3 radians, generate token "the"). A reward is a scalar the environment emits in response, usually sparsely. A Go agent's reward is win/lose at the end of a game (one bit). An environment is whatever produces the next state and reward given the current state and action: a simulator, a game, a real robot in a lab, a market, a human in the loop.

A policy is a function from state to action (or to a distribution over actions). It is the compressed behavioural strategy the agent has learned. Train an RL system long enough and what you ship is the policy.

A trajectory is the sequence of state-action-reward tuples produced when a policy interacts with an environment for some number of steps. Trajectories are what the optimiser actually sees and learns from.
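
The six terms above can be sketched as a few lines of code. This is a minimal illustration, not a reference implementation: the `GridWorld` class, the 4x4 size, the goal cell, and the hand-written `right_then_down` policy are all assumptions made up for the example, loosely echoing the maze on the wall.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]          # (row, col) in the grid
Step = Tuple[State, str, float]  # (state, action, reward)

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class GridWorld:
    """Environment: produces the next state and reward from (state, action)."""

    def __init__(self, size: int = 4, goal: State = (3, 3)) -> None:
        self.size = size
        self.goal = goal

    def reset(self) -> State:
        return (0, 0)  # start cell

    def step(self, state: State, action: str) -> Tuple[State, float, bool]:
        dr, dc = MOVES[action]
        # Clamp to the grid so walking into a wall leaves the agent in place.
        row = min(max(state[0] + dr, 0), self.size - 1)
        col = min(max(state[1] + dc, 0), self.size - 1)
        next_state = (row, col)
        done = next_state == self.goal
        reward = 1.0 if done else 0.0  # sparse: reward only at the goal
        return next_state, reward, done


def rollout(env: GridWorld, policy: Callable[[State], str], max_steps: int = 50) -> List[Step]:
    """Run the policy in the environment and record the trajectory."""
    trajectory: List[Step] = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def right_then_down(state: State) -> str:
    """A hand-written (not learned) policy: right along the top row, then down."""
    return "right" if state[1] < 3 else "down"


trajectory = rollout(GridWorld(), right_then_down)
```

Here the trajectory has six steps, and only the last one carries a non-zero reward. A learned policy would replace `right_then_down`; everything else in the loop stays the same.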

Sequential vs static

In static prediction, training examples are assumed to be drawn independently from a fixed distribution. In RL, training examples come from trajectories the current policy generated, so the data distribution depends on the policy. Change the policy, get different data.

This is the closed loop that makes RL its own beast. In L1's system loop, RL is the case where the input the system sees on step t+1 is partially produced by the output it produced on step t. The feedback is no longer a thumbs-up on a static answer; it is a thumbs-up on a sequence, evaluated only at the end (or at sparse intermediate points).
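
A small sketch of "change the policy, get different data", under invented assumptions: a ten-cell corridor, two hypothetical policies (`mostly_left`, `mostly_right`), and a fixed random seed. Nothing here comes from a real library beyond Python's standard `random` and `collections`.

```python
import random
from collections import Counter
from typing import Callable

Policy = Callable[[int, random.Random], int]  # (position, rng) -> step of -1 or +1


def mostly_left(position: int, rng: random.Random) -> int:
    return -1 if rng.random() < 0.9 else 1


def mostly_right(position: int, rng: random.Random) -> int:
    return 1 if rng.random() < 0.9 else -1


def collect_states(policy: Policy, episodes: int = 20, steps: int = 20, seed: int = 0) -> Counter:
    """Count how often each corridor cell (0..9) appears in the policy's own data."""
    rng = random.Random(seed)
    visits: Counter = Counter()
    for _ in range(episodes):
        position = 5  # every episode starts in the middle of the corridor
        for _ in range(steps):
            visits[position] += 1
            position = min(max(position + policy(position, rng), 0), 9)
    return visits


def mean_position(visits: Counter) -> float:
    return sum(cell * count for cell, count in visits.items()) / sum(visits.values())


left_data = collect_states(mostly_left)
right_data = collect_states(mostly_right)
```

The environment and the starting cell are identical in both runs, yet `left_data` is concentrated near cell 0 and `right_data` near cell 9. An optimiser trained on one dataset would rarely see the states that dominate the other.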

Credit assignment through time

When a Go game ends after 200 moves with a win, which moves were responsible? The first one? The last twenty? Some specific tactical sequence in the middle? The reward signal (one bit at the end) gives no per-move breakdown. The optimiser has to spread credit and blame across the trajectory.

This is the credit-assignment problem from L5, treated head-on. It is what makes RL harder than supervised learning at a fundamental level. In supervised learning, the gradient for each example tells you exactly which weights to nudge. In RL, you get a delayed scalar and have to attribute it backwards across hundreds of decisions.

The algorithmic machinery of RL (value functions, advantage estimates, policy gradients with discount factors) exists almost entirely to solve credit assignment. We formalise those later. Here just hold the intuition: the further apart in time the action and the reward, the harder the credit signal is to attribute. Long-horizon problems are not just slower; they are qualitatively different.
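
One common way to spread a delayed reward backwards is a discounted return, where each step's credit is its own reward plus a discounted share of everything that follows. The sketch below assumes that scheme and an arbitrary discount factor `gamma = 0.5`; it illustrates the fading-with-distance intuition and is not the full machinery formalised later in the course.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A four-step trajectory with a single reward at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
credit = discounted_returns(rewards, gamma=0.5)
# credit == [0.125, 0.25, 0.5, 1.0]
```

The final action receives full credit and each earlier action receives half as much as the one after it. Over hundreds of steps with a discount below 1, the earliest actions receive almost nothing, which is one concrete face of the long-horizon difficulty.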

Exploration vs exploitation

A supervised learner is shown what to learn from. An RL agent has to discover what to learn from by trying things. This creates a trade-off with no closed-form answer.

Exploit your current best policy and you collect reliable reward but never find a better one. Explore widely and you collect bad reward while you search. An agent that explores too much never converges. One that exploits too early settles into a poor local optimum.
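
The trade-off can be seen in a toy setting. The sketch below uses epsilon-greedy action selection on a three-armed bandit; both the technique and the numbers (`true_means`, 2000 steps, the seed) are assumptions chosen for illustration, and epsilon-greedy is only one of many exploration heuristics.

```python
import random
from typing import List


def run_bandit(true_means: List[float], epsilon: float, steps: int = 2000, seed: int = 0) -> float:
    """Play a Bernoulli bandit with epsilon-greedy; return the average reward."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore: pick any arm
        else:
            # Exploit: pick the arm with the best estimate (ties go to the first).
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total / steps


true_means = [0.2, 0.5, 0.8]  # hypothetical payout probabilities per arm
greedy_only = run_bandit(true_means, epsilon=0.0)
explore_only = run_bandit(true_means, epsilon=1.0)
balanced = run_bandit(true_means, epsilon=0.1)
```

With `epsilon=0.0` the agent never tries anything but the first arm, so it stays on the worst option indefinitely. With `epsilon=1.0` it samples all arms uniformly and never cashes in on what it learns. A small non-zero epsilon sits between the two.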

The trade-off matters because exploration is rarely free. A robot exploring random motor commands can damage itself. A self-driving system exploring random steering can crash. A recommender exploring random recommendations annoys users and bleeds revenue. A dialogue system exploring random replies looks broken. In every real deployment, exploration carries a cost the algorithm has to weigh against the value of what it might find.

This is why simulation matters. In a simulator, exploration is almost free; the only cost is compute. The system runs millions of random trajectories, finds what works, and only then acts in the real world. Modern robotic policies are trained mostly in simulation and fine-tuned on a small amount of real-world data because real-world exploration is the bottleneck.

Horizon length

The horizon is how far into the future the consequences of the agent's actions reach. A short-horizon problem has reward arriving within a few steps; a long-horizon one has reward arriving after hundreds or thousands.

Immediate-reward problems (per-impression click prediction in a recommender, single-turn dialogue scoring) are essentially supervised problems wearing a thin RL costume. Each action gets its own reward, so credit assignment is trivial. These are the easy cases.

Long-horizon problems (Go, dialogue strategy across many turns, robotic task completion, warehouse routing, industrial control) are where RL becomes genuinely difficult. The agent has to take actions whose value depends on consequences hundreds of steps away. Credit assignment is hard, exploration is expensive, gradient variance grows. Modern reasoning models, which produce long chains of textual deliberation, are tackling a horizon-length problem at the level of intermediate reasoning steps. L70 treats this directly.

Hardware and infrastructure

Each piece of an RL system creates its own hardware load. The policy and value networks are GPU workloads, identical in shape to deep learning elsewhere. The environment is the hard part. A physics simulator is usually CPU-bound. A game engine is CPU plus GPU but tuned for graphics, not for ML throughput. A real robot runs at wall-clock speed, which is hopelessly slow for training.

The systems trick is to run thousands of environment instances in parallel, batch the experience, and feed the GPU at full throughput. Half of modern RL infrastructure is plumbing for this. Simulator throughput became its own engineering problem: a simulator that does 1000 environment steps per second per CPU core is far more useful than one that does 100. Compiler optimisation for physics code, JIT-compiled environments, GPU-accelerated rasterisation: all forced by RL's appetite for environment interactions.
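
The batching pattern can be sketched in plain Python. Everything here is hypothetical: `CounterEnv` is a toy stand-in for a simulator, and the environments are stepped in a simple loop rather than across processes or on a GPU as a real system would. What the sketch keeps is the structure: many environments advance in lockstep, and the policy is called once per batch rather than once per environment.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)


class CounterEnv:
    """Toy environment: the state counts up; hitting `target` pays 1 and resets."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.state = 0

    def step(self, action: int) -> Tuple[int, float]:
        self.state += action
        if self.state >= self.target:
            self.state = 0  # auto-reset so one finished episode never stalls the batch
            return self.state, 1.0
        return self.state, 0.0


def collect_batched(
    envs: List[CounterEnv],
    batched_policy: Callable[[List[int]], List[int]],
    ticks: int,
) -> List[Transition]:
    """Step every environment once per tick, calling the policy once per batch."""
    buffer: List[Transition] = []
    states = [env.state for env in envs]
    for _ in range(ticks):
        actions = batched_policy(states)  # one call covers the whole batch
        next_states = []
        for env, state, action in zip(envs, states, actions):
            next_state, reward = env.step(action)
            buffer.append((state, action, reward, next_state))
            next_states.append(next_state)
        states = next_states
    return buffer


def always_increment(states: List[int]) -> List[int]:
    """Placeholder policy; a real one would be a single batched network call."""
    return [1 for _ in states]


envs = [CounterEnv(target=5) for _ in range(8)]
experience = collect_batched(envs, always_increment, ticks=10)
```

Eight environments over ten ticks yield eighty transitions from ten policy calls. Scaling the number of environments raises the experience collected per policy call, which is the lever that keeps the accelerator busy.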

Robotics adds another layer. Real-world data collection is bounded by physical time; an hour of robot operation is an hour. So robotics RL leans heavily on simulation for the bulk of training and treats real-world data as the expensive fine-tuning step. Domain randomisation (varying simulator textures, lighting, physics parameters) is the trick that lets sim-trained policies transfer.
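
Domain randomisation amounts to drawing fresh simulator parameters for each training episode. The sketch below assumes a hypothetical `SimParams` bundle and made-up ranges; a real project would take both from its simulator and from measurements of the physical robot.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical bundle of simulator settings varied between episodes."""
    friction: float
    mass_scale: float
    sensor_noise_std: float
    light_intensity: float


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomised simulator configuration (ranges are illustrative only)."""
    return SimParams(
        friction=rng.uniform(0.5, 1.5),
        mass_scale=rng.uniform(0.8, 1.2),
        sensor_noise_std=rng.uniform(0.0, 0.05),
        light_intensity=rng.uniform(0.3, 1.0),
    )


rng = random.Random(0)
# One configuration per training episode; the policy never sees the same sim twice.
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

Wider ranges make the policy more tolerant of the gap between simulator and robot, at the cost of a harder learning problem and often more conservative behaviour.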

Inference latency closes the loop. A vehicle's control policy might have 20-50 ms per decision. A trading policy might have microseconds. The trained policy has to be small and fast at inference, often distilled from a larger network trained offline.

RLHF

Reinforcement learning from human feedback is RL applied to language models. Worth treating carefully because it sometimes gets framed as something different.

It isn't different. The agent is the language model. Each generated response is a trajectory of tokens. The reward is a learned scalar from a separate reward model trained on human comparisons of model outputs. The policy is updated to produce responses that score higher on the reward model. What changed is the source of the reward, not the mechanics of RL.
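
To make "trained on human comparisons" concrete, the sketch below assumes a pairwise logistic (Bradley-Terry style) objective, which is a common choice for reward models but is not something this lesson specifies. The function names and the example scores are hypothetical; in practice the scores would come from a network that reads the prompt and response.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred, under the assumed logistic model."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human's choice; lower when the scores agree with it."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# The reward model scored the human-preferred response higher: small loss.
agrees = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
# The reward model scored the rejected response higher: large loss.
disagrees = pairwise_loss(score_chosen=0.0, score_rejected=2.0)
```

Minimising this loss over many comparisons pushes the reward model's scalar to rank responses the way the raters did. The policy is then optimised against that scalar, so whatever biases the raters had are carried along with it.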

The standard RL failure modes follow directly. Reward hacking: the policy finds outputs that score well on the reward model without being what humans would prefer. Sycophancy: the policy learns that agreement reliably increases the score, because the human raters whose comparisons trained the reward model tended to prefer agreement, even when the user was wrong. In both cases the optimiser is doing exactly what the signal asked; the problem is that the reward is a flawed proxy for what you actually wanted. The post-2022 alignment literature is largely about partially solving that proxy problem. L52 returns to RLHF after the architecture context is in place.

Three views of an RL system

Figure 6.1 shows the same RL system from three angles. The top panel is the short-horizon loop, agent and environment trading state and action one step at a time. The middle panel widens out to the branching trajectories a policy could choose between, with the explore-vs-exploit choice made visible. The bottom panel turns the trajectory linear again and shows credit assignment: a single reward at the end, attributed back across the decisions that led there.

[Figure 6.1, three stacked panels. Panel 1, one decision at a time (short horizon): states s₀ to s₃ linked by actions a₀ to a₂ and rewards r₀ to r₂; the agent observes a state, takes an action, and the environment returns the next state and reward. Panel 2, branching trajectories: from s₀, branches A, B and C lead to leaves labelled with the reward found or guessed so far (+8, +2, +1, +5?, 0); a solid line marks exploit (best known path so far), a dashed line marks explore (less data, possibly better). Panel 3, delayed reward: states s₀ to s₇ followed by one scalar R at the end, with credit attributed backwards from the final reward and fading with distance.]
FIG 6.1. Three views of the same RL system. Top: the agent-environment loop, one step at a time. Middle: the branching futures a policy could choose between, with exploit and explore highlighted. Bottom: a single delayed reward, attributed back across the trajectory that produced it. Together: actions reshape future inputs, and reward arrives only after trajectories unfold.

The L1 to L5 view

In L1's loop, RL is the case where the next input is partially produced by the last output. In L2's terms, the policy is a compression of "how to act in this environment". In L3's terms, generalisation in RL means a policy that transfers to environments sharing dynamics with its training one. In L4's terms, emergent strategies appear at scale in RL just as in LLMs; the AlphaGo move-37 example was exactly that. In L5's terms, RL is the regime where the optimiser does not get to choose its training data; the data is whatever the current policy collects.

The takeaway

Static prediction maps an input to an output. Sequential RL maps a trajectory through an environment toward an objective, with the agent's own actions shaping which trajectories are available. The optimisation target is no longer a single label or token; it is the cumulative reward along a path the agent itself drew.

The maze on the wall shows the start cell and the goal cell. The policy is whatever connects them, learned from the agent walking the maze many times and noticing which paths reached the goal.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L6 Compare 3 ways to build a system that decides which news article to surface to a user: (a) static supervised prediction (predict click probability per article), (b) full RL on long-horizon user satisfaction, and (c) a feedback-loop system that updates a click-prediction model online from user behaviour. For each, describe the optimisation signal, the trajectory shape, and one failure mode you would expect in production.
(a) Static supervised. Signal: per-impression click label, dense and immediate. Trajectory shape: none; each impression is treated independently. The model learns p(click | article, user_features). Failure mode: clickbait wins. The signal doesn't see the next 10 minutes of user satisfaction, only the immediate click, so any feature that drives clicks at the expense of satisfaction gets reinforced. Whatever the click-prediction model learns is what the system surfaces, regardless of downstream consequences.
(b) Full RL. Signal: long-horizon reward like daily-active-user retention or self-reported satisfaction at session end. Trajectory shape: a sequence of article impressions per user across many sessions; the policy's choice on impression t affects which articles are available and relevant for impression t+1 (through staleness, fatigue, taste shift). Failure mode: sample inefficiency and credit assignment. Long-horizon reward attributable across many decisions is hard to learn; the system needs huge amounts of user-session data and may converge slowly. Reward hacking is also a risk: a policy could learn to keep users on the platform via mechanisms that are addictive without being good.
(c) Feedback-loop online learning. Signal: the supervised click label, but the training distribution shifts continuously because the model itself decides which articles users see. Trajectory shape: not formalised, but a real loop exists. The model picks an article, the user clicks or doesn't, the click label updates the model, the updated model picks next time. Failure mode: filter bubble collapse. The model surfaces what its current beliefs say users will click; users click those; the model learns "users like X" more strongly; surfaces more X; etc. This is RL in disguise without explicit horizon or value estimation, so the failure modes of long-horizon optimisation appear without the algorithmic machinery for handling them.
Production recommender systems usually combine all 3 with careful guardrails.
L6 A robotics team trains a quadruped to walk in simulation using RL. The policy works beautifully in sim and fails to transfer to the real robot. Without using any RL maths, explain mechanistically why this happens in terms of state representation, exploration, horizon, and reward shape. What 3 mechanisms would you reach for to bridge the gap, and what trade-off does each carry?
The sim policy is optimised against a specific environment: a specific physics engine, specific friction coefficients, specific actuator response curves, specific sensor noise profile, specific camera characteristics. The trained policy's state representation encodes features that worked under exactly that environment; some of those features will be specific to simulator artefacts (numerical integration tics, friction model idealisations) that the real robot doesn't reproduce. Exploration in sim was free, so the policy may have settled into a basin that depends on regions of state space that the real robot's actuators or sensors can't actually produce reliably. The horizon was long enough to develop strategies that depend on specific multi-step interactions; small per-step differences between sim and real compound across the trajectory and the policy ends up in states it never saw. Reward shape may have rewarded behaviours that sim incentivised cheaply (sliding instead of walking, exploiting collision modelling glitches) and the real robot can't replicate.
Three bridging mechanisms:
(1) Domain randomisation. Vary simulator parameters during training (friction, masses, latencies, sensor noise) so the policy learns features stable across the variation rather than tuned to one sim. Trade-off: the policy becomes more conservative and may underperform when the real robot is well-characterised.
(2) Real-world fine-tuning. Run the trained sim policy on the real robot, collect a small amount of real-data trajectories, fine-tune. Trade-off: real-world exploration during fine-tuning is dangerous (can damage the robot) and expensive (wall-clock bounded).
(3) Stronger inductive bias in state and action representation. Use state features that are invariant by construction (joint angles in body coordinate frame, contact forces normalised against weight) and action representations that respect physical limits. Trade-off: design effort up front, and may underexpress strategies that depend on capturing sim-specific cues that happen also to be real-world useful.
All 3 are sequential-optimisation versions of L3's distribution shift problem; the policy is "distribution-fit" to the simulator, and the real robot is the shifted distribution.
↳ L7 (Forward interleave to L7, representation.) State is what the agent needs to know to make a good decision. Representation is how that state is encoded inside the system. Sketch why state representation choice is a load-bearing piece of any RL system, and what kinds of representations work well for short-horizon vs long-horizon problems. Lightly: what would you change in the representation if you needed the policy to plan further ahead?
State representation is load-bearing because the policy is a function of state. Whatever information the representation omits, the policy cannot use, no matter how much training is done. A state that contains only the current sensor reading lets the policy react to the present moment but loses any historical context. A state that includes a short window of recent observations supports short-horizon reasoning (the agent can detect velocities, current trends, immediate predictions). A state that includes a much longer summary (a learned recurrent or attention-based embedding of the entire trajectory) supports long-horizon planning, but is harder to learn and increases inference cost.
For short-horizon problems (collision avoidance, reflexive control), a few recent observations are usually enough; the policy just needs to handle the immediate few steps. For long-horizon problems (route planning, dialogue strategy, multi-step manipulation), the representation needs to summarise enough of the past to predict consequences many steps out.
If the goal is to plan further ahead, the representation needs to encode predictive structure: not just where I am, but what I expect to happen, and what I have already tried. Modern RL approaches do this via learned world models (a network that predicts the next state given current state and action), and by giving the policy access to that predictive structure. This is exactly what L7 starts to formalise: representation as the internal language the system uses to describe what it is doing, and the choice that downstream decisions are forced to live with.
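
The "short window of recent observations" idea from the answer above can be sketched as follows. The `ObservationWindow` class and the one-dimensional position readings are hypothetical; the point is only that a velocity, invisible in any single reading, becomes recoverable once the state holds the last few readings.

```python
from collections import deque
from typing import Deque, Tuple


class ObservationWindow:
    """Builds a fixed-length state from the most recent observations."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.buffer: Deque[float] = deque(maxlen=size)

    def push(self, observation: float) -> Tuple[float, ...]:
        if not self.buffer:
            # Pad with the first observation so the state always has `size` entries.
            self.buffer.extend([observation] * self.size)
        else:
            self.buffer.append(observation)  # oldest entry drops off automatically
        return tuple(self.buffer)


def estimated_velocity(state: Tuple[float, ...]) -> float:
    """Difference between the two newest readings: unavailable from one reading alone."""
    return state[-1] - state[-2]


window = ObservationWindow(size=3)
for position in [0.0, 1.0, 3.0]:  # hypothetical sensor readings over three steps
    state = window.push(position)
# state == (0.0, 1.0, 3.0); estimated_velocity(state) == 2.0
```

A longer window or a learned summary extends the same idea further back in time, at the cost of a larger input and a harder learning problem.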

Next station

Lesson 7 sits at the mirror on the bench (station 7) and looks directly at the thing every system in this course depends on: representation, the internal form of state, prediction, and policy that the optimiser actually works against.