Build Track · B1

Tabular Q-learning agent on gridworld

Build a small reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, written by hand in numpy. No neural network, no RL library. You watch value spread backward from the goal, one cell per episode.

Phase: 1, after L6 Time: ~3 hours Tier: 1 (any laptop, CPU) Tooling: numpy and a plotting library Status: optional (depth-by-choice)

where this sits B1 is the Build Track's first hands-on build, attached to L6 (Reinforcement learning and sequential decision-making). It is optional and not required to continue the course. Do it now, do it later, or skip it and carry on with the lessons.

before you start B1 assumes you can run a Python script and install numpy (matplotlib is optional). If any of that is new, the Tooling references have you covered: Python setup, Running Python, Installing packages, and Reading errors. New to Python itself, but you already program in another language? Read Python basics once, then keep the Python cheatsheet open while you build. They are optional setup help, not part of the build.

Summary

You build a reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, by hand in numpy. The point is to make L6's vocabulary physical: state, action, reward, and policy become array indices and numbers you can print; exploration versus exploitation becomes a single if-statement; and temporal credit assignment becomes value you watch spread backward from the goal, one cell per episode. When it works, a printed grid of arrows shows the shortest path the agent found, and a steps-to-goal curve falls and flattens.

Learning goals

See state, action, reward, and policy as concrete objects: a cell index, one of four moves, a number, and an argmax over a table.
Implement epsilon-greedy action selection and watch the explore-then-exploit trade play out as epsilon decays.
Write the Q-learning update by hand and be able to say what each term does.
Watch temporal credit assignment happen: value propagates backward from the goal across episodes, with no part of the code knowing the goal in advance.
Leave able to explain, without mysticism, that the learned policy is argmax over a table of numbers shaped by a feedback rule.

Prerequisites

L6 read, so state, action, reward, policy, exploration versus exploitation, and temporal credit assignment are familiar.
Comfort reading short numpy. No prior RL coding needed. If installing a package or running a script is new, see the Tooling references linked above. Depth-by-choice: skipping this does not block conceptual progress.

Estimated time

About 3 hours: roughly 1 hour on the environment, 1 hour on the agent and training loop, and 1 hour reading the output and getting the policy to converge. If it runs well past that, stop and re-read the L6 section on the update rule, which is the usual sticking point.

Deliverables

q_learning.py: the gridworld, the Q-table, epsilon-greedy selection, the training loop, and a function that prints the greedy policy as arrows.
One progress artefact: a saved plot or printed table of steps-to-goal per episode (or total reward per episode), showing the curve settle.
README.md: what you built, what you learned (the update rule in your own words), the gotchas you hit, and the numbers you observed (episodes to converge, final path length).

Step-by-step instructions

Build the grid. Define a 5×5 layout with a start cell, a goal cell, and three or four obstacle cells. Decide your state encoding (row-major flatten to 0 through 24 is simplest).
Write the environment step. Given a cell and an action (up, down, left, right), return the next cell, a reward, and a done flag. Moving off-grid or into an obstacle keeps the agent in place. Reaching the goal returns a positive reward and done = True. Every other step returns a small negative reward, so shorter paths score higher.
Create the Q-table. A numpy array of shape (25, 4), initialised to zeros. Rows are states, columns are actions.
Write epsilon-greedy selection. With probability epsilon, return a random action; otherwise return the argmax over the current state's row. This one function is the whole exploration-versus-exploitation idea.
Write the training loop. For each episode: reset to the start, then repeatedly choose an action, step the environment, apply the update, and move to the next cell until the goal is reached or a step cap is hit. Decay epsilon from near 1.0 toward about 0.05 across the run.
Apply the update by hand. The heart of the build: Q[s, a] += alpha * (target - Q[s, a]), where target = r at a terminal step and target = r + gamma * max(Q[s_next]) otherwise. Use roughly alpha = 0.1, gamma = 0.95.
Record progress. Append steps-to-goal (or total reward) per episode to a list.
Visualise the policy. Print the grid with an arrow in each cell pointing to its greedy action, and mark the start, goal, and obstacles. Plot or print the progress list.
Read the result. Confirm the greedy policy traces a sensible short path, and that the progress curve falls and flattens. Compare the Q-table early versus late and find where value spread backward from the goal.
Write the README. Explain the update rule in your own words, and note what surprised you.

Suggested file structure

builds/B1/
  q_learning.py     # environment + Q-table + epsilon-greedy + train loop + policy print
  progress.png      # steps-to-goal per episode (or print to console instead)
  README.md         # what you built, what you learned, gotchas, numbers observed

One file is plenty at this size (the milestone is roughly 80 lines). Splitting the environment and agent into separate files is fine but not required.

Starter skeleton

The two lines that carry the learning are left for you to write. Everything around them is scaffolding. Writing those two lines yourself is the milestone.

import numpy as np

# --- environment: 5x5 gridworld ------------------------------------
GRID = (5, 5)
START, GOAL = (4, 0), (0, 4)
OBSTACLES = {(1, 1), (2, 2), (3, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
N_STATES, N_ACTIONS = GRID[0] * GRID[1], len(ACTIONS)

def to_state(cell):
    return cell[0] * GRID[1] + cell[1]

def step(cell, a):
    dr, dc = ACTIONS[a]
    nxt = (cell[0] + dr, cell[1] + dc)
    on_grid = 0 <= nxt[0] < GRID[0] and 0 <= nxt[1] < GRID[1]
    if not on_grid or nxt in OBSTACLES:
        nxt = cell                       # illegal move: stay put
    if nxt == GOAL:
        return nxt, 1.0, True            # reaching the goal ends the episode
    return nxt, -0.01, False             # small step cost rewards short paths

# --- agent ---------------------------------------------------------
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.95
episodes, max_steps = 500, 100

def choose(s, epsilon):
    # TODO: with prob epsilon return a random action (explore),
    # otherwise return int(np.argmax(Q[s])) (exploit).
    ...

for ep in range(episodes):
    epsilon = max(0.05, 1.0 - ep / (0.8 * episodes))   # explore early, exploit late
    cell = START
    for t in range(max_steps):
        s = to_state(cell)
        a = choose(s, epsilon)
        nxt, r, done = step(cell, a)
        s2 = to_state(nxt)
        # TODO: the Q-learning update.
        #   target = r if done else r + gamma * np.max(Q[s2])
        #   Q[s, a] += alpha * (target - Q[s, a])
        ...
        cell = nxt
        if done:
            break
# then: print the greedy policy as arrows, and plot steps-to-goal per episode

Expected output

After training, printing the greedy policy gives a grid of arrows that trace a sensible short path from start (S) to goal (G), routing around the obstacles (#). Your exact arrows, path, and obstacle layout will differ with the seed and your choices; this is one plausible result, not a target to match:

 →  →  →  →  G
 ↑  #  ↑  →  ↑
 ↑  →  #  →  ↑
 ↑  #  →  →  ↑
 S  →  →  →  ↑

The progress curve tells the same story over time. Early episodes take many steps to reach the goal and the count is noisy, because the agent is mostly exploring. As epsilon decays and the Q-values settle, steps-to-goal falls and then flattens near the length of a short path. Exact episode counts, final path length, and the shape of the early noise vary by run; what matters is the trend: high and noisy, then falling, then flat. If you print the Q-table early versus late, you should see value first appear in cells next to the goal, then spread outward across episodes.

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.

COMPLETE The agent trains, the greedy policy traces a sensible short path to the goal, and the steps-to-goal curve falls and flattens. You can explain, in the README and out loud, what each term of the update does (the reward, the discounted best next value, the difference being corrected), why epsilon decays, and where you can see value having propagated backward from the goal. The update and the table are your own numpy.

RUNS-NOT-UNDERSTOOD It converges, but you can't yet explain the update rule, or why the max(Q[s_next]) term carries value backward, or why epsilon decay matters. The fix is to re-read L6 and trace the update by hand on a single transition near the goal. Do not mark COMPLETE.

TOOL-LOCKED It works only because you used an RL library or a framework's built-in Q-learning rather than writing the table and the update yourself. The milestone's whole purpose is the visible mechanism, so reframe it to numpy by hand before marking complete.

INCOMPLETE Unfinished, or the agent won't converge and you haven't found the bug yet. A valid resting state for a depth-by-choice track. Come back to it.

Common pitfalls

These are conceptual traps, the places understanding slips even when the code runs.

Bootstrapping past the terminal state. At the goal the target is just r, with no gamma * max(Q[s_next]) term. Forgetting this leaks value past the goal and distorts the table. The single most common Q-learning error.
Misreading epsilon decay. Constant high epsilon leaves the policy noisy because the agent keeps exploring; decaying to zero too early locks in the first path found, which may not be shortest.
Discount and step cost. Without a discount below 1 or a small per-step penalty, the agent has no incentive for a short path and values can go flat or degenerate.
Reading the policy plot loosely. Arrows in obstacle or goal cells do not matter, and argmax ties can point either way. Read the path, not every cell.
Treating "it converged" as "I understand it." This is exactly the COMPLETE versus RUNS-NOT-UNDERSTOOD line. Converging is necessary, not sufficient.

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

Symptom	Likely cause
`ModuleNotFoundError: No module named 'numpy'`	numpy is not installed in the Python you are running. This is a setup issue, not a Q-learning one. See Tooling: Installing packages and Reading errors.
Agent never reaches the goal	The goal step does not return `done = True`, or `max_steps` is too small to cross the grid, or epsilon starts near zero so the agent never explores a path in.
Policy stays random-looking after training	The update writes to the wrong index (using `s2` where `s` belongs, or vice versa), or epsilon never decays so the greedy policy is never exercised. Print `Q` for a cell next to the goal; it should be clearly nonzero.
Values explode or go to NaN	Bootstrapping past the terminal (adding `gamma * max(Q[s_next])` at the goal), or `gamma` at 1.0 with a positive reward in a loop, or `alpha` too high. Ensure `target = r` at terminal and lower `alpha`.
Convergence extremely slow	Epsilon decays too slowly so the agent keeps exploring, too few episodes, or the path is long. Raise the episode count or bring the epsilon floor down sooner.
Agent oscillates between two cells	No step cost, so nothing rewards progress, or argmax always breaks ties the same way. Add a small negative step reward and break ties randomly.
Arrows point into walls or off-grid	A state-indexing bug in `to_state` (row and column swapped, or off by one). Print `to_state` for a few known cells and check against the grid.

Optional extensions

Make it slippery. With some probability the chosen action slips to a neighbour. Watch the policy grow more cautious around obstacles. Your first taste of a stochastic environment.
Q-learning versus SARSA. Add the on-policy SARSA update and compare on a cliff-edge variant. Off-policy Q-learning prefers the optimal-but-risky path; SARSA prefers the safe one.
Animate the value function. Plot max(Q, axis=1) reshaped to the grid every few episodes and watch value fill outward from the goal.
Sweep gamma and the step cost. Observe how path length and convergence speed change.
Grow the grid. Make it 10×10 with more obstacles and watch episodes-to-converge climb. That growth is the quiet argument for why tabular methods stop scaling.

why this build exists L6 gives you the vocabulary of reinforcement learning; B1 puts it in your hands. State, action, reward, and policy stop being words and become a cell index, one of four moves, a number, and an argmax over a table you can print. Exploration versus exploitation is one if-statement you wrote. Temporal credit assignment is value visibly creeping backward from the goal, episode by episode, which is the cleanest answer to how it learns without being told the answer. Q-learning (Watkins, 1989) is the most durable way to feel this, because there is no neural network in the way: a table and one update rule, as valid in 2050 as now. It is also the first build where you write a learning loop by hand, the same forward, observe, update shape that returns at every later training milestone. Done honestly, it dissolves the mysticism: "the agent learned the maze" becomes "argmax over a table of numbers shaped by a feedback rule."

← L6 · Reinforcement learning Syllabus Continue to L7 →