Build Track · B1

Tabular Q-learning agent on gridworld

Build a small reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, written by hand in numpy. No neural network, no RL library. You watch value spread backward from the goal, one cell per episode.

Phase: 1, after L6 Time: ~3 hours Tier: 1 (any laptop, CPU) Tooling: numpy and a plotting library Status: optional (depth-by-choice)
where this sits B1 is the Build Track's first hands-on build, attached to L6 (Reinforcement learning and sequential decision-making). It is optional and not required to continue the course. Do it now, do it later, or skip it and carry on with the lessons.
before you start B1 assumes you can run a Python script and install numpy (matplotlib is optional). If any of that is new, the Tooling references have you covered: Python setup, Running Python, Installing packages, and Reading errors. New to Python itself, but you already program in another language? Read Python basics once, then keep the Python cheatsheet open while you build. They are optional setup help, not part of the build.

Summary

You build a reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, by hand in numpy. The point is to make L6's vocabulary physical: state, action, reward, and policy become array indices and numbers you can print; exploration versus exploitation becomes a single if-statement; and temporal credit assignment becomes value you watch spread backward from the goal, one cell per episode. When it works, a printed grid of arrows shows the shortest path the agent found, and a steps-to-goal curve falls and flattens.

Learning goals

Prerequisites

Estimated time

About 3 hours: roughly 1 hour on the environment, 1 hour on the agent and training loop, and 1 hour reading the output and getting the policy to converge. If it runs well past that, stop and re-read the L6 section on the update rule, which is the usual sticking point.

Deliverables

Step-by-step instructions

  1. Build the grid. Define a 5×5 layout with a start cell, a goal cell, and three or four obstacle cells. Decide your state encoding (row-major flatten to 0 through 24 is simplest).
  2. Write the environment step. Given a cell and an action (up, down, left, right), return the next cell, a reward, and a done flag. Moving off-grid or into an obstacle keeps the agent in place. Reaching the goal returns a positive reward and done = True. Every other step returns a small negative reward, so shorter paths score higher.
  3. Create the Q-table. A numpy array of shape (25, 4), initialised to zeros. Rows are states, columns are actions.
  4. Write epsilon-greedy selection. With probability epsilon, return a random action; otherwise return the argmax over the current state's row. This one function is the whole exploration-versus-exploitation idea.
  5. Write the training loop. For each episode: reset to the start, then repeatedly choose an action, step the environment, apply the update, and move to the next cell until the goal is reached or a step cap is hit. Decay epsilon from near 1.0 toward about 0.05 across the run.
  6. Apply the update by hand. The heart of the build: Q[s, a] += alpha * (target - Q[s, a]), where target = r at a terminal step and target = r + gamma * max(Q[s_next]) otherwise. Use roughly alpha = 0.1, gamma = 0.95.
  7. Record progress. Append steps-to-goal (or total reward) per episode to a list.
  8. Visualise the policy. Print the grid with an arrow in each cell pointing to its greedy action, and mark the start, goal, and obstacles. Plot or print the progress list.
  9. Read the result. Confirm the greedy policy traces a sensible short path, and that the progress curve falls and flattens. Compare the Q-table early versus late and find where value spread backward from the goal.
  10. Write the README. Explain the update rule in your own words, and note what surprised you.

Suggested file structure

builds/B1/
  q_learning.py     # environment + Q-table + epsilon-greedy + train loop + policy print
  progress.png      # steps-to-goal per episode (or print to console instead)
  README.md         # what you built, what you learned, gotchas, numbers observed

One file is plenty at this size (the milestone is roughly 80 lines). Splitting the environment and agent into separate files is fine but not required.

Starter skeleton

The two lines that carry the learning are left for you to write. Everything around them is scaffolding. Writing those two lines yourself is the milestone.

import numpy as np

# --- environment: 5x5 gridworld ------------------------------------
GRID = (5, 5)
START, GOAL = (4, 0), (0, 4)
OBSTACLES = {(1, 1), (2, 2), (3, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
N_STATES, N_ACTIONS = GRID[0] * GRID[1], len(ACTIONS)

def to_state(cell):
    return cell[0] * GRID[1] + cell[1]

def step(cell, a):
    dr, dc = ACTIONS[a]
    nxt = (cell[0] + dr, cell[1] + dc)
    on_grid = 0 <= nxt[0] < GRID[0] and 0 <= nxt[1] < GRID[1]
    if not on_grid or nxt in OBSTACLES:
        nxt = cell                       # illegal move: stay put
    if nxt == GOAL:
        return nxt, 1.0, True            # reaching the goal ends the episode
    return nxt, -0.01, False             # small step cost rewards short paths

# --- agent ---------------------------------------------------------
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.95
episodes, max_steps = 500, 100

def choose(s, epsilon):
    # TODO: with prob epsilon return a random action (explore),
    # otherwise return int(np.argmax(Q[s])) (exploit).
    ...

for ep in range(episodes):
    epsilon = max(0.05, 1.0 - ep / (0.8 * episodes))   # explore early, exploit late
    cell = START
    for t in range(max_steps):
        s = to_state(cell)
        a = choose(s, epsilon)
        nxt, r, done = step(cell, a)
        s2 = to_state(nxt)
        # TODO: the Q-learning update.
        #   target = r if done else r + gamma * np.max(Q[s2])
        #   Q[s, a] += alpha * (target - Q[s, a])
        ...
        cell = nxt
        if done:
            break
# then: print the greedy policy as arrows, and plot steps-to-goal per episode

Expected output

After training, printing the greedy policy gives a grid of arrows that trace a sensible short path from start (S) to goal (G), routing around the obstacles (#). Your exact arrows, path, and obstacle layout will differ with the seed and your choices; this is one plausible result, not a target to match:

 →  →  →  →  G
 ↑  #  ↑  →  ↑
 ↑  →  #  →  ↑
 ↑  #  →  →  ↑
 S  →  →  →  ↑

The progress curve tells the same story over time. Early episodes take many steps to reach the goal and the count is noisy, because the agent is mostly exploring. As epsilon decays and the Q-values settle, steps-to-goal falls and then flattens near the length of a short path. Exact episode counts, final path length, and the shape of the early noise vary by run; what matters is the trend: high and noisy, then falling, then flat. If you print the Q-table early versus late, you should see value first appear in cells next to the goal, then spread outward across episodes.

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.

COMPLETE The agent trains, the greedy policy traces a sensible short path to the goal, and the steps-to-goal curve falls and flattens. You can explain, in the README and out loud, what each term of the update does (the reward, the discounted best next value, the difference being corrected), why epsilon decays, and where you can see value having propagated backward from the goal. The update and the table are your own numpy.
RUNS-NOT-UNDERSTOOD It converges, but you can't yet explain the update rule, or why the max(Q[s_next]) term carries value backward, or why epsilon decay matters. The fix is to re-read L6 and trace the update by hand on a single transition near the goal. Do not mark COMPLETE.
TOOL-LOCKED It works only because you used an RL library or a framework's built-in Q-learning rather than writing the table and the update yourself. The milestone's whole purpose is the visible mechanism, so reframe it to numpy by hand before marking complete.
INCOMPLETE Unfinished, or the agent won't converge and you haven't found the bug yet. A valid resting state for a depth-by-choice track. Come back to it.

Common pitfalls

These are conceptual traps, the places understanding slips even when the code runs.

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

SymptomLikely cause
ModuleNotFoundError: No module named 'numpy'numpy is not installed in the Python you are running. This is a setup issue, not a Q-learning one. See Tooling: Installing packages and Reading errors.
Agent never reaches the goalThe goal step does not return done = True, or max_steps is too small to cross the grid, or epsilon starts near zero so the agent never explores a path in.
Policy stays random-looking after trainingThe update writes to the wrong index (using s2 where s belongs, or vice versa), or epsilon never decays so the greedy policy is never exercised. Print Q for a cell next to the goal; it should be clearly nonzero.
Values explode or go to NaNBootstrapping past the terminal (adding gamma * max(Q[s_next]) at the goal), or gamma at 1.0 with a positive reward in a loop, or alpha too high. Ensure target = r at terminal and lower alpha.
Convergence extremely slowEpsilon decays too slowly so the agent keeps exploring, too few episodes, or the path is long. Raise the episode count or bring the epsilon floor down sooner.
Agent oscillates between two cellsNo step cost, so nothing rewards progress, or argmax always breaks ties the same way. Add a small negative step reward and break ties randomly.
Arrows point into walls or off-gridA state-indexing bug in to_state (row and column swapped, or off by one). Print to_state for a few known cells and check against the grid.

Optional extensions

why this build exists L6 gives you the vocabulary of reinforcement learning; B1 puts it in your hands. State, action, reward, and policy stop being words and become a cell index, one of four moves, a number, and an argmax over a table you can print. Exploration versus exploitation is one if-statement you wrote. Temporal credit assignment is value visibly creeping backward from the goal, episode by episode, which is the cleanest answer to how it learns without being told the answer. Q-learning (Watkins, 1989) is the most durable way to feel this, because there is no neural network in the way: a table and one update rule, as valid in 2050 as now. It is also the first build where you write a learning loop by hand, the same forward, observe, update shape that returns at every later training milestone. Done honestly, it dissolves the mysticism: "the agent learned the maze" becomes "argmax over a table of numbers shaped by a feedback rule."