Build a small reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, written by hand in numpy. No neural network, no RL library. You watch value spread backward from the goal, one cell per episode.
You build a reinforcement-learning agent that learns to cross a 5×5 grid to a goal, using a literal Q-table and the Q-learning update rule, by hand in numpy. The point is to make L6's vocabulary physical: state, action, reward, and policy become array indices and numbers you can print; exploration versus exploitation becomes a single if-statement; and temporal credit assignment becomes value you watch spread backward from the goal, one cell per episode. When it works, a printed grid of arrows shows the shortest path the agent found, and a steps-to-goal curve falls and flattens.
About 3 hours: roughly 1 hour on the environment, 1 hour on the agent and training loop, and 1 hour reading the output and getting the policy to converge. If it runs well past that, stop and re-read the L6 section on the update rule, which is the usual sticking point.
q_learning.py: the gridworld, the Q-table, epsilon-greedy selection, the training loop, and a function that prints the greedy policy as arrows.README.md: what you built, what you learned (the update rule in your own words), the gotchas you hit, and the numbers you observed (episodes to converge, final path length).done = True. Every other step returns a small negative reward, so shorter paths score higher.(25, 4), initialised to zeros. Rows are states, columns are actions.Q[s, a] += alpha * (target - Q[s, a]), where target = r at a terminal step and target = r + gamma * max(Q[s_next]) otherwise. Use roughly alpha = 0.1, gamma = 0.95.builds/B1/
q_learning.py # environment + Q-table + epsilon-greedy + train loop + policy print
progress.png # steps-to-goal per episode (or print to console instead)
README.md # what you built, what you learned, gotchas, numbers observed
One file is plenty at this size (the milestone is roughly 80 lines). Splitting the environment and agent into separate files is fine but not required.
The two lines that carry the learning are left for you to write. Everything around them is scaffolding. Writing those two lines yourself is the milestone.
import numpy as np
# --- environment: 5x5 gridworld ------------------------------------
GRID = (5, 5)
START, GOAL = (4, 0), (0, 4)
OBSTACLES = {(1, 1), (2, 2), (3, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)] # up, down, left, right
N_STATES, N_ACTIONS = GRID[0] * GRID[1], len(ACTIONS)
def to_state(cell):
return cell[0] * GRID[1] + cell[1]
def step(cell, a):
dr, dc = ACTIONS[a]
nxt = (cell[0] + dr, cell[1] + dc)
on_grid = 0 <= nxt[0] < GRID[0] and 0 <= nxt[1] < GRID[1]
if not on_grid or nxt in OBSTACLES:
nxt = cell # illegal move: stay put
if nxt == GOAL:
return nxt, 1.0, True # reaching the goal ends the episode
return nxt, -0.01, False # small step cost rewards short paths
# --- agent ---------------------------------------------------------
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.95
episodes, max_steps = 500, 100
def choose(s, epsilon):
# TODO: with prob epsilon return a random action (explore),
# otherwise return int(np.argmax(Q[s])) (exploit).
...
for ep in range(episodes):
epsilon = max(0.05, 1.0 - ep / (0.8 * episodes)) # explore early, exploit late
cell = START
for t in range(max_steps):
s = to_state(cell)
a = choose(s, epsilon)
nxt, r, done = step(cell, a)
s2 = to_state(nxt)
# TODO: the Q-learning update.
# target = r if done else r + gamma * np.max(Q[s2])
# Q[s, a] += alpha * (target - Q[s, a])
...
cell = nxt
if done:
break
# then: print the greedy policy as arrows, and plot steps-to-goal per episode
After training, printing the greedy policy gives a grid of arrows that trace a sensible short path from start (S) to goal (G), routing around the obstacles (#). Your exact arrows, path, and obstacle layout will differ with the seed and your choices; this is one plausible result, not a target to match:
→ → → → G
↑ # ↑ → ↑
↑ → # → ↑
↑ # → → ↑
S → → → ↑
The progress curve tells the same story over time. Early episodes take many steps to reach the goal and the count is noisy, because the agent is mostly exploring. As epsilon decays and the Q-values settle, steps-to-goal falls and then flattens near the length of a short path. Exact episode counts, final path length, and the shape of the early noise vary by run; what matters is the trend: high and noisy, then falling, then flat. If you print the Q-table early versus late, you should see value first appear in cells next to the goal, then spread outward across episodes.
Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.
max(Q[s_next]) term carries value backward, or why epsilon decay matters. The fix is to re-read L6 and trace the update by hand on a single transition near the goal. Do not mark COMPLETE.
These are conceptual traps, the places understanding slips even when the code runs.
r, with no gamma * max(Q[s_next]) term. Forgetting this leaks value past the goal and distorts the table. The single most common Q-learning error.These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.
| Symptom | Likely cause |
|---|---|
ModuleNotFoundError: No module named 'numpy' | numpy is not installed in the Python you are running. This is a setup issue, not a Q-learning one. See Tooling: Installing packages and Reading errors. |
| Agent never reaches the goal | The goal step does not return done = True, or max_steps is too small to cross the grid, or epsilon starts near zero so the agent never explores a path in. |
| Policy stays random-looking after training | The update writes to the wrong index (using s2 where s belongs, or vice versa), or epsilon never decays so the greedy policy is never exercised. Print Q for a cell next to the goal; it should be clearly nonzero. |
| Values explode or go to NaN | Bootstrapping past the terminal (adding gamma * max(Q[s_next]) at the goal), or gamma at 1.0 with a positive reward in a loop, or alpha too high. Ensure target = r at terminal and lower alpha. |
| Convergence extremely slow | Epsilon decays too slowly so the agent keeps exploring, too few episodes, or the path is long. Raise the episode count or bring the epsilon floor down sooner. |
| Agent oscillates between two cells | No step cost, so nothing rewards progress, or argmax always breaks ties the same way. Add a small negative step reward and break ties randomly. |
| Arrows point into walls or off-grid | A state-indexing bug in to_state (row and column swapped, or off by one). Print to_state for a few known cells and check against the grid. |
max(Q, axis=1) reshaped to the grid every few episodes and watch value fill outward from the goal.