# The build track (v3)

A parallel practical thread that runs alongside the lesson series: 15 core milestones, 1 prerequisite (B0, inventory your kit), and 2 optional extensions, all numbered in the order you meet them across phases 0-6. By the end of the core path, you've hand-built a small but real AI stack: RL agent → tokenizer → MLP → CNN → attention → transformer → pretrain → fine-tune → embedding search → local inference. The two optional extensions, B6 (TinyML at the edge) and B17 (distributed inference), reach tier 0 (microcontroller) and tier 3+ (distributed inference), so the spectrum from microcontroller to multi-machine is touched directly.

Each milestone is small (1-4 hours typically). The point isn't a portfolio piece. The point is to make the concepts live in your hands. Reading about attention is not the same as having implemented it.

**Crucial design principle.** Coding skill is **not** a hard prerequisite for conceptual progress. A learner who skips every build can still complete the course conceptually. Synthesis lessons and calibration assessments don't depend on builds having been done. The build track is depth-by-choice.

---

## The principles

1. **Numpy first.** Always. Framework-free implementations until autograd genuinely earns its keep. This forces the math to be visible. You can't fool yourself about whether you understand backprop after writing it in numpy.

2. **PyTorch second.** Only when the math gets too painful by hand. By that point you know what autograd is automating, so you treat PyTorch as a labour-saver, not a magic box.

3. **No Hugging Face `Trainer` magic in the build track.** Load models with `transformers`, run loops by hand. The loop is part of what you're learning.

4. **For inference: llama.cpp / ggml.** Vendor-neutral. Fully local. Stable file format that has already outlived several training frameworks. Will almost certainly outlast any specific hosted API.

5. **Avoid framework lock-in.** If a milestone could be done in JAX, PyTorch, or numpy, the milestone is described in language that maps to all three. Code samples in lessons use the option that surfaces the concept best.

6. **Plain files for data.** A folder of .txt or a .jsonl. No `datasets` library wrappers until you'd genuinely benefit from streaming or sharding.

7. **Time-box each milestone.** If a build runs over 2× its estimate, you've found a gap in understanding. Stop building, re-read the relevant lesson, then continue.

---

## The milestones

### B0: Inventory your kit

**Phase**: 0. **After lesson**: L0 (orientation). **Time**: ~30 min.

**What you build**: A short markdown file in your `builds/` folder cataloguing what you actually have. CPU (model, cores), RAM (capacity, speed), GPU if any (model, VRAM, driver version), embedded boards if any (Pi, Jetson, Arduino, ESP32 — model and memory), network setup if you'll do multi-machine work later.

**Concept reinforced**: every build that follows lands on real silicon. Knowing your actual constraints up front means later milestones won't surprise you when they don't fit. The compute-spectrum lens starts here.

**Tooling**: any text editor. `lscpu`, `nvidia-smi`, `free -h`, or equivalents.

**Anti-obsolescence**: hardware will change. The habit of writing down what you've got won't.

---

### B1: Tabular Q-learning agent on gridworld

**Phase**: 1. **After lesson**: L6 (Sequential decision making and reward). **Time**: ~3 hours.

**What you build**: A 5×5 gridworld with a goal cell and a few obstacle cells. A Q-learning agent (literal Q-table, no neural net) that learns to reach the goal. Plot the policy over training episodes.

**Concept reinforced**: state, action, reward, policy, and exploration vs exploitation as a literal epsilon-greedy choice. Temporal credit assignment becomes visible when you watch the Q-values propagate backwards from the goal.

**Tooling**: numpy + matplotlib. Maybe ~80 lines.

**Anti-obsolescence**: Q-learning is from the 1980s and remains the cleanest way to feel what RL is. This milestone will be valid in 2050.
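
A minimal starting sketch of the update loop, not a reference solution. The layout (start top-left, goal bottom-right, three obstacle cells), the reward values, the hyperparameters, and the names `step`, `train`, and `greedy_rollout` are all assumptions made for illustration. Plotting is left out.

```python
import numpy as np

# Hypothetical layout: 5x5 grid, start (0, 0), goal (4, 4), three obstacles.
SIZE = 5
GOAL = (4, 4)
OBSTACLES = {(1, 1), (2, 3), (3, 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply a move; walls and obstacles leave the agent where it was."""
    r, c = state
    dr, dc = MOVES[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE) or nxt in OBSTACLES:
        nxt = state
    done = nxt == GOAL
    reward = 1.0 if done else -0.01  # small step cost (an assumption)
    return nxt, reward, done


def train(episodes=2000, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((SIZE, SIZE, len(MOVES)))  # the literal Q-table
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(200):  # cap on episode length
            r, c = state
            if rng.random() < epsilon:           # explore
                action = int(rng.integers(len(MOVES)))
            else:                                # exploit
                action = int(np.argmax(q[r, c]))
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * np.max(q[nxt]))
            q[r, c, action] += alpha * (target - q[r, c, action])
            state = nxt
            if done:
                break
    return q


def greedy_rollout(q, max_steps=30):
    """Follow the learned policy with no exploration; return visited cells."""
    state, path = (0, 0), [(0, 0)]
    for _ in range(max_steps):
        state, _reward, done = step(state, int(np.argmax(q[state])))
        path.append(state)
        if done:
            break
    return path
```

Printing `np.max(q, axis=2)` every few hundred episodes is one way to watch the values spread backwards from the goal cell.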

---

### B2: Tokenizer explorer

**Phase**: 1. **After lesson**: L8 (Tokens). **Time**: ~1 hour.

**What you build**: A small CLI or notebook that takes text, runs it through 2-3 different tokenizers (GPT-style, Llama-style, character-level), shows the token IDs side-by-side, and reports token count, vocabulary coverage, and characters-per-token ratio.

**Concept reinforced**: tokens as the model's actual unit; how the same text decomposes differently across families; non-English language penalties.

**Tooling**: `tiktoken` (OpenAI), `sentencepiece` (Llama), pure Python for character-level. ~30 lines of code total.

**Anti-obsolescence**: the tokenizers themselves will evolve. The *principle* (different families see different tokens) doesn't. Keep the comparison rig, swap the tokenizers as new ones land.
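
One possible shape for the comparison rig, sketched with three stand-in tokenizers that need no downloads (character, UTF-8 byte, whitespace). These stand-ins are assumptions for illustration; a GPT-style or Llama-style tokenizer would slot into the same dictionary as any callable that maps a string to a list of tokens.

```python
def char_tokens(text):
    """Character-level: one token per Unicode code point."""
    return list(text)


def byte_tokens(text):
    """Byte-level stand-in: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def whitespace_tokens(text):
    """Whitespace stand-in: one token per space-separated word."""
    return text.split()


# Hypothetical registry; swap in real tokenizers' encode functions here.
TOKENIZERS = {
    "char": char_tokens,
    "utf8_bytes": byte_tokens,
    "whitespace": whitespace_tokens,
}


def compare(text):
    """Return token count and characters-per-token for each tokenizer."""
    rows = {}
    for name, tokenize in TOKENIZERS.items():
        tokens = tokenize(text)
        rows[name] = {
            "tokens": len(tokens),
            "chars_per_token": len(text) / max(len(tokens), 1),
        }
    return rows


if __name__ == "__main__":
    for name, row in compare("h\u00e9llo w\u00f6rld").items():
        print(f"{name:12s} {row['tokens']:4d} tokens  "
              f"{row['chars_per_token']:.2f} chars/token")
```

Even with these stand-ins, accented text costs more byte-level tokens than characters, which is a small-scale version of the non-English penalty.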

---

### B3: Vector playground

**Phase**: 2. **After lesson**: L12 (Distance, similarity, and semantic geometry). **Time**: ~2 hours.

**What you build**: Load 5000 pre-trained word or sentence embeddings (sentence-transformers `all-MiniLM-L6-v2` works fine). Compute cosine similarity. Find nearest neighbours for example queries. Do `king - man + woman` style arithmetic. Project to 2D with t-SNE or PCA and visualise.

**Concept reinforced**: embeddings as direction; cosine similarity as the workhorse operation; dimensionality reduction as compression with loss.

**Tooling**: numpy, sentence-transformers (just for the embeddings), matplotlib. Avoid faiss for now; indexing is B15's topic.

**Anti-obsolescence**: embedding models will change every year. The geometry doesn't. The code generalises to whatever embedding you load.
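
A minimal sketch of the two core operations, cosine similarity and nearest-neighbour lookup, run on hand-made 3-dimensional toy vectors rather than real embeddings. The vectors, the word list, and the function names are assumptions for illustration; with real embeddings you would replace `VECTORS` with the loaded matrix.

```python
import numpy as np

# Toy vectors (royalty, gender, fruit-ness). Hand-made, NOT real embeddings.
WORDS = ["king", "queen", "man", "woman", "apple"]
VECTORS = np.array([
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def normalise(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def nearest(query, matrix, k=3):
    """Return indices and cosine similarities of the k closest rows."""
    sims = normalise(matrix) @ normalise(query)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def analogy(a, b, c):
    """Compute a - b + c and return the closest word not among the inputs."""
    vec = {w: v for w, v in zip(WORDS, VECTORS)}
    query = vec[a] - vec[b] + vec[c]
    top, _ = nearest(query, VECTORS, k=len(WORDS))
    for i in top:
        if WORDS[i] not in (a, b, c):
            return WORDS[i]
    return None
```

On real embeddings the analogy result is usually less clean than on these toy vectors, which is part of what the playground is for.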

---

### B4: Gradient descent visualiser

**Phase**: 2. **After lesson**: L19 (Optimisation landscapes). **Time**: ~2 hours.

**What you build**: A 2D contour plot of a simple loss surface (Himmelblau or Rosenbrock). Run SGD with different learning rates, with and without momentum. Plot the trajectory of the optimiser over the contours.

**Concept reinforced**: gradients point uphill; learning rate is a step size; momentum smooths the path; loss surfaces have many minima; the optimiser doesn't know what you can see in the plot.

**Tooling**: numpy + matplotlib. ~100 lines.

**Anti-obsolescence**: the optimiser zoo will keep growing (Adam, Lion, Sophia, the next one). The plot stays.
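
A minimal sketch of the optimiser half of this build, using the Himmelblau function and a plain heavy-ball momentum update. The learning rate, step count, starting point, and function names are assumptions for illustration; the contour plot is left to you.

```python
import numpy as np


def himmelblau(p):
    """Himmelblau's function: four minima, all with value 0."""
    x, y = p
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def grad(p):
    """Analytic gradient of Himmelblau's function."""
    x, y = p
    a = x**2 + y - 11
    b = x + y**2 - 7
    return np.array([4 * x * a + 2 * b, 2 * a + 4 * y * b])


def descend(start, lr=0.001, momentum=0.0, steps=3000):
    """Gradient descent with optional momentum; returns the full trajectory."""
    p = np.array(start, dtype=float)
    v = np.zeros_like(p)
    path = [p.copy()]
    for _ in range(steps):
        v = momentum * v - lr * grad(p)  # momentum=0.0 gives plain descent
        p = p + v
        path.append(p.copy())
    return np.array(path)


if __name__ == "__main__":
    for m in (0.0, 0.9):
        path = descend((0.0, 0.0), momentum=m)
        print(f"momentum={m}: ended at {path[-1]}, loss {himmelblau(path[-1]):.2e}")
```

Overlaying `path[:, 0]` and `path[:, 1]` on the contours for several learning rates shows which minimum each run falls into, something the optimiser itself never sees.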

---

### B5: Profile your compute and bandwidth

**Phase**: 3. **After lesson**: L28 (The roofline model). **Time**: ~3 hours.

**What you build**: Write a tiny matmul kernel in numpy (or PyTorch if you have a GPU). Measure wall-clock time for matrix sizes from 64×64 up to 4096×4096. Compute achieved FLOPS. Plot against your hardware's theoretical peak. Identify where you're memory-bound vs compute-bound.

**Concept reinforced**: the roofline isn't a textbook diagram; it's your laptop. You feel the memory wall directly.

**Tooling**: numpy + matplotlib. PyTorch if you have CUDA. Optionally `nvidia-smi` or `perf` for memory bandwidth measurements.

**Anti-obsolescence**: hardware will change. The roofline shape and the way you measure it won't.
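
A minimal sketch of the measurement, assuming square float32 matrices and numpy's built-in `@` as the "kernel". The operation count of 2n³ for an n×n matmul and the intensity estimate (three matrices moved once each) are the usual textbook approximations; the function names and repeat count are assumptions.

```python
import time

import numpy as np


def achieved_gflops(n, repeats=3, seed=0):
    """Time an n x n float32 matmul and return the best achieved GFLOP/s."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    a @ b  # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    flops = 2 * n**3  # one multiply and one add per inner-product term
    return flops / max(best, 1e-9) / 1e9


def arithmetic_intensity(n, bytes_per_element=4):
    """FLOPs per byte if A, B, and C each cross the memory bus exactly once."""
    return (2 * n**3) / (3 * n**2 * bytes_per_element)


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(f"n={n:5d}  {achieved_gflops(n):8.1f} GFLOP/s  "
              f"intensity ~{arithmetic_intensity(n):.1f} FLOP/byte")
```

Plotting achieved GFLOP/s against intensity on log axes, with your hardware's peak compute and bandwidth lines drawn in, gives the roofline for your own machine.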

---

### B6 (optional): TinyML classifier on an MCU

**Phase**: 3. **After lessons**: L30 (NPUs and edge inference) and L32 (quantisation). **Time**: ~6-8 hours (much of it tooling setup). **Pathway role**: optional in the core; prerequisite for the Edge AI Lab (see `build-track-pathways.md`).

**What you build**: A small classifier (audio keyword spotting, or accelerometer-gesture recognition) running on an Arduino-class or ESP32 board. Train a tiny model on a host machine. Quantise to int8. Convert to TensorFlow Lite Micro (or equivalent). Deploy. The whole thing fits in ~256 KB flash and tens of KB RAM, runs on milliwatts.

**Concept reinforced**: model compression all the way to tier 0. The same mechanisms (representation, optimisation, quantisation) work at 6 orders of magnitude less compute and memory than a frontier model. The "constraints shape systems" theme is now physical.

**Tooling**: TensorFlow Lite Micro or equivalent. An Arduino, ESP32, or Pico board. A microphone or accelerometer.

**Anti-obsolescence**: the toolchain will move; the principle (everything runs everywhere, with the right compression) is permanent.

---

### B7: Quantisation lab

**Phase**: 3. **After lesson**: L32 (Quantisation). **Time**: ~2 hours.

**What you build**: Take a small pretrained model (a ~125M parameter GPT-2 small or similar). Quantise the weights to int8 and int4 with simple linear quantisation. Compare perplexity on a small text sample at fp32, int8, int4. Plot the trade-off.

**Concept reinforced**: quantisation isn't free; the curve from fp32 → int4 isn't smooth; some layers tolerate it better than others.

**Tooling**: PyTorch for model loading, numpy for the quantisation math (do it by hand; don't call PyTorch's built-in quantisation helpers such as `torch.quantize_per_tensor`).

**Anti-obsolescence**: the model used will date. The exercise (measure quality vs precision) is permanent.
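
A minimal sketch of "simple linear quantisation", assuming a symmetric per-tensor scheme with one scale per weight matrix. That choice, and the function names, are assumptions; per-channel or asymmetric variants follow the same pattern. Here int4 values are stored in an int8 container for simplicity.

```python
import numpy as np


def quantise(w, bits):
    """Symmetric linear quantisation: map weights to integers in [-qmax, qmax]."""
    w = np.asarray(w, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1            # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantise(q, scale):
    """Map the integers back to approximate float weights."""
    return q.astype(np.float64) * scale


def rmse(w, bits):
    """Root-mean-square error introduced by a quantise/dequantise round trip."""
    q, scale = quantise(w, bits)
    return float(np.sqrt(np.mean((dequantise(q, scale) - w) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(10_000)  # stand-in for one weight matrix
    for bits in (8, 4):
        print(f"int{bits}: rmse {rmse(weights, bits):.5f}")
```

Weight error is only a proxy. The milestone asks for perplexity, which means writing the dequantised weights back into the model and re-running the evaluation.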

---

### B8: MLP from scratch (numpy, backprop by hand)

**Phase**: 4. **After lesson**: L36 (Backpropagation). **Time**: ~4 hours.

**What you build**: A 2-layer MLP that classifies MNIST digits. Forward pass, loss, backward pass, weight update. All in numpy. No autograd. No PyTorch. Hit ~95% accuracy.

**Concept reinforced**: backprop is the chain rule, all the way down. The mystery dissolves when you've written every partial derivative.

**Tooling**: numpy. That's it. Maybe matplotlib to plot loss curves.

**Anti-obsolescence**: this is the most important milestone for long-term understanding. Numpy will be around in 2050. Backprop will be around in 2150.
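
A minimal sketch of the forward and backward pass for a 2-layer MLP with ReLU and softmax cross-entropy. The layer sizes, initialisation scale, and function names are assumptions, and MNIST loading and the training loop are left out. A finite-difference check against these gradients is a cheap way to confirm your own derivation.

```python
import numpy as np


def init_params(d_in, d_hidden, d_out, seed=0):
    """Small random weights, zero biases (an illustrative initialisation)."""
    rng = np.random.default_rng(seed)
    return [
        rng.standard_normal((d_in, d_hidden)) * 0.1, np.zeros(d_hidden),
        rng.standard_normal((d_hidden, d_out)) * 0.1, np.zeros(d_out),
    ]


def forward(params, x):
    w1, b1, w2, b2 = params
    h_pre = x @ w1 + b1
    h = np.maximum(h_pre, 0.0)                           # ReLU
    logits = h @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)         # softmax
    return h_pre, h, probs


def loss_and_grads(params, x, y):
    """Mean cross-entropy loss and its gradient for every parameter."""
    w1, b1, w2, b2 = params
    n = x.shape[0]
    h_pre, h, probs = forward(params, x)
    loss = -np.log(probs[np.arange(n), y]).mean()

    # Backward pass: the chain rule, one layer at a time.
    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1.0      # d(loss)/d(logits) = probs - one_hot
    dlogits /= n
    dw2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ w2.T
    dh_pre = dh * (h_pre > 0)            # ReLU passes gradient where it fired
    dw1 = x.T @ dh_pre
    db1 = dh_pre.sum(axis=0)
    return loss, [dw1, db1, dw2, db2]
```

A weight update is then `p -= lr * g` for each parameter and gradient pair.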

---

### B9: Tiny CNN

**Phase**: 4. **After lesson**: L37 (Convolutional nets). **Time**: ~3 hours.

**What you build**: A small CNN (2-3 conv layers, 1 dense layer) that classifies MNIST or CIFAR-10. Now in PyTorch, because doing conv backprop in numpy is masochism without benefit. Compare accuracy to your MLP from B8.

**Concept reinforced**: locality and weight sharing pay off; PyTorch autograd is doing the work you did by hand in B8.

**Tooling**: PyTorch (autograd earns its keep here). ~80 lines.

**Anti-obsolescence**: CNN-on-image-classification is a 2012 win. Still works, still teaches the same lesson. PyTorch may eventually be replaced; the architecture won't be.

---

### B10: Attention from scratch (numpy)

**Phase**: 4. **After lesson**: L40 (Attention). **Time**: ~3 hours.

**What you build**: A single attention head in numpy. Q, K, V projections, scaled dot product, softmax, weighted sum of values. Apply it to a toy sequence of word embeddings and visualise the attention matrix.

**Concept reinforced**: the attention pattern is *just* a softmax over dot products of Q and K. The full transformer repeats this operation across heads and layers.

**Tooling**: numpy. ~60 lines.

**Anti-obsolescence**: this implementation will be valid as long as transformers are in use. Even if transformers get replaced, having written attention by hand pays back in understanding the next architecture.
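
A minimal sketch of one head, assuming no masking and no batching. The random projection matrices stand in for learned weights, and the dimensions and names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Softmax with the max subtracted first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_head(x, wq, wk, wv):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    q, k, v = x @ wq, x @ wk, x @ wv            # Q, K, V projections
    scores = q @ k.T / np.sqrt(k.shape[-1])     # scaled dot product
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights                 # weighted sum of values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 5, 8, 4          # toy sizes (assumptions)
    x = rng.standard_normal((seq_len, d_model)) # stand-in word embeddings
    wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    out, weights = attention_head(x, wq, wk, wv)
    print(np.round(weights, 2))                 # the matrix to visualise
```

The returned `weights` array is the attention matrix the milestone asks you to visualise; row i shows how much position i draws from every other position.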

---

### B11: Tiny transformer

**Phase**: 4. **After lesson**: L43 (The transformer block). **Time**: ~4 hours.

**What you build**: A small transformer (4-6 layers, ~1M parameters) in PyTorch. The model only, no training yet. Implement multi-head attention, feed-forward, residual + layer norm. Karpathy's `nanoGPT` is a good reference but don't copy it; implement from your own notes.

**Concept reinforced**: the transformer is repeats of a single block. You should be able to draw the architecture on the back of a napkin after this.

**Tooling**: PyTorch. ~150-200 lines.

**Anti-obsolescence**: transformers may not be the dominant architecture in 2035. Even so, the practice of building a real architecture from scratch is permanent.
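
Before writing the model, it can help to check that a configuration lands near the ~1M parameter target. The sketch below counts parameters for a GPT-style block under stated assumptions: biases on every linear layer, two layer norms per block, a 4× feed-forward expansion, learned position embeddings, and an output head tied to the token embedding. The example sizes (width 128, 4 layers, 65-symbol vocabulary, 256-token context) are illustrative, not prescribed.

```python
def transformer_params(d_model, n_layers, vocab, context,
                       mlp_ratio=4, tied_head=True):
    """Rough parameter count for a GPT-style decoder (see assumptions above)."""
    attn = 4 * d_model * d_model + 4 * d_model      # Wq, Wk, Wv, Wo + biases
    hidden = mlp_ratio * d_model
    mlp = 2 * d_model * hidden + hidden + d_model   # two linear layers + biases
    norms = 2 * 2 * d_model                         # two layer norms, gain + bias
    block = attn + mlp + norms
    embeddings = vocab * d_model + context * d_model
    final_norm = 2 * d_model
    head = 0 if tied_head else vocab * d_model
    return n_layers * block + embeddings + final_norm + head


if __name__ == "__main__":
    n = transformer_params(d_model=128, n_layers=4, vocab=65, context=256)
    print(f"{n:,} parameters")
```

The number of heads does not appear because splitting the same width across more heads leaves the projection sizes unchanged.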

---

### B12: Pretrain your tiny transformer

**Phase**: 5. **After lesson**: L49 (Pretraining). **Time**: ~3 hours active + overnight training.

**What you build**: Train the B11 transformer on tinyshakespeare (~1MB of text) or the first 100MB of openwebtext. Implement the training loop by hand: forward, loss (cross-entropy on next token), backward, optimiser step, gradient clipping, learning rate schedule. Sample from the model. Watch the loss come down.

**Concept reinforced**: the entire pretraining loop is ~50 lines of code wrapped around a transformer. The cost dominates; the algorithm is simple.

**Tooling**: PyTorch. AdamW. Maybe a single GPU; CPU works for tinyshakespeare with patience.

**Anti-obsolescence**: pretraining recipes will evolve. The loop shape (forward, loss, backward, step) is a 70-year-old idea that isn't going anywhere.
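
Two pieces of the loop are framework-free bookkeeping and can be sketched on their own: the learning rate schedule and gradient clipping. The schedule below is linear warmup followed by cosine decay, a common choice but an assumption here, as are the default values and function names. Clipping is by global norm across all gradient arrays.

```python
import math

import numpy as np


def lr_at(step, max_lr=3e-4, warmup=100, total=5000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr at `total` steps."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= total:
        return min_lr
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm <= max_norm."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total


if __name__ == "__main__":
    for step in (0, 99, 100, 2500, 5000):
        print(step, f"{lr_at(step):.2e}")
```

Logging the pre-clip norm returned by `clip_by_global_norm` alongside the loss is a simple way to spot instability before the loss curve shows it.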

---

### B13: SFT on a small open model

**Phase**: 5. **After lesson**: L52 (Supervised fine-tuning). **Time**: ~3 hours.

**What you build**: Take a small open base model that has not yet been instruction-tuned (e.g. Pythia-1B, TinyLlama-1.1B, Qwen2.5-0.5B). Fine-tune on ~1000 instruction-response pairs from a public dataset. Compare generations before and after.

**Concept reinforced**: SFT is just supervised next-token prediction on a curated dataset. The "magic" of instruction-tuned models is mostly the dataset, not the algorithm.

**Tooling**: Hugging Face `transformers` for model loading. Write the training loop by hand (no `Trainer`). LoRA via `peft` is fine to keep this on a single GPU.

**Anti-obsolescence**: the models used will date fast. The exercise (supervised fine-tune, observe change) is permanent.

---

### B14: DPO toy run

**Phase**: 5. **After lesson**: L54 (DPO and lighter alternatives). **Time**: ~3 hours.

**What you build**: Take a small SFT'd model. Use a public preference dataset (~500 pairs is enough). Run DPO. Compare generations before and after. Notice that the model now prefers the chosen responses.

**Concept reinforced**: DPO is a simple loss function over preference pairs. You can see the alignment-via-preferences story without the RLHF machinery.

**Tooling**: PyTorch + `trl` library, or implement the DPO loss by hand (it's ~10 lines).

**Anti-obsolescence**: the specific alignment method-of-the-day will change. The pattern of "use preferences to shape behaviour" will not.
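
A minimal numpy sketch of the DPO loss itself, assuming you have already computed the summed log-probability of each chosen and rejected response under both the policy and the frozen reference model. The argument names and the default `beta=0.1` are assumptions for illustration; in a real run these values would be autograd tensors, not numpy arrays.

```python
import numpy as np


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is an array of summed response log-probabilities.
    """
    # How much more the policy favours each response than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))


if __name__ == "__main__":
    ref_c, ref_r = np.array([-10.0]), np.array([-12.0])
    print(dpo_loss(ref_c, ref_r, ref_c, ref_r))                 # policy == reference
    print(dpo_loss(np.array([-8.0]), np.array([-14.0]), ref_c, ref_r))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability towards the chosen responses.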

---

### B15: Embedding search

**Phase**: 6. **After lesson**: L59 (Embeddings in practice). **Time**: ~3 hours.

**What you build**: A small local corpus (your own notes, a single book, 1000 PDFs). Chunk it, embed it (sentence-transformers), build a flat index (numpy cosine similarity is fine for <100k chunks). Build a query loop. Measure retrieval quality on a handful of queries you write yourself.

**Concept reinforced**: RAG starts with retrieval, and retrieval quality is the bottleneck.

**Tooling**: numpy for the index. sentence-transformers for embeddings. Avoid Pinecone, Chroma, Weaviate for this milestone. You want to see the math.

**Anti-obsolescence**: vector stores will consolidate. Cosine similarity over dense vectors will not change.
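
A minimal sketch of the three framework-free pieces: chunking, a flat cosine index, and a hit-rate measure for your hand-written queries. The chunk size, overlap, and function names are assumptions, and the embedding step is left out; `index` and `query_vecs` are whatever your embedding model returns, stacked into arrays.

```python
import numpy as np


def chunk(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap."""
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def normalise(x):
    """Scale rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(index, query_vec, k=5):
    """Flat search: score every chunk vector, return the k best chunk ids."""
    scores = normalise(index) @ normalise(query_vec)
    return np.argsort(-scores)[:k]


def hit_rate(index, query_vecs, expected_ids, k=5):
    """Fraction of queries whose expected chunk id appears in the top k."""
    hits = sum(
        int(expected in top_k(index, q, k))
        for q, expected in zip(query_vecs, expected_ids)
    )
    return hits / len(expected_ids)
```

Writing down the expected chunk for each test query before you run the search keeps the measurement honest.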

---

### B16: Local inference rig (llama.cpp)

**Phase**: 6. **After lesson**: L66 (Inference engines). **Time**: ~2 hours.

**What you build**: Get llama.cpp running on your machine. Download a quantised GGUF of a 7B-class model (Llama 3, Qwen 2.5, Mistral, doesn't matter). Run inference from the command line and from a Python binding. Measure tokens per second. Try 4-bit, 5-bit, and 8-bit quantisations (for example Q4_K_M, Q5_K_M, Q8_0) and compare quality.

**Concept reinforced**: you can run a real model entirely on your own hardware. The whole field doesn't depend on the cloud.

**Tooling**: llama.cpp. `llama-cpp-python` for the binding.

**Anti-obsolescence**: llama.cpp is the most durable inference tool in the open ecosystem. GGUF as a format will likely outlive every other current format. Even when something replaces it, the *skill* of running a model locally is permanent and aerospace-friendly (on-prem, no cloud dependency, no data egress).

---

## Optional spectrum extensions

The two optional builds, B6 (TinyML, in Phase 3 above) and B17 (distributed inference, below), reach to the edge and distributed tiers of the compute spectrum. The conceptual lessons cover the same ground; the builds make the constraints physical.

### B17 (optional): Distributed inference across 2 machines

**Phase**: 6. **After lessons**: L66 (inference engines) and L67 (private/on-prem). **Time**: ~4-6 hours. **Pathway role**: optional in the core; prerequisite for the AI Server Lab (see `build-track-pathways.md`).

**What you build**: Serve a single model with KV cache or attention compute split across 2 networked machines, using either a real distributed inference engine (vLLM or SGLang in distributed mode) or a simpler hand-rolled pipeline-parallel split over sockets between 2 of your own machines.

**Concept reinforced**: tier-3+ patterns are accessible to a serious sideline, not a hyperscaler-only domain. The systems engineering (placement, bandwidth, latency) becomes visible when you have to wire it yourself.

**Tooling**: vLLM, sglang, or hand-rolled PyTorch with `torch.distributed`. Two machines on the same network.

**Anti-obsolescence**: distributed inference will keep evolving. The pattern of "split a model across machines, hide latency in the pipeline" is permanent and aerospace-friendly (on-prem rack of modest boxes can serve a real model).

---

## Specialisation pathways (optional)

The 18 milestones above (B0-B17, counting the optional B6 and B17) are the Core Build Track, the spine of the practical thread. Every learner who builds starts from the core, and this document stays the source of truth for it.

On top of the core sit four optional specialisation pathways, one per deployment environment on the compute spectrum: the Budget Lab (tier 1), the Edge AI Lab (tiers 0-1), the Prosumer Lab (tier 2), and the AI Server Lab (tier 3). They are depth-by-choice, independent of each other, and ranked equally. Each ends in a capstone the learner keeps using.

The pathways are defined in `build-track-pathways.md`, which uses its own milestone prefixes (BL, EL, PL, SL) and does not extend the B-number sequence. This document and that one do not overlap. The Build Track is governed by `governance/build-track-governance.md`, with naming rules in `governance/build-track-identity-standard.md` and validation in `validation/build-track-validation-standard.md`.

---

## Build artefacts go where?

Suggested: a separate `CLAUDE OUTPUTS/AI Course/builds/` folder, one subfolder per milestone (`B01-q-learning/`, `B02-tokenizer-explorer/`, ...). Each contains:

- `README.md`: what you built, what you learned, gotchas you hit.
- The code.
- Output plots, sample generations, or measurements.

The README is the most valuable part. Future-you will re-read it. Future-you will not re-read the code.

---

## What gets left out and why

**No agent framework builds** (LangChain, LlamaIndex, AutoGen, CrewAI). The lessons cover what these do; the builds use raw API calls so the loop is visible. Frameworks change yearly; the loop doesn't.

**No fine-tuning of frontier models.** Cost-prohibitive and the lesson doesn't need it. SFT on a 0.5-1B model teaches the same mechanics.

**No multimodal training builds.** Too expensive and too dependent on specific datasets. The conceptual lesson (L46) is sufficient; if you want a build later, it'd be an inference-side multimodal task.

**No production deployment builds (Kubernetes, model serving infra).** These are valuable but they're DevOps skills more than AI skills. Out of scope.

---

## Optional phase 7 extensions

These aren't required but are good if you want to extend the build track into the frontier phase.

- **Tiny world-model agent.** Train a model to predict next observation in gridworld; use it for planning. Maps to L72.
- **Mechanistic interpretability mini-experiment.** Take your B11 transformer; probe a single attention head; find what it attends to. Maps to L77.
- **Synthetic data self-train experiment.** Generate synthetic training data with a small model; train a smaller model on it. Maps to L55.

These get added if and when you have the appetite. Phase 7 is primarily a reading phase.

---

## Time totals

Roughly:
- B0 (inventory): 30 min
- Phase 1 builds: 4 hours
- Phase 2 builds: 4 hours
- Phase 3 builds: 5 hours
- Phase 4 builds: 14 hours (the big phase)
- Phase 5 builds: 9 hours
- Phase 6 builds: 5 hours
- **Core total: ~41.5 hours** plus an overnight pretraining run.
- B6 (optional TinyML): +6-8 hours
- B17 (optional distributed): +4-6 hours
- **Full-spectrum total: ~52-56 hours.**

At 1-2 hours per evening that's 4-6 weeks of part-time core build work, spread across the 8 months of the course. The optional extensions add another 1-2 weeks. Comfortably sideline-paced.
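
The arithmetic behind these totals can be checked in a few lines. The figures are the per-phase estimates listed above; the variable names are illustrative.

```python
import math

# Per-phase core estimates in hours, as listed above.
core_hours = {
    "B0": 0.5,
    "Phase 1": 4,
    "Phase 2": 4,
    "Phase 3": 5,
    "Phase 4": 14,
    "Phase 5": 9,
    "Phase 6": 5,
}
# Optional extensions as (low, high) estimates in hours.
optional_hours = {"B6": (6, 8), "B17": (4, 6)}

core_total = sum(core_hours.values())
full_low = core_total + sum(low for low, _ in optional_hours.values())
full_high = core_total + sum(high for _, high in optional_hours.values())

# Evenings needed for the core at 2 hours and at 1 hour per evening.
evenings_fast = math.ceil(core_total / 2)
evenings_slow = math.ceil(core_total / 1)

if __name__ == "__main__":
    print(f"core: {core_total} h, full spectrum: {full_low}-{full_high} h")
    print(f"core evenings: {evenings_fast}-{evenings_slow}")
```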

---

## Anti-obsolescence summary

If 5 years from now PyTorch has been replaced by something else, llama.cpp has been replaced, and all the current vendors are gone:

- B0, B1, B4, B8, B10 still work unchanged (pure numpy or just text).
- B2, B3, B5, B7, B9, B11, B12, B13, B14 need a tooling swap but the milestone definition holds.
- B6, B15, B16, B17 need a tooling swap; the *skill* is permanent.

The skills you build don't depend on the vendor. That's the design intent.

---

## Compute-spectrum coverage

By tier, after the full set of milestones:

| Tier | Substrate                     | Milestones                       |
|------|-------------------------------|----------------------------------|
| 0    | Microcontroller / MCU         | B6 (optional)                    |
| 1    | Edge / NPU / laptop CPU       | B1-B5, B8, B10, B15, B16         |
| 2    | Consumer GPU / workstation    | B7, B9, B11, B12, B13, B14, B16  |
| 3    | Multi-machine / small cluster | B17 (optional)                   |
| 4    | Distributed cluster           | covered conceptually in L57      |
| 5    | Hyperscale                    | covered conceptually in L49, L57 |

A learner who runs the full set, B0 through B17 including the optional B6 and B17, has touched the spectrum from MCU to multi-machine with their own hands. Tier 4 and 5 are conceptual; the cost of personal builds at that scale exceeds the lesson value.