Syllabus · v3

The full route, phase by phase.

8 phases, 79 numbered lessons (L1–L79) plus a Lesson 0 orientation, 8 phase syntheses, 8 calibration gates. The order is the order in which the field actually became possible: foundations first, then maths, hardware, architectures, training, deployment, research literacy, and finally the frontier.

Phases: 8 · Numbered lessons: 79 (+ L0) · Synthesis lessons: 8 · Calibration gates: 8 · Total sittings: 95 · Pace: 2–3 lessons / week · Duration: ~8 months

The progression

Each phase rests on the phases beneath it.

Architectures only make sense after the hardware they were shaped to fit. Training only makes sense after the architectures it has to scale. Deployment only makes sense after something has been trained. Each phase exits through a synthesis lesson (S) and a calibration assessment (C). Calibrations gate the move to the next phase.

Fig 1 · Phase dependency. Each box is a phase, each arrow is a dependency. S+C markers between phases are the synthesis lesson and calibration assessment that close each room. The violet band beneath is the parallel build track.

The 5 recurring core laws · threaded across every phase

Representation shapes computation. Established in L7 · recurs in P3, P4, P6
Optimisation shapes capability. Established in L5 + Phase 2 · recurs in P4, P5, P7B
Hardware shapes architecture. Established in Phase 3 · recurs in P4, P5, P6
Geometry enables generalisation. Established in L9 · recurs in P4, P6
Constraints shape systems. Threaded everywhere · the compute-spectrum lens is one expression of it

Phase 0 · orientation

Entering the stack

the doorway

The threshold-crossing. The reader stands at the workshop entrance, reads the schematic, and understands the route before walking it.

Conceptual shift: from public discourse (mystification or dismissal) to mechanism. What you gain: the worldview the rest of the course assumes.

Lessons: 1 (L0) · Time: ~30 minutes · Synthesis / calibration: none · Read: once at the start; re-read at any phase boundary if motivation flags

Lesson

L0. Entering the stack. Worldview, course architecture, why the lessons are sequenced this way. How retrieval practice, the memory palace, the glossary tooltips, synthesis lessons, calibration assessments, and the build track all earn their place.

Build: B0 (prerequisite). Inventory your kit. After L0.

Phase 1 · foundations

Foundations of intelligence

the bench

What an intelligence system actually is, what learning means, and the shape of the problem before any maths or models.

Conceptual shift: from "AI" as a single mysterious thing to AI as optimisation against an objective, shaped by signal and constraints. What you gain: a vocabulary and a mental model, not a toolkit. Why it sits here: you need a clear noun before any analysis can begin.

Lessons: 10 (L1–L10) · Synthesis: S1 · Calibration: C1 · Time: ~4 weeks · Overview: phase 1 page →

Lessons in this phase

L1. What is an intelligence system. A working definition. Inputs, internal state, outputs, learning signal. Why "AI" is a confusing umbrella over very different things.

L2. Pattern, prediction, compression. The Shannon and Kolmogorov view: a good model is a short description of regularities. Prediction and compression are the same problem.

L3. Generalisation. Training error vs test error. Why memorisation is easy and generalisation is hard. The bias-variance picture without the algebra.

L4. Emergence (first pass). When more becomes different. Phase transitions in scaling. Mechanistic framing, not mystical. Revisited at L75.

L5. Learning paradigms. Supervised, unsupervised, self-supervised, reinforcement. What signal each one uses and what it can and can't teach.

L6. Sequential decision making and reward. RL as a foundational paradigm. State, action, reward, policy. Exploration vs exploitation. Temporal credit assignment.

L7. Representation. What it means to "represent" something inside a system. Why representation choice often matters more than the algorithm running on top.

L8. Tokens. Quantising messy input into discrete units. Byte pair encoding. Why this isn't obvious and why it matters downstream.

L9. Embeddings (intuition). Meaning as direction in a high-dimensional space. Why this works at all. Formal treatment lives in Phase 4.

L10. What current AI can and can't do. The honest perimeter. Stable capability, brittle trick, confidently wrong output that looks like capability.

S1. Phase 1 synthesis · the whole bench. Walk all 10 stations end to end. Compress the systems vocabulary. Core laws callback. Bridge to Phase 2.

C1. Phase 1 calibration · mechanism check. 4–6 open-ended questions: compression recall, systems walkthrough. Gate to Phase 2.

Build: B1. Tabular Q-learning agent on gridworld (numpy). After L6. B2. Tokenizer explorer (numpy). After L8.
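L2's claim that prediction and compression are the same problem can be checked with a few lines of arithmetic. The sketch below is illustrative only: the message and both toy models are made up, and `code_length_bits` is a hypothetical helper name. It relies on the standard result that an ideal coder spends -log2(p) bits on a symbol its model gave probability p.

```python
import math

def code_length_bits(message, model):
    """Ideal code length: sum of -log2 p(symbol) under the given model."""
    return sum(-math.log2(model[symbol]) for symbol in message)

# Toy message over a 2-symbol alphabet (made up for illustration).
message = "aaab" * 4  # 12 'a' and 4 'b'

# Two hypothetical predictive models of the next symbol.
uniform_model = {"a": 0.5, "b": 0.5}    # knows nothing about the source
skewed_model = {"a": 0.75, "b": 0.25}   # matches the symbol frequencies

uniform_bits = code_length_bits(message, uniform_model)
skewed_bits = code_length_bits(message, skewed_model)

print(f"uniform model: {uniform_bits:.2f} bits")
print(f"skewed model:  {skewed_bits:.2f} bits")
```

The model that predicts the symbols better also describes the message in fewer bits, which is the sense in which a good model is a short description.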

Phase 2 · maths intuition

Mathematical & computational intuition

the whiteboard wall

The minimum maths to read the rest of the course without bluffing. Geometric and visual. No proofs, no exam questions.

Conceptual shift: from "AI is opaque maths" to AI as vectors and geometry we can sketch on a board. What you gain: a single durable toolkit. Why it sits here: every later phase depends on these primitives.

Lessons: 11 (L11–L21) · Synthesis: S2 · Calibration: C2 · Time: ~4 weeks · Overview: phase 2 page →

Lessons in this phase

L11. Vectors. Direction plus magnitude. The geometric picture before the algebra.

L12. Distance, similarity, and semantic geometry. Euclidean distance, dot product, cosine similarity. Where meaning becomes geometry. The single most-used operation family in modern AI.

L13. Dimensions, feature spaces, and representation capacity. What it means for vectors to live in 768 or 4096 dimensions. Learned features as the modern move. Separability via lift.

L14. Matrices and linear transforms. A matrix as a thing that bends space. Why neural networks are stacks of linear transforms with nonlinearities between them.

L15. Projections, subspaces, and information selection. A projection keeps some information and throws the rest away. Subspaces, lossy compression that preserves useful structure, and why selection runs through embeddings, attention, and feature extraction.

L16. Probability, uncertainty, and belief. Probability as measured uncertainty. Distributions, confidence vs certainty, calibration, and why every AI prediction is a statement about uncertainty.

L17. Distributions, sampling, and possible futures. A distribution is the landscape of all outcomes; sampling draws one. Temperature reshapes the landscape. Why an LLM predicts a distribution over tokens, not a single word.

L18. Entropy, uncertainty, and surprise. Entropy as the average surprise of a distribution: one number for how much uncertainty remains. Ties prediction to compression; cross-entropy is the standard training loss.

L19. Gradients and optimisation landscapes. Training as optimisation: loss is the height, the gradient is the slope, descent steps downhill. Minima, saddle points, learning rate, and why training large models is expensive.

L20. Parallelism. Serial vs parallel work; why matrix-heavy AI parallelises so well. CPU vs GPU, SIMD, data and model parallelism, throughput vs latency. The bridge from theory to scale.

L21. Compute scaling intuition. More compute helps, but with diminishing returns; compute, data, and model size grow together. Scaling as an empirical observation. The intuition before the formal scaling laws in Phase 5. Closes the wall.

S2. Phase 2 synthesis · the whole wall. Step back; read the wall as one picture. The five themes (representation, transformation, uncertainty, learning, scale) as a single toolkit. Bridge to Phase 3 silicon.

C2. Phase 2 calibration. Reasoning-based diagnostic across the five themes and the connections between them. Readiness, not grades. Gate to Phase 3.
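L17 says temperature reshapes the landscape and L18 puts one number on the remaining uncertainty. A minimal sketch of both together, using only the standard library. The logits are made-up scores for four candidate tokens, and `softmax` and `entropy_bits` are illustrative helper names, not part of any course code.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: the average surprise of the distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: top prob {max(probs):.3f}, entropy {entropy_bits(probs):.3f} bits")
```

Lower temperature sharpens the distribution and lowers its entropy; higher temperature flattens it toward the 2-bit maximum for four equally likely outcomes.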

Build: B3. Vector playground: embeddings, cosine similarity, t-SNE (numpy). After L12. B4. Gradient descent visualiser on a 2D loss surface (numpy). After L19.
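The L19 picture (loss as height, gradient as slope, descent as steps downhill) fits in a dozen lines. This is a sketch under stated assumptions, not the B4 build: the bowl-shaped loss is a hypothetical surface chosen so the gradient can be written by hand, and the start point and learning rates are arbitrary.

```python
def loss(x, y):
    """A hypothetical bowl-shaped loss surface with its minimum at (1, -2)."""
    return (x - 1.0) ** 2 + 3.0 * (y + 2.0) ** 2

def gradient(x, y):
    """Analytic gradient of the loss above."""
    return 2.0 * (x - 1.0), 6.0 * (y + 2.0)

def descend(x, y, learning_rate=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x_final, y_final = descend(x=4.0, y=3.0)
print(f"end point: ({x_final:.4f}, {y_final:.4f}), loss {loss(x_final, y_final):.2e}")

# Too large a step overshoots the steep y direction and diverges.
x_bad, y_bad = descend(x=4.0, y=3.0, learning_rate=0.4, steps=20)
print(f"with learning_rate=0.4: y ends at {y_bad:.1f}")
```

On this surface each step multiplies the distance to the minimum in y by (1 - 6 × learning_rate), so 0.1 shrinks it and 0.4 grows it. That is the learning-rate sensitivity the lesson refers to.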

Phase 3 · hardware

Hardware & systems

the server bay

The substrate. The reader cannot understand modern AI without understanding the silicon it runs on and the memory bottlenecks that shape every architecture decision.

Conceptual shift: from "compute is a budget line" to compute as the constraint that produced the entire field's shape. What you gain: the instinct to ask "where does this hit the memory wall." Why it sits here: Phase 4's architectures are downstream of this substrate; without it, they look like arbitrary choices.

Lessons: 12 (L22–L33) · Synthesis: S3 · Calibration: C3 · Time: ~5 weeks · Compute-spectrum lens: dominant here · Overview: phase 3 page →

Lessons in this phase

L22. The general-purpose CPU. Serial workhorse, deep pipelines, caches. What it's good at and where it falls down.

L23. Why GPUs exist. Throughput over latency. The fundamental trade that produced an entire industry.

L24. The GPU execution model. Cores, warps, lanes, SIMT.

L25. VRAM. Capacity, bandwidth, the memory wall. Why model size is often capped by VRAM, not compute.

L26. Tensor cores and matmul engines. Specialised silicon for matrix multiply. Why the field standardised on matmul-heavy architectures.

L27. Memory hierarchies. Registers, SRAM, HBM, DRAM, disk. Each level is larger and slower than the one above it, typically by an order of magnitude or more.

L28. The roofline model. Compute-bound vs memory-bound kernels. How to look at a workload and know which one matters.

L29. TPUs and other accelerators. Systolic arrays. What changes when you design silicon for one workload.

L30. NPUs and edge inference. AI on phones, laptops, embedded boards. Different constraints, different shape of hardware.

L31. Interconnects. NVLink, InfiniBand, PCIe. Bandwidth between chips matters as much as bandwidth inside them.

L32. Quantisation. FP32 → FP16 → BF16 → FP8 → int8 → int4. What you give up and why int4 still ships.

L33. Distributed compute patterns. Data, tensor, pipeline, expert parallel. The 4 ways to split a training or inference workload across chips.

S3. Phase 3 synthesis · the cold-aisle walk. Memory hierarchy as the spine. Roofline as the constraint surface. Matmul-shaped silicon as the response. Bridge to Phase 4.

C3. Phase 3 calibration. Weighted toward hardware constraint analysis: arithmetic intensity, precision damage, parallel strategy comparison. Gate to Phase 4.
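The roofline model in L28 reduces to one `min`: a kernel can go no faster than the chip's peak compute, and no faster than memory bandwidth times the kernel's arithmetic intensity. A minimal sketch, assuming a hypothetical accelerator with round numbers (100 TFLOP/s, 1 TB/s) and two illustrative intensities. None of these figures describe real hardware.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity_flops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)

# Hypothetical accelerator, round numbers chosen for illustration only.
PEAK = 100e12       # 100 TFLOP/s
BANDWIDTH = 1e12    # 1 TB/s

ridge = PEAK / BANDWIDTH  # intensity at which the two limits meet
print(f"ridge point: {ridge:.0f} FLOP/byte")

workloads = [
    # FP32 elementwise add: 1 FLOP per 12 bytes moved (two reads, one write).
    ("elementwise add", 1.0 / 12.0),
    # Assumed intensity for a large matmul; real values depend on tile sizes.
    ("large matmul", 400.0),
]

for name, intensity in workloads:
    perf = attainable_flops(PEAK, BANDWIDTH, intensity)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {perf / 1e12:.3f} TFLOP/s ({bound})")
```

Under these assumptions the elementwise kernel reaches well under 1% of peak however many cores the chip has, which is the memory wall in one number.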

Build: B5. Profile your compute and bandwidth (roofline) on your own machine. After L28. B6 (optional; prerequisite for the Edge AI Lab). TinyML classifier on an MCU. After L30 and L32. B7. Quantisation lab: int8 / int4, measure the quality drop. After L32.
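L25 (VRAM caps model size) and L32 (the precision ladder) meet in simple arithmetic: parameters times bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only. It ignores activations, the KV cache, and the scale factors that real quantisation schemes store alongside the weights.

```python
# Storage width of each format named in L32.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8, "int8": 8, "int4": 4}

def weight_footprint_gb(num_params, fmt):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

NUM_PARAMS = 7e9  # hypothetical 7-billion-parameter model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {weight_footprint_gb(NUM_PARAMS, fmt):5.1f} GB")
```

The same model shrinks from 28 GB at FP32 to 3.5 GB at int4, which is one reason int4 still ships despite the quality cost.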

Phase 4 · architectures

Neural architectures

the drafting table

The history of how nets got shaped, in order, with the constraints that drove each step. Perceptron → MLP → CNN → RNN → LSTM → attention → transformer → diffusion → MoE → multimodal → discriminative.

Conceptual shift: from architectures as a list of names to architectures as a causal chain of constraint responses. What you gain: the ability to read a new architecture as a response to a specific prior limit. Why it sits here: with the substrate and the maths in place, the chain becomes legible.

Lessons: 14 (L34–L47) · Synthesis: S4 · Calibration: C4 · Time: ~6 weeks

Lessons in this phase

L34. The perceptron. The original neuron. Linear separability, the XOR wall, the 1969 collapse and its lesson.

L35. MLPs and universal approximation. A stack of perceptrons with a nonlinearity can approximate any continuous function on a bounded domain to any accuracy, given enough hidden units. The catch is data and compute.

L36. Backpropagation. The chain rule applied at scale. Why it took until the 1980s to land and the 2010s to scale.

L37. Convolutional nets. Locality and weight sharing. Why CNNs dominated image tasks for a decade.

L38. Recurrent nets. Loops in the architecture, state across time.

L39. Vanishing gradients and LSTMs. What broke about RNNs and how gating mostly fixed it.

L40. Attention. Soft lookup over the past, learned weighting. The mechanism that unstuck long-sequence modelling.

L41. Multi-head attention. Different heads learn different relations.

L42. Positional encoding. The transformer doesn't know order by default. Sinusoidal, learned, RoPE, ALiBi.

L43. The transformer block. Attention + feed-forward + residual + norm. The whole architecture is stacked copies of this.

L44. Diffusion. A different idea entirely: gradually denoise.

L45. Mixture of experts. Conditional compute. Many small experts, route to a few per token.

L46. Multimodal architectures. CLIP-style alignment, vision encoders feeding LLMs, audio in.

L47. Discriminative architectures. Not everything generates. Classifiers, rankers, two-tower retrievers; the other half of the field.

S4. Phase 4 synthesis · all 14 sketches in order. Each architecture as a response to the previous architecture's limit, evaluated against the hardware available at the time.

C4. Phase 4 calibration. Weighted toward architecture analysis: identify limit response, compare under named budget, diagnose long-context failure. Gate to Phase 5.

Build: B8. MLP from scratch, backprop by hand (numpy). After L36. B9. Tiny CNN (PyTorch). After L37. B10. Attention from scratch (numpy). After L40. B11. Tiny transformer (PyTorch). After L43.
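L40 describes attention as a soft lookup with weighting. A minimal sketch of scaled dot-product attention for a single query, in plain Python rather than numpy. The keys, values, and query are hand-set toy vectors; in a trained model they would come from learned projections, which this example leaves out.

```python
import math

def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output

# Toy data, hand-set for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(o, 3) for o in output])
```

The query points along the first axis, so the two keys with a component on that axis share almost all the weight and the second key is nearly ignored. Every value contributes a little, so the lookup is soft rather than hard.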

Phase 5 · training

Training & scaling

the foundry

Where models get made. The loop, the data, the cost, and the post-training stack that takes a raw pretrained model to something useful.

Conceptual shift: from architecture-as-the-product to training-signal-as-the-product. What you gain: the ability to read a model's behaviour back to its objective. Why it sits here: trained models depend on architectures (P4) and hardware (P3); deployment (P6) depends on having something trained.

Lessons: 10 (L48–L57) · Synthesis: S5 · Calibration: C5 · Time: ~4 weeks

Lessons in this phase

L48. Datasets and corpora. Where the training data comes from, how it's filtered, why quality often beats algorithmic cleverness.

L49. Pretraining. The loop: forward pass, loss, backward pass, optimiser step. The honest cost in GPU-hours and dollars.

L50. The compute-data-parameters triangle. 3 axes you can scale. Why scaling them in lockstep matters.

L51. Scaling laws. Chinchilla as the canonical paper. What the laws predict, where they hold, where they break.

L52. Supervised fine-tuning. Take a pretrained model, show it the kind of completions you want.

L53. RLHF. Reward models, PPO, alignment-from-feedback. Second pass on RL (recall L6).

L54. DPO and lighter alternatives. Direct preference optimisation. The simpler-than-RLHF wave.

L55. Synthetic data and self-training. Models teaching models. Why this is more important now than 3 years ago.

L56. Distillation. Big model in, small model out, preserving most of the capability.

L57. Distributed training systems. FSDP, DeepSpeed, ZeRO. How a single logical model gets sharded across thousands of chips.

S5. Phase 5 synthesis · the production line. Raw data → pretraining → post-training → distributed orchestration. Triangle as the meta-shape; scaling laws as the curve. Bridge to Phase 6.

C5. Phase 5 calibration. Weighted toward deployment reasoning and tradeoffs: Chinchilla allocation, regression diagnosis, RLHF vs DPO under constraints. Gate to Phase 6.
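The "Chinchilla allocation" named in C5, and the lockstep scaling of L50, can be worked as back-of-envelope arithmetic. The sketch below assumes two widely quoted rules of thumb that the syllabus itself does not state: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. Both are approximations, and `compute_optimal_split` is a hypothetical helper name.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumed rule of thumb: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumed rule of thumb: D ≈ 20 * N

def compute_optimal_split(compute_flops):
    """Split a FLOP budget into parameters and tokens under the two assumptions."""
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

budget = 5.88e23  # FLOPs; an illustrative budget
n_params, n_tokens = compute_optimal_split(budget)
print(f"params: {n_params / 1e9:.0f}B, tokens: {n_tokens / 1e12:.1f}T")
```

With this budget the split is 70B parameters and 1.4T tokens, which is about the configuration reported for Chinchilla itself. Where the rules of thumb stop holding is the subject of L51.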

Build: B12. Pretrain your tiny transformer. After L49. B13. SFT on a small open model. After L52. B14. DPO toy run. After L54.
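The reason L54 calls DPO simpler than RLHF shows up in its loss, which needs no separate reward model, only log-probabilities from the policy and from a frozen reference model. A minimal sketch for a single preference pair. The log-probability values and `beta=0.1` are made up for illustration, and this is not the B14 build.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of completions."""
    # How much more the policy likes each completion than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)) written as log(1 + exp(-logit)).
    return math.log1p(math.exp(-logit))

# Made-up sequence log-probabilities for one prompt.
at_start = dpo_loss(-12.0, -11.0, ref_chosen_logp=-12.0, ref_rejected_logp=-11.0)
improved = dpo_loss(-10.0, -13.0, ref_chosen_logp=-12.0, ref_rejected_logp=-11.0)

print(f"policy equals reference:   loss {at_start:.4f}")
print(f"policy favours the chosen: loss {improved:.4f}")
```

When the policy matches the reference the loss is log 2. It falls as the policy raises the chosen completion and lowers the rejected one relative to the reference.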

Phase 6 · deployment

AI engineering & deployment

the lab bench

With the substrate, architectures, and training in hand, this phase builds production systems with them. APIs, retrieval, classical production ML, agents, evaluation, inference stacks, private deployment.

Conceptual shift: from "the LLM is the system" to "the LLM is one component in a hybrid system." What you gain: the ability to design production AI under named cost, latency, and data-egress constraints. Why it sits here: deployment depends on every earlier phase; nothing meaningful can be deployed without them.

Lessons: 10 (L58–L67) · Synthesis: S6 · Calibration: C6 · Time: ~4 weeks · Compute-spectrum lens: frequent (tier 2 + tier 3)

Lessons in this phase

L58. The API call. What's actually crossing the wire. Streaming, retries, idempotency.

L59. Embeddings in practice. Cosine similarity, indexing, dimensionality choices. Where embeddings break.

L60. RAG end-to-end. Chunk, embed, retrieve, rerank, generate. Where each step fails and what to instrument.

L61. Classical ML in production. Most production ML isn't generative. Recommenders, two-tower ranking, XGBoost on tabular. A/B testing as a discipline.

L62. Tool use and function calling. The schema, the loop, the failure modes.

L63. Agents. Plan, act, observe, repeat. Why most agents are bad.

L64. MCP and protocols. The protocol layer for tool use.

L65. Evaluation. Eval sets, judge models, red teaming.

L66. Inference engines. KV cache, batching, speculative decoding, paged attention.

L67. Private and on-prem AI. When you can't or won't use a hosted API. Air-gapped is possible.

S6. Phase 6 synthesis · the wiring rig. Every connection traced. Hybrid systems are the production reality. Bridge to Phase 7A.

C6. Phase 6 calibration. Deployment reasoning, failure diagnosis, architecture analysis. Gate to Phase 7A.
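The retrieve step of L60 rests on the cosine similarity of L59. A minimal sketch of brute-force top-k retrieval over a tiny in-memory index. The three-dimensional vectors and document ids are invented for illustration, and a real system would get its vectors from an embedding model and usually from an approximate index.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors after length normalisation."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Score every document against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical index: document id -> embedding (made-up 3-d vectors).
index = {
    "gpu-memory": [0.9, 0.1, 0.0],
    "tokenizers": [0.0, 0.2, 0.9],
    "kv-cache": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, index)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

Scoring every document is linear in corpus size, which is fine for a sketch and is the cost that indexing choices in L59 exist to avoid.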

Build: B15. Embedding search over a local corpus (numpy index). After L59. B16. Local inference rig (llama.cpp). After L66. B17 (optional; prerequisite for the AI Server Lab). Distributed inference across 2 machines. After L66 and L67.
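The KV cache in L66 is often what fills VRAM at inference time, and its size is a product of a few model and workload numbers. The sketch below uses a hypothetical decoder-only configuration (32 layers, 32 key/value heads, head dimension 128, FP16 cache). None of these figures come from the syllabus or describe a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and cached token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape, for illustration only.
config = dict(layers=32, kv_heads=32, head_dim=128)

single = kv_cache_bytes(seq_len=4096, batch=1, **config)
batched = kv_cache_bytes(seq_len=4096, batch=16, **config)

GIB = 1024 ** 3
print(f"1 sequence of 4096 tokens:   {single / GIB:.1f} GiB")
print(f"16 sequences of 4096 tokens: {batched / GIB:.1f} GiB")
```

The cache grows linearly with both sequence length and batch size, so batching for throughput competes with the weights for the same memory.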

Phase 7A · research literacy

Reading the frontier

the stairs

Before you read the frontier, you need to know how to read research. 3 lessons. These come deliberately late because the prior phases supply the technical material the reader applies the skill to.

Conceptual shift: from passive consumer of press releases to active reader of primary sources. What you gain: the 4 questions to ask of any "new SOTA" claim. Why it sits here: the skill needs technical content (P1–P6) to be applied to.

Lessons: 3 (L68–L70) · Synthesis: S7A · Calibration: C7A (meta-check against a real paper) · Time: ~1.5 weeks

Lessons in this phase

L68. Reading an AI paper. Anatomy of an ML paper. How to identify the load-bearing claim.

L69. Benchmarks and how they lie. Saturation, contamination, leakage, prompt-engineering gaming.

L70. Reading scaling graphs critically. Log-log plots, axis choices, the Chinchilla replot saga.

S7A. Phase 7A synthesis · looking down the staircase. Load-bearing claim, benchmark honesty, graph honesty. The 4 questions. Bridge to Phase 7B.

C7A. Phase 7A calibration. Real paper / benchmark / scaling graph chosen at calibration time. 4 critical-reading questions. Gate to Phase 7B.
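A useful habit for L70 is to redo the fit behind a scaling graph yourself. A straight line on log-log axes means a power law, y = a · x^b, and the slope of that line is the exponent b. The sketch below fits one by ordinary least squares on the logs. The data points are synthetic, generated from a known power law so the fit can be checked, and are not taken from any paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (exponent, coefficient)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    variance = sum((a - mean_x) ** 2 for a in lx)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)

# Synthetic points from loss = 10 * compute ** -0.05 (made up for illustration).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

exponent, coefficient = fit_power_law(compute, loss)
print(f"fitted exponent: {exponent:.4f}, coefficient: {coefficient:.4f}")
```

The fit recovers the exponent because the data is an exact power law. With real points the questions to ask are how many of them support the line, over what range, and how far past that range the plot extrapolates.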

Phase 7B · frontier

Frontier intelligence

the roof

Where the field is reaching now. Cautious, specific, honest about what's hype and what's solid. Date-stamped: lessons here have a 1–2 year half-life and explicitly ask the reader to update from primary sources.

Conceptual shift: from "current models = AI" to current models as the visible surface of a research programme with multiple open frontiers. What you gain: the equipment to follow primary sources for the next 1–2 years before these lessons need a refresh. Why it sits here: reading the frontier needs P7A literacy and the substrate of everything before.

Lessons: 9 (L71–L79) · Synthesis: S7B · Calibration: C7B (closing meta-check) · Time: ~3.5 weeks · Half-life: 1–2 years; date-stamped

Lessons in this phase

L71. Reasoning models. RL on chains of thought. Third pass on RL (recall L6, L53).

L72. World models. Predicting future states of the world, not just next tokens.

L73. Embodiment and robotics foundation models. RT-2, Helix, and the current state of play. RL meets the physical world.

L74. Self-modeling and introspection. Can a model represent its own state and use that representation.

L75. Emergence revisited (mechanism, not magic). Second pass on L4. Capability vs measurement vs representation emergence.

L76. Theories of consciousness. IIT, GWT, higher-order theories. What they predict and where they're unfalsifiable.

L77. Alignment and interpretability. Scalable oversight, mech interp, faithful chain of thought.

L78. AGI hypotheses and timelines. The honest spread of views.

L79. Future hardware. Photonics, neuromorphic, optical interconnects, beyond-CMOS.

S7B. Phase 7B synthesis · the full horizon survey. All 5 core laws show up here at once. Course closes by acknowledging what's open.

C7B. Phase 7B calibration · closing meta-check. 4–6 questions spanning the whole course. The last gate.