# Constitution

The governing document for this course. The other files (syllabus, palace, dependencies, handoff) describe what the course *contains*. This document describes what the course *is*, and what it must refuse to become.

If a future lesson, refinement, or external pressure conflicts with this document, this document wins. Drift is the failure mode that kills courses like this. The Constitution is the fence.

---

## 1. What this course is

A systems-level account of modern AI for technical readers. Silicon to cognition. Built for someone who already knows how complex systems work in another domain and wants to know how this one works, from the ground.

The arc is causal. Hardware shapes architecture. Architecture shapes training. Training shapes deployment. Deployment shapes engineering practice. The reader is meant to leave with a model of *why* the field looks the way it does, not just *what* its current artefacts are called.

The course assumes the reader can already read carefully, follow chains of reasoning, hold abstractions in their head, and tolerate honest "we don't know" answers.

## 2. What this course is not

Not a prompt engineering course. Not an LLM application tutorial. Not a coding bootcamp. Not a survey of the AI industry. Not a vibes-based introduction. Not a checklist of tools.

Not optimised for speed of completion. The course aims for things that stick, not things that finish.

Not a place where transformers are the destination. Transformers are a stop on the route, late in the journey. The course covers what came before, what runs in parallel, what may come next, and what runs in production alongside generative models.

Not vendor-aligned. The course teaches mechanisms; vendors are examples.

## 3. Why systems-level understanding matters

The field's headline behaviour (a model that talks, an image generator that paints) is downstream of about 8 decisions: what gets tokenised, what gets multiplied, what gets stored, what gets parallelised, what gets optimised, what gets fine-tuned, what gets served, what gets monitored.

If you understand those 8 decisions and how they push on each other, the headlines stop being surprising. New models, new capabilities, and new failures all become predictable as variations on the same machinery.

If you only understand the headlines, every new release is a fresh confusion. You're permanently in catch-up.

This course is for the first kind of understanding.

## 4. Why hardware sits underneath everything

Most courses treat hardware as a footnote. Models live in the cloud, compute is a budget line, GPUs are someone else's problem.

That framing is wrong. Modern AI exists because matrix multiplies are fast on specific silicon. Architectures evolved to match that silicon. Training methods evolved to feed it. Inference methods evolved to keep it busy. The shape of the field is downstream of memory bandwidth and tensor core throughput in a way most reading material understates.

A reader who finishes this course should be able to look at a new architecture and ask: where does this hit the memory wall, what fraction of theoretical FLOPS does it leave on the table, why is it shaped to fit the hardware it's running on. That instinct is rare and high-value.
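The first two questions can be made concrete with a roofline-style estimate. The sketch below is illustrative only: the peak throughput and bandwidth figures describe a hypothetical accelerator, not any real part, and the helper names `gemm_intensity` and `attainable_flops` are invented for this example. It also assumes each operand crosses the memory bus exactly once, which real kernels only approximate.

```python
def gemm_intensity(n: int, batch: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (batch x n) activation times an (n x n) weight matrix."""
    flops = 2 * n * n * batch  # one multiply and one add per weight, per row
    # Assumption: weights, inputs and outputs each move through memory once.
    bytes_moved = bytes_per_elem * (n * n + 2 * n * batch)
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator (placeholder numbers, not a real device).
PEAK_FLOPS = 100e12    # 100 TFLOP/s
MEM_BANDWIDTH = 1e12   # 1 TB/s

fractions = {}
for batch in (1, 512):
    ai = gemm_intensity(4096, batch)
    fractions[batch] = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH) / PEAK_FLOPS
    print(f"batch={batch:4d}  intensity={ai:7.1f} FLOP/byte  "
          f"ceiling={fractions[batch]:.1%} of peak")
```

Under these assumed numbers, the batch-1 case is capped at roughly 1% of peak because every weight is fetched to do two FLOPs of work, while the batch-512 case reuses each weight enough to reach the compute ceiling. The same layer lands on opposite sides of the memory wall depending only on how it is fed.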

## 5. Why history matters

The current architectures are the survivors of a chain of attempts. Perceptron, MLP, CNN, RNN, LSTM, attention, transformer, MoE, diffusion. Each one solved problems the previous one hit, and inherited limitations the next one had to fix.

If you skip the history, you have no way to evaluate the next architecture when it lands. You take whatever is current as given. You have no taste.

History also teaches honest humility. The field has had several "this is the answer" moments. Many turned out to be local maxima. Reading the history teaches you to hold the current frame loosely.

## 6. Why intuition precedes formalism

Math without intuition produces a learner who can reproduce derivations but can't generalise. Intuition without math produces a learner who has vibes but can't compute.

Both fail. The course does intuition first, formalism second, and only formalism that earns its place.

Geometric pictures before algebra. Verbal explanations before symbols. Worked numerical examples before closed-form expressions. By the time a formula appears, the reader already knows what it has to do; the formula is then a compression of that understanding.

The wrong way: derive the softmax from the partition function in chapter 1. The right way: explain what softmax does to a vector of scores, show a worked example, then write it down. The reader meets the symbol last, not first.
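A minimal sketch of that ordering, in plain Python. The scores are arbitrary example values, and subtracting the maximum before exponentiating is a common numerical-stability habit rather than something this document prescribes. The symbolic form appears only as the final comment, after the worked numbers.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into positive weights that sum to 1."""
    top = max(scores)
    # Shifting by the maximum leaves the result unchanged but avoids overflow.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

# Adding the same constant to every score changes nothing: only gaps matter.
shifted_probs = softmax([s + 100.0 for s in scores])

# Only now, the formula: softmax(s)_i = exp(s_i) / sum_j exp(s_j)
```

The worked numbers show the behaviour first: the largest score takes most of the weight, the ordering is preserved, and only the gaps between scores matter. The formula in the last comment then compresses what the example has already shown.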

## 7. Why build tracks matter

Concepts that only live in reading are easy to misunderstand without noticing. Concepts you've built with your hands are harder to misunderstand because the gaps reveal themselves on contact with code.

The build track is not about producing portfolio pieces. It is about exposing your own gaps to yourself. If a build takes 2× the estimate, that is information: there is a concept you do not actually hold, despite feeling like you do. Stop, re-read, continue.

The build track is also a hedge against framework lock-in. Numpy implementations survive when the frameworks change. The skill of writing backprop by hand is permanent. The skill of calling `model.fit()` is not.
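As an illustration of what "backprop by hand" means at the smallest scale, here is a sketch of a one-hidden-layer network with the chain rule written out and checked against a finite difference. It uses plain Python lists in place of numpy arrays to stay dependency-free; the network shape, the parameter values, and the function names are all invented for this example and are not taken from the build track.

```python
import math


def forward(x, w1, b1, w2, b2):
    """One tanh hidden layer, scalar linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = sum(w * hj for w, hj in zip(w2, h)) + b2
    return y, h


def loss_and_grads(x, target, w1, b1, w2, b2):
    """Squared-error loss and its gradients, derived with the chain rule."""
    y, h = forward(x, w1, b1, w2, b2)
    loss = 0.5 * (y - target) ** 2
    dy = y - target                                   # dL/dy
    dw2 = [dy * hj for hj in h]                       # output weights
    db2 = dy
    dh = [dy * w for w in w2]                         # back through output layer
    dz = [dhj * (1.0 - hj * hj) for dhj, hj in zip(dh, h)]  # tanh' = 1 - tanh^2
    dw1 = [[dzj * xi for xi in x] for dzj in dz]      # hidden weights
    db1 = dz
    return loss, dw1, db1, dw2, db2


# Arbitrary example values.
x, target = [0.5, -1.0], 1.0
w1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
w2, b2 = [0.7, -0.5], 0.05

loss, dw1, db1, dw2, db2 = loss_and_grads(x, target, w1, b1, w2, b2)
analytic = dw1[0][1]

# Gradient check: nudge one weight both ways and compare slopes.
eps = 1e-6
w1_plus = [row[:] for row in w1]
w1_plus[0][1] += eps
w1_minus = [row[:] for row in w1]
w1_minus[0][1] -= eps
numeric = (loss_and_grads(x, target, w1_plus, b1, w2, b2)[0]
           - loss_and_grads(x, target, w1_minus, b1, w2, b2)[0]) / (2 * eps)

print(f"analytic={analytic:.8f}  numeric={numeric:.8f}")
```

The gradient check is the part that exposes gaps: if a term in the chain rule is wrong or missing, the two numbers disagree immediately.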

## 8. Why vendor-neutrality matters

Specific vendors will rise and fall on a 1-3 year cycle. Specific models will be superseded yearly. Specific APIs will deprecate.

A course tied to current vendors dies with them. A course tied to mechanisms outlives the entire current generation.

Vendors appear as examples. "An inference engine such as llama.cpp" rather than "the llama.cpp inference engine." The reader leaves with categories, not allegiances.

The course is also explicitly aerospace-friendly: a reader in a regulated environment should be able to apply most of it on-prem, air-gapped, on hardware they own, with no cloud dependency.

## 9. Why research literacy matters

The field changes faster than any textbook can keep up. The course teaches up to 2025-2026 frontier topics. By the time someone is 18 months into reading the course, the frontier will have moved.

The defence is not constant rewrites of the frontier section. The defence is teaching the reader to read the new papers themselves: to spot benchmark gaming, to read scaling graphs critically, to distinguish marketing from mechanism, to ask the right 4 questions of every "we set a new SOTA" claim.

A reader who can read research independently is durable. A reader who cannot will be dependent on whoever summarises the field for them, with all the risks that entails.

## 10. Why emergence is treated carefully

Emergence is real and important. Capabilities sometimes appear at thresholds rather than scaling linearly. Internal representations form that nobody programmed. Specialist behaviour shows up in mixture-of-experts routing.

Emergence is also abused. It is the most popular vector for mysticism, anthropomorphism, and breathless press coverage. "The model just understood" is almost never a useful explanation.

The course treats emergence mechanistically: optimisation over large parameter counts produces nonlinear capability thresholds, distributed representations, and unprogrammed structure. The phenomenon is real; the magical-thinking framing of it is not.

Emergence threads through the course: L4 (first pass), L43 (transformer block), L45 (MoE), L51 (scaling laws), L71 (reasoning), L72 (world models), L75 (revisited with mechanism). Each appearance reinforces the mechanistic frame and pushes back on the mystical one.

## 11. The anti-mysticism rule

A lesson never says "the model just understands." A lesson never says "it learns somehow." A lesson never invokes intelligence or consciousness as an explanation for behaviour we can otherwise explain.

If a behaviour is explained by gradients flowing through specific layers, the lesson explains it that way. If a behaviour is unexplained, the lesson says so, names it as open, and references current interpretability work.

The reader leaves with a habit: when faced with a confident anthropomorphic claim about an AI system, ask "what's the mechanism?" If no mechanism is offered, the claim is folk theory, not engineering.

## 12. The role of wonder

The anti-mysticism rule is a fence around explanation, not around feeling. The course is for people who showed up because something pulled them: science fiction, cybernetics, robotics, the speculative computing tradition, the engineering imagination that asks what minds could be made of. That pull is the fuel. The course preserves it.

The principle is simple: **wonder is encouraged. Mystification is not.**

Wonder is the response to an actual mechanism that turns out to be more interesting than expected. The first time you trace gradients back through 12 layers of a transformer and see the chain rule doing the whole job, that should feel astonishing. The first time you watch attention weights resolve onto the right token in a long sequence, that should feel astonishing. These are real moments and the course doesn't flatten them.

Mystification is a substitute for explanation. It is the move where "the model just learned to..." stops the sentence because the speaker doesn't know the mechanism and would rather not say so. The course rejects that move every time.

The distinction in practice:
- **Motivation vs explanation.** A sci-fi reference can frame why a topic matters ("Asimov imagined robots that planned; modern policy networks are a small piece of that vision"). A sci-fi reference cannot do the work of explaining what a policy network is.
- **Inspiration vs evidence.** A vision of long-term trajectories ("if optical compute scales, the interconnect picture changes completely") is welcome as a framing for why a substrate matters. It is not evidence that optical compute will scale.
- **Speculation flagged as speculation.** When a lesson reaches forward, it says so explicitly. "If" not "when." "A bet several labs are placing" not "the next paradigm."

The real mechanisms of modern AI are already astonishing. Gradients flowing through 70 billion parameters; attention patterns resolving without supervision onto syntactic structure; mixture-of-experts routing producing specialised behaviour from a uniform training signal; diffusion turning noise into images by learning to take small denoising steps. None of these need decoration to be remarkable. The course's job is to put the reader in close enough contact with these mechanisms that the astonishment happens on contact with reality, not on contact with rhetoric.

Sci-fi references, futurist framings, and inspirational analogies are allowed in lessons where they illuminate engineering constraints, systems design, or long-term trajectories. They are forbidden where they substitute for a mechanism or sneak in unsupported capability claims. The reader should leave a lesson with both: a clear mechanism, and an honest sense of where this might go. Neither alone is enough.

The discipline this asks of the author: do not be afraid of wonder. Be careful with mystification. The two are not the same, and a course that confuses them either becomes dry (refusing all imagination) or becomes sloppy (smuggling in folk theory dressed as insight). Neither is what we are building.

## 13. The anti-tool-worship rule

A lesson never says "use PyTorch because PyTorch." It says "this kind of problem needs autograd, and PyTorch is the current implementation that gives you autograd ergonomically."

Tools are means. They are chosen for a reason that should always be named. A reader who finishes the course can swap PyTorch for JAX, llama.cpp for vLLM, Pinecone for FAISS, and understand the trade-offs.

The course explicitly rejects three patterns:
- "X is the standard, learn X."
- "Y is the best, use Y."
- "Z is the future, switch to Z."

All three foreclose understanding. The substitute frame: name what the tool does, name what would replace it, name the trade-off.

## 14. The anti-framework-lock rule

Build-track milestones reach for numpy before frameworks. Frameworks enter only when autograd or ergonomics genuinely earn their keep. Hugging Face `Trainer` magic is forbidden in builds; loops are written by hand.

The point is not asceticism. The point is that a reader who has written backprop in numpy, attention in numpy, and a training loop by hand has built skills that survive any framework's deprecation. They can read a new framework's source code and understand it. They are not captive.
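For a sense of scale, a hand-written training loop can be very short. The sketch below fits a line with full-batch gradient descent in plain Python; the data, learning rate, and epoch count are arbitrary choices for illustration, and a numpy version would replace the list comprehensions with array operations.

```python
# Toy data drawn from y = 2x + 1 (illustrative values only).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate (arbitrary choice)
losses = []

for epoch in range(500):
    # Forward pass: predictions and mean squared error.
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    losses.append(sum(e * e for e in errors) / len(xs))

    # Backward pass: gradients of the mean squared error, derived by hand.
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)

    # Update step: plain gradient descent.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.4f}  b={b:.4f}  final loss={losses[-1]:.2e}")
```

Every framework training loop is an elaboration of these three steps: forward, backward, update. Having written them once makes the elaborations legible.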

The course is built for someone who may be doing AI work in 10 years, on hardware and stacks that don't exist yet. Framework-locked skills don't transfer that far.

## 15. The role of the memory palace

71 lessons (78 in v2.1) are too many to hold without structure. The memory palace turns the lesson sequence into spatial memory.

The workshop metaphor was chosen for this reader specifically: hardware engineers see workshops without effort. The route is muscle memory. Anchoring abstract AI concepts to spatial locations bypasses the limitations of pure semantic memory.

The palace is not a gimmick. It is a load-bearing piece of the retention design. Walking the route weekly is the consolidation step that converts recent memory into structural memory.

When you can walk all 79 stations cold, naming the concept and its key claim at each, the course has succeeded.

## 16. The role of retrieval and interleaving

Reading a lesson once and feeling like you understand it is a known failure mode. The brain confuses recognition with recall.

Each lesson ends with retrieval practice: open-ended questions, answered without looking, then checked. The third question always pulls back to an earlier lesson, so the reader cannot let prior concepts go stale.

Each lesson also ships with 10-15 spaced-repetition flashcards. Atomic facts, scheduled by an SM-2-style algorithm. Daily review.
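For readers unfamiliar with SM-2, the sketch below shows one common rendering of its update rule. Details such as rounding and whether the ease factor is updated before or after the interval vary between implementations, and nothing here is claimed to match the scheduler the course actually uses; the `Card` class and `review` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Card:
    ease: float = 2.5        # ease factor, floored at 1.3
    repetitions: int = 0     # consecutive successful reviews
    interval_days: int = 0   # days until the next review


def review(card: Card, quality: int) -> Card:
    """Apply one review with a self-graded quality from 0 (blank) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    if quality >= 3:
        # Successful recall: the interval grows.
        if card.repetitions == 0:
            interval = 1
        elif card.repetitions == 1:
            interval = 6
        else:
            interval = round(card.interval_days * card.ease)
        repetitions = card.repetitions + 1
    else:
        # Failed recall: start the card over.
        repetitions, interval = 0, 1

    miss = 5 - quality
    ease = card.ease + (0.1 - miss * (0.08 + miss * 0.02))
    return Card(max(1.3, ease), repetitions, interval)


card = Card()
intervals = []
for quality in (5, 5, 5, 2):
    card = review(card, quality)
    intervals.append(card.interval_days)

print(intervals)  # [1, 6, 16, 1]
```

Three clean recalls push the card out to 16 days; one failure brings it back to tomorrow and lowers its ease, so cards the reader finds hard come round more often.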

Together: retrieval forces effort, interleaving distributes the cognitive load, spaced repetition ensures the facts survive months. These are not optional accessories. They are the reason the course can claim to produce understanding that sticks.

A lesson without a working retrieval section is not finished.

## 17. The desired learner outcome

A learner who completes the course should be able to:

- Look at any modern AI system and identify which hardware constraints shaped it.
- Read a frontier ML paper, identify the load-bearing claim, and tell whether the experiments support it.
- Sketch a transformer from memory and explain each component's purpose and cost.
- Implement backprop, attention, and a small transformer from scratch.
- Choose, justify, and operate a local inference rig with no cloud dependency.
- Identify whether a "production AI system" is generative, discriminative, or a stack of both.
- Apply RL framing to sequential decision problems, and recognise where it does and does not fit.
- Distinguish capability emergence from measurement emergence from anthropomorphic projection.
- Survive 5 years of vendor churn without losing the ability to reason about the field.

The course succeeds if these are durable abilities, not memorised facts.

## 18. Rules future lesson authors must obey

When generating a lesson (whether drafted by ChatGPT, refined by Claude, or written by hand), the following rules are not negotiable.

1. **Lead with why.** Every lesson opens with the constraint or problem that produced its topic, not the topic itself.
2. **Intuition before formalism.** Geometric pictures, worked examples, and verbal explanations precede any symbolic math.
3. **Hardware grounding where relevant.** When the topic depends on compute, memory, or bandwidth constraints, name them.
4. **History where it teaches.** When current methods are survivors of a chain, name the previous attempts and what they failed to solve.
5. **No mysticism.** No "the model just understands." Every claimed behaviour gets a mechanism or an honest "open question."
6. **No vendor scaffolding.** Vendors appear as examples. The lesson teaches the category.
7. **No banned phrases.** See `handoff-format.md` for the list. Negative-parallelism reframes are forbidden.
8. **Retrieval is mandatory.** 3 open-ended questions, one interleaved to an earlier lesson per the dependency graph.
9. **Flashcards are atomic.** 10-15 per lesson, one fact per card, testable.
10. **Diagrams teach.** Every diagram in a lesson must do explanatory work. Decorative figures are removed.
11. **Through-lines named.** When a lesson touches an established through-line (emergence, RL, hardware, history), the connection is made explicit, not left implicit.
12. **Date-stamp the frontier.** Phase 7B lessons carry an explicit "as of [year]" notice in the lesson body.
13. **Speculative framing may motivate, not explain.** Sci-fi references, futurist analogies, and inspirational framings are welcome where they illuminate engineering trade-offs or long-term trajectories. They never substitute for a mechanism or a causal account. When a lesson reaches forward, it says so explicitly: "if," not "when."

A lesson that violates any of these is unfinished. It does not ship.

## 19. The full compute spectrum

AI is not only giant cloud models. The mechanisms taught in this course (representation, optimisation, geometry, scaling) are the same across the full compute spectrum: microcontrollers, embedded systems, edge devices, laptops, workstations, gaming GPUs, server clusters, hyperscale data centres. What changes is the constraint set.

The course teaches AI as a constraint-aware engineering discipline. The recurring question, applied across phases, is *how does this change under severe constraints?* That question lands differently on a microcontroller, a phone, a workstation, and a data centre, and the lesson should make the difference visible where it earns its place.

Specifically:

- Hardware lessons (Phase 3) cover the spectrum's tiers explicitly. CPU, GPU, TPU, NPU and edge inference. Quantisation as the lens that makes the spectrum traversable.
- Architecture and training lessons (Phases 4-5) carry callouts where a topic has a meaningful variant at another scale (CNN at the edge, distilled transformer, MoE in distributed compute, hyperscale pretraining).
- Deployment lessons (Phase 6) live mostly at tier 2 and tier 3 (consumer GPU and self-hosted server), with edge inference and on-prem treated explicitly.
- Synthesis lessons run the spectrum walk: how does this phase's content land at tier 0, tier 2, tier 4?
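To make "quantisation as the lens" concrete, here is a minimal sketch of symmetric per-tensor int8 quantisation in plain Python. It is one simple scheme among many, chosen for brevity; the weight values and function names are invented for this example.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # guard against all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # illustrative values
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [42, -127, 0, 90, -55]
print(f"scale={scale:.4f}  max error={max_error:.4f}")
```

Each weight now occupies 1 byte instead of the 4 bytes of a 32-bit float, at the cost of a rounding error of at most half the scale. Small weights suffer most: here 0.003 rounds to zero. That trade between memory footprint and precision is what moves a model from one tier of the spectrum to another.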

This is the same anti-vendor principle from section 8, applied at the substrate level. The course is for someone who may be doing AI work on hardware that hasn't been invented yet. The full-spectrum lens prepares for that.

## 20. Retention engineering as architecture

The course is engineered for long-term retention, not short-term exposure. The memory palace, retrieval practice, glossary tooltips, interleaving, synthesis lessons, calibration assessments, recurring core laws, contrast pairs, and progressive diagram evolution are all load-bearing pieces of one architecture.

This is the explicit commitment: a lesson that the reader will forget in 3 months is not a finished lesson. The retention apparatus is not garnish around the teaching. It is part of the teaching.

The 5 recurring core laws are:

- Representation shapes computation.
- Optimisation shapes capability.
- Hardware shapes architecture.
- Geometry enables generalisation.
- Constraints shape systems.

These get unpacked where they first land and called back in every subsequent synthesis lesson. They're the load-bearing connections the course is built around.

Specifically:

- Synthesis lessons (S1-S7B) compress each phase into its connections. No new mechanisms.
- Calibration assessments (C1-C7B) test mechanism rather than trivia. The reader gates their own progress through them.
- Mechanism / compression boxes surface load-bearing causal chains inline where a paragraph hides them.
- Contrast pairs (interpolation vs extrapolation, memorisation vs generalisation, dense vs sparse, etc.) get explicit treatment where the contrast carries explanatory load.
- Diagrams evolve: the L1 system loop is the seed, the transformer block is the same loop instantiated, the training loop is the same loop with feedback to parameters.

A lesson that ignores the retention architecture is structurally weaker even if its content is correct.

---

## A note on revisions

This document is durable, not immutable. If a future principle emerges that genuinely belongs here, it can be added by careful amendment, with the change history noted at the bottom.

Drift is the failure mode. The hardest thing about a long curriculum is not building it. It is keeping it from softening over time as small compromises accumulate. The Constitution is the explicit anti-drift safeguard.

Read it before you write a lesson. Re-read it when a lesson feels slightly off. Push back when a draft starts to violate it, even when the violation is small.

---

*Anchored to v3. Last reviewed: 2026. v3 added sections 19 (full compute spectrum) and 20 (retention engineering as architecture); see `refinements-v3.md` for the amendment rationale.*