L34 · The perceptron: The original neuron. Linear separability, the XOR wall, the 1969 collapse and its lesson.
L35 · MLPs and universal approximation: A stack of perceptrons with a nonlinearity can approximate any continuous function on a bounded domain, given enough hidden units. The catch is data and compute.
L36 · Backpropagation: The chain rule applied at scale. Why it took until the 1980s to land and the 2010s to scale.
L37 · Convolutional nets: Locality and weight sharing. Why CNNs dominated image tasks for a decade.
L38 · Recurrent nets: Loops in the architecture, state across time.
L39 · Vanishing gradients and LSTMs: What broke about RNNs and how gating mostly fixed it.
L40 · Attention: Soft lookup over the past, learned weighting. The mechanism that unstuck long-sequence modelling.
L41 · Multi-head attention: Different heads learn different relations.
L42 · Positional encoding: The transformer doesn't know order by default. Sinusoidal, learned, RoPE, ALiBi.
L43 · The transformer block: Attention + feed-forward + residual + norm. The whole architecture is stacked copies of this.
L44 · Diffusion: A different idea entirely: gradually denoise.
L45 · Mixture of experts: Conditional compute. Many small experts, route to a few per token.
L46 · Multimodal architectures: CLIP-style alignment, vision encoders feeding LLMs, audio in.
L47 · Discriminative architectures: Not everything generates. Classifiers, rankers, two-tower retrievers; the other half of the field.
S4 · Phase 4 synthesis · all 14 sketches in order: Each architecture as a response to the previous architecture's limit, evaluated against the hardware available at the time.
C4 · Phase 4 calibration: Architecture analysis weighted: identify limit response, compare under named budget, diagnose long-context failure. Gate to Phase 5.